
Parallel Mass-File Processing

Original post is here: eklausmeier.goip.de/blog/2021/10-02-parallel-mass-file-processing.


Task at hand: Process ca. 400,000 files. In our case each file needed to be converted from EBCDIC to ASCII.

Obviously, you could do this sequentially. But on a multiprocessor machine you should make use of all the processing power available. The chosen approach is as follows:

  1. Generate a list of all files to be processed, i.e., a file containing all filenames, henceforth called fl. For example: find . -mindepth 2 > fl
  2. Split fl into 32 parts ("chunks"): split -nl/32 fl fl.
  3. Process each chunk in parallel: for i in fl.??; do processEachChunk $i & done (see the combined sketch after this list)
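Put together, the three steps could be driven by a small script like the following. This is a sketch: the chunk count of 32 is an assumption that should roughly match the number of cores, and wait blocks until all background jobs have finished.

#!/bin/bash
# Sketch: run all three steps; 32 chunks is an assumption, adjust to your core count
find . -mindepth 2 > fl      # 1. list all files to be processed
split -nl/32 fl fl.          # 2. split the list into 32 chunks fl.aa, fl.ab, ...
for i in fl.??; do
    processEachChunk "$i" &  # 3. one background job per chunk
done
wait                         # block until every chunk has been converted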

In our case each file is processed as below, i.e., processEachChunk, which receives its chunk file as the first argument, looks like:

T=/tmp/mvscvtInp.$$              # temporary file, unique per process
while read fn; do
    #echo "$fn"
    if [ -f "$fn" ]; then
        # move the original aside, then convert it back into place
        mv "$fn" "$T"  ||  echo "Error: fn=|$fn|, T=|$T|"
        mvscvt -a < "$T" > "$fn"
    fi
done < "$1"                      # filenames come from the chunk file passed as argument

Here mvscvt is a homegrown program that converts EBCDIC to ASCII. If your EBCDIC files are not special in any way, you can use

dd conv=ascii if=...

instead of mvscvt.
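
For reference, the loop body can be built on dd instead of mvscvt. A minimal sketch, where status=none is a GNU dd option that merely suppresses the transfer statistics:

T=/tmp/ddcvtInp.$$
while read fn; do
    if [ -f "$fn" ]; then
        mv "$fn" "$T"  ||  echo "Error: fn=|$fn|, T=|$T|"
        dd conv=ascii if="$T" of="$fn" status=none   # standard EBCDIC-to-ASCII mapping
    fi
done < "$1"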

If possible, i.e., if all data fits into main memory, do this operation on a RAM disk. On Arch Linux /tmp is mounted as tmpfs, i.e., a RAM disk.
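
On distributions where /tmp is not tmpfs, a RAM disk can be mounted by hand. A sketch, where the mount point and the size limit of 8G are arbitrary choices:

sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=8G tmpfs /mnt/ramdisk
T=/mnt/ramdisk/mvscvtInp.$$   # point the temporary file at the RAM disk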

Added 13-Apr-2023: An alternative route. Split the list of all filenames into 64 parts:

find . -type f > fl
split -nl/64 fl flsp

Now run a program that can handle multiple filename arguments and therefore does not need to be started over and over again:

for i in flsp*; do mvscvt -ar $(cat $i) & done
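
Note that $(cat $i) splits on whitespace, so this one-liner assumes filenames without spaces. The same effect, without splitting the list by hand, can be had with xargs. A sketch, where -n 6250 caps the batch size so the work actually spreads across the -P 64 parallel processes, and -d '\n' makes filenames with blanks safe:

xargs -d '\n' -n 6250 -P 64 mvscvt -ar < fl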