Parallel Mass-File Processing
The original post is at eklausmeier.goip.de/blog/2021/10-02-parallel-mass-file-processing.
Task at hand: Process ca. 400,000 files. In our case each file needed to be converted from EBCDIC to ASCII.
Obviously, you could do this sequentially. But on a multiprocessor machine you should make use of all available processing power. The chosen approach is as follows:
- Generate a list of all files to be processed, i.e., a file containing all filenames, henceforth called fl. For example:
find . -mindepth 2 > fl
- Split fl into 32 parts ("chunks"). The l/32 form splits at line boundaries, so no filename is cut in half:
split -nl/32 fl fl.
- Each chunk is now processed in parallel:
for i in fl.??; do processEachChunk $i & done
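The three steps can be combined into one small script. What follows is only a minimal sketch, not the original setup: the function name process_chunk, the demo directory, the chunk count of 4, and the tr-based stand-in converter are all assumptions made for illustration. With real data the loop body would call the actual converter. The final wait, which the one-liner above does not have, blocks until all background jobs have finished.

```shell
# Sketch: list files, split the list into chunks, process chunks in parallel.
set -u

# Hypothetical demo data: 3 directories with 4 small files each.
demo=$(mktemp -d)
cd "$demo" || exit 1
for d in a b c; do
    mkdir "$d"
    for n in 1 2 3 4; do echo "hello" > "$d/file$n"; done
done

# Stand-in for the real per-chunk worker (assumption: uppercases via tr).
process_chunk() {
    while IFS= read -r fn; do
        [ -f "$fn" ] || continue      # skip anything that is not a regular file
        tr 'a-z' 'A-Z' < "$fn" > "$fn.tmp" && mv "$fn.tmp" "$fn"
    done < "$1"
}

find . -mindepth 2 -type f > fl       # step 1: list of all files
split -n l/4 fl fl.                   # step 2: 4 chunks, fl.aa to fl.ad
for i in fl.??; do                    # step 3: one background job per chunk
    process_chunk "$i" &
done
wait                                  # block until every job has finished
```

The number of chunks is a tuning knob: a value near the number of CPU cores is a reasonable starting point, and more chunks help when the files differ widely in size.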
In our case each file is processed as below, i.e., processEachChunk looks like:
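# processEachChunk: convert every file listed in the chunk file given as $1
T=/tmp/mvscvtInp.$$        # per-process temporary file
while IFS= read -r fn; do
    # the file list may also contain directories, so skip non-files
    if [ -f "$fn" ]; then
        if mv "$fn" "$T"; then
            mvscvt -a < "$T" > "$fn"
        else
            # do not convert if the move failed, otherwise stale data in $T would overwrite $fn
            echo "Error: fn=|$fn|, T=|$T|" >&2
        fi
    fi
done < "$1"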
Here mvscvt is the homegrown program to convert EBCDIC to ASCII. If your EBCDIC files are not special in any way, you can use
dd conv=ascii if=...
instead of mvscvt.
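As a quick check of what dd does here, the following sketch runs a round trip: conv=ebcdic turns ASCII text into EBCDIC bytes, and conv=ascii converts them back. The sample text and the variable names are illustrative only. Note that dd uses its own fixed conversion table, which may differ from the code page of a particular mainframe export, so verify a few real files before converting everything.

```shell
# Round trip ASCII -> EBCDIC -> ASCII using dd's built-in tables.
sample='Hello'

# EBCDIC bytes of the sample, shown as hex without separators.
ebcdic_hex=$(printf '%s' "$sample" | dd conv=ebcdic 2>/dev/null | od -An -tx1 | tr -d ' \n')

# Converting back must reproduce the original text.
roundtrip=$(printf '%s' "$sample" | dd conv=ebcdic 2>/dev/null | dd conv=ascii 2>/dev/null)

echo "$ebcdic_hex"
echo "$roundtrip"    # prints: Hello
```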
If possible, i.e., if all data fits into main memory, do this operation on a RAM disk. On Arch Linux /tmp is mounted as tmpfs, i.e., a RAM disk.
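Whether /tmp is actually memory-backed on a given machine can be checked before starting. A small sketch, assuming GNU coreutils (the -f -c %T form of stat is GNU-specific); the variable name fstype is illustrative:

```shell
# Determine the filesystem type backing /tmp; "tmpfs" means it lives in RAM.
fstype=$(stat -f -c %T /tmp)
if [ "$fstype" = "tmpfs" ]; then
    echo "/tmp is a RAM disk (tmpfs)"
else
    echo "/tmp is on $fstype, not tmpfs"
fi
df -h /tmp    # shows how much space is available there
```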
Added 13-Apr-2023: Alternative route. Split into 64 files with all the filenames:
find . -type f > fl
split -nl/64 fl flsp
Now run a program that can handle multiple arguments and therefore does not need to be started once per file.
for i in flsp*; do mvscvt -ar `cat $i` & done
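The backtick form relies on shell word splitting, so filenames containing whitespace would be broken apart, and each chunk has to fit within the kernel's argument-length limit. One possible variation, not the approach used in the original, is to let xargs do the batching and the parallelism. In this sketch sed -i (GNU form) is merely a stand-in for a multi-argument converter such as mvscvt -ar; the demo directory, the file names, and the values for -n and -P are assumptions.

```shell
# Sketch: xargs batches the file names and runs several processes at once.
demo=$(mktemp -d)
echo "ebcdic" > "$demo/plain.txt"
echo "ebcdic" > "$demo/with space.txt"    # name containing a space

# -print0 / -0 keep unusual file names intact,
# -n caps the number of arguments per invocation,
# -P runs up to 4 invocations concurrently.
find "$demo" -type f -print0 | xargs -0 -n 1000 -P 4 sed -i 's/ebcdic/ascii/'

cat "$demo/with space.txt"
```

No intermediate list or chunk files are needed in this form, since xargs starts a new batch as soon as a process slot becomes free.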