ETL programs like Ab Initio know how to tell parallel processes to split up a big file and process each part separately, even when the file is linefeed-delimited: each worker seeks to roughly i * size / N bytes into the file and then scans forward (or backward) for the nearest dividing linefeed, so they all agree on the same cut points without anyone reading the file sequentially. Does anyone know of a utility that can split a file this way (without reading it sequentially)? Is this in GNU parallel?
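To make concrete what I mean by the boundary search, here is a minimal Python sketch (chunk_boundaries is my own made-up name, not an existing utility):

```python
import os

def chunk_boundaries(path, n_chunks):
    """Compute n_chunks byte ranges for a linefeed-delimited file
    without reading it sequentially: seek near each nominal cut
    point, then scan forward to the next linefeed so no record is
    ever split.  Hypothetical sketch, not an existing tool."""
    size = os.path.getsize(path)
    cuts = [0]
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            f.seek(size * i // n_chunks)  # jump to ~i/N of the file
            f.readline()                  # advance to the next linefeed
            pos = f.tell()
            if pos > cuts[-1]:            # drop duplicate/degenerate cuts
                cuts.append(pos)
    if cuts[-1] != size:
        cuts.append(size)
    return list(zip(cuts, cuts[1:]))      # half-open (start, end) ranges
```

Each worker would then seek to its start offset and read end - start bytes, so the file is only ever touched in parallel.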
It'd be nice to be able to take a list of mixed-size files and divide them into N chunks of approximately equal line counts, estimated from byte sizes, with an algorithm that searches for the record delimiter (linefeed) so that no records are lost. Sort of a mixed-input leveller for parallel loads. If it is part of parallel, then parallel could launch the processing for each chunk and combine the chunks afterwards.
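The leveller part might look something like this, reusing the sketch above (again purely illustrative, level_chunks is a made-up name):

```python
def level_chunks(paths, n_workers):
    """Divide a mixed-size list of files into ~n_workers work units
    of roughly equal byte size, splitting large files on linefeed
    boundaries via chunk_boundaries() above.  Purely illustrative."""
    total = sum(os.path.getsize(p) for p in paths)
    target = max(1, total // n_workers)
    units = []
    for p in paths:
        pieces = max(1, round(os.path.getsize(p) / target))  # big files get several pieces
        units += [(p, s, e) for s, e in chunk_boundaries(p, pieces)]
    return units  # [(path, start, end)] one tuple per work unit
```

Each (path, start, end) unit could be handed to a separate worker, with the outputs combined in the original order afterwards.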