bug-gawk

Re: Read a fixed length of input each time


From: Neil R. Ormos
Subject: Re: Read a fixed length of input each time
Date: Tue, 23 Jun 2020 14:04:05 -0500 (CDT)

Andrew J. Schorr wrote:
> Neil R. Ormos wrote:

>> I've used this for a few different things.  I
>> don't suggest these use cases justify any
>> changes to gawk or extensions.

>> 1.  Detecting file type. [...]

>> 2.  Extracting version information from Android APK
>>     files on systems where Android Asset Packaging
>>     Tool is not available.

>> 3.  Detecting groups of files having common
>>     initial chunks of N bytes.  There are a few
>>     different applications for this. One is
>>     identifying what are probably near-duplicate
>>     media files--e.g., video or audio files
>>     that have the same substantive content and
>>     differ only in metadata placed near the end
>>     of the file.  Although there may be
>>     "better" ways to do it using the shell or
>>     common utilities, the function of those
>>     utilities can vary by platform, and if you
>>     will need to process some of the content of
>>     the file, orchestrating a shell pipeline
>>     may not be more convenient or efficient.

> Do these all rely upon examining the beginning
> of the file?  [...]

Yes, but not exclusively.  For example, you might want to inspect the first 
10^6 bytes, calculate a hash, and then extract some metadata near the end of 
the file.

> If so, could one instead use "head
> -c <n>" to read the first <n> bytes into gawk?

Maybe.  I intended "better... using the shell or common utilities" to allude to 
that possibility.  But there are environments where head(1) does not recognize 
the -c option.  I guess dd(1) is your friend in that case.
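As a rough illustration of the dd(1) fallback, here is a minimal sketch. The function name first_bytes is hypothetical, and it assumes the input is a regular file (on a pipe, a single read may return fewer bytes than requested):

```shell
# first_bytes N FILE: write the first N bytes of FILE to stdout.
# Hypothetical helper; uses only the POSIX dd operands if=, bs= and count=.
# Assumes FILE is a regular file, so one read of N bytes is not cut short.
first_bytes() {
    dd if="$2" bs="$1" count=1 2>/dev/null
}
```

Something like "first_bytes 2048 some-file | cksum" would then give a quick fingerprint of the leading chunk without relying on "head -c".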

Also, consider a gawk program that does non-trivial processing on a list of 
files.  You might then need a little more than just:

   cat list-of-files | xargs -I '{}' head -c 2048 '{}' | gawk '{ something }'

I guess you could build a pipeline including "head -c" in gawk, and then read 
the results, but I'm not sure how that's better than just reading the file 
directly in gawk.
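
For comparison, the "head -c" pipeline built inside awk might look roughly like the sketch below. The names prefix_hex and shell_quote are hypothetical, and the od(1) step is an added assumption: it turns the raw bytes into hex text so that NUL bytes and embedded newlines cannot confuse record splitting. It reads one file name per line on standard input and prints the hex of the leading chunk, a tab, and the file name.

```shell
# prefix_hex [N]: for each file name on stdin, print the first N bytes
# (default 2048) as hex, then a tab, then the file name.
# Hypothetical helper; assumes "head -c" is available on this system.
prefix_hex() {
    awk -v n="${1:-2048}" '
    # Hypothetical helper: wrap s in single quotes for sh,
    # rewriting each embedded quote (octal 047) as the four characters: quote, backslash, quote, quote.
    function shell_quote(s,    out, i) {
        out = ""
        while ((i = index(s, "\047")) > 0) {
            out = out substr(s, 1, i - 1) "\047\\\047\047"
            s = substr(s, i + 1)
        }
        return "\047" out s "\047"
    }
    {
        # Build the per-file pipeline and read its output line by line.
        cmd = "head -c " n " " shell_quote($0) " | od -An -v -tx1"
        hex = ""
        while ((cmd | getline line) > 0) {
            gsub(/[ \t]/, "", line)     # drop the spacing od puts between bytes
            hex = hex line
        }
        close(cmd)                      # release the pipe before the next file
        print hex "\t" $0
    }'
}
```

Running "prefix_hex 2048 < list-of-files | sort" would place files with identical leading chunks next to each other. It also spawns a shell, head and od for every file, which is part of why this route is not obviously better than reading the file directly.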

Again, I'm not urging any changes. I was just explaining to the OP how he might 
use gawk's existing capabilities to read binary data in fixed-sized chunks, and 
then responding to your request for use cases.


