Re: Read a fixed length of input each time
From: Neil R. Ormos
Subject: Re: Read a fixed length of input each time
Date: Tue, 23 Jun 2020 14:04:05 -0500 (CDT)
Andrew J. Schorr wrote:
> Neil R. Ormos wrote:
>> I've used this for a few different things. I
>> don't suggest these use cases justify any
>> changes to gawk or extensions.
>> 1. Detecting file type. [...]
>> 2. Extracting version information from Android APK
>> files on systems where Android Asset Packaging
>> Tool is not available.
>> 3. Detecting groups of files having common
>> initial chunks of N bytes. There are a few
>> different applications for this. One is
>> identifying likely, essentially duplicate
>> media files--e.g., video or audio files
>> that have the same substantive content and
>> differ only in metadata placed near the end
>> of the file. Although there may be
>> "better" ways to do it using the shell or
>> common utilities, the function of those
>> utilities can vary by platform, and if you
>> will need to process some of the content of
>> the file, orchestrating a shell pipeline
>> may not be more convenient or efficient.
> Do these all rely upon examining the beginning
> of the file? [...]
Yes, but not exclusively. For example, you might want to inspect the first
10^6 bytes, calculate a hash, and then extract some metadata near the end of
the file.
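As a rough shell sketch of that example, assuming a head(1) that accepts -c: the helper names prefix_digest and tail_bytes are made up for illustration, and cksum (a CRC, not a cryptographic hash) merely stands in for whatever hash you prefer.

```sh
# prefix_digest N FILE: print a checksum of the first N bytes of FILE.
# Hypothetical helper; cksum is a stand-in for a real hash function.
prefix_digest() {
    head -c "$1" < "$2" | cksum | cut -d ' ' -f 1
}

# tail_bytes N FILE: print the last N bytes of FILE, where trailing
# metadata might live. Hypothetical helper.
tail_bytes() {
    tail -c "$1" < "$2"
}
```

Two files whose first N bytes agree produce the same prefix_digest output even if they differ later on, which is the grouping idea from use case 3.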
> If so, could one instead use "head
> -c <n>" to read the first <n> bytes into gawk?
Maybe. I intended "better... using the shell or common utilities" to allude to
that possibility. But there are environments where head(1) does not recognize
the -c option. I guess dd(1) is your friend in that case.
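For illustration only, a wrapper along these lines could hide the difference between the two tools. The name first_bytes is hypothetical, and the probe assumes that a head(1) without -c support exits non-zero when given that option.

```sh
# first_bytes N FILE: print the first N bytes of FILE on stdout.
# Hypothetical helper: uses head -c where available, otherwise dd.
first_bytes() {
    n=$1
    file=$2
    if head -c 1 /dev/null >/dev/null 2>&1; then
        head -c "$n" < "$file"
    else
        # bs=1 is slow for large N but never returns a short block;
        # dd's record counts on stderr are discarded.
        dd bs=1 count="$n" < "$file" 2>/dev/null
    fi
}
```

If FILE is shorter than N bytes, both branches simply emit the whole file.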
Also, consider a situation where the gawk program is non-trivially processing a
list of files. There are situations where you might need a little more than
just:
    cat list-of-files | xargs -I '{}' head -c 2048 '{}' | gawk '{ something }'
I guess you could build a pipeline including "head -c" in gawk, and then read
the results, but I'm not sure how that's better than just reading the file
directly in gawk.
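To make the "pipeline inside gawk" variant concrete, here is a minimal sketch, not a recommendation. The names first_chunks and head_bytes are invented for the example; it assumes gawk is installed, that head(1) accepts -c, and that no file name contains a single quote. The RS = "^$" trick for slurping a whole input as one record is the one used by the readfile() example in the gawk manual.

```sh
# first_chunks LIST: for each file named in LIST (one name per line),
# print the name and the number of bytes obtained for its first chunk.
first_chunks() {
    gawk -b '
    # Read up to n leading bytes of file through a "head -c" pipeline.
    # \047 is a single quote; names containing one are not handled.
    function head_bytes(file, n,    cmd, chunk, save_rs) {
        cmd = "head -c " n " < \047" file "\047"
        save_rs = RS
        RS = "^$"                  # treat the whole command output as one record
        if ((cmd | getline chunk) <= 0)
            chunk = ""             # empty or unreadable file
        close(cmd)
        RS = save_rs               # restore before the next list line is read
        return chunk
    }
    {
        chunk = head_bytes($0, 2048)
        printf "%s\t%d\n", $0, length(chunk)
    }' "$1"
}
```

The -b option makes gawk treat the data as bytes rather than locale-dependent characters. The body of the main rule is where the per-file processing that motivates staying inside gawk would go; printing the length is just a placeholder.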
Again, I'm not urging any changes. I was just explaining to the OP how he might
use gawk's existing capabilities to read binary data in fixed-sized chunks, and
then responding to your request for use cases.