Re: Improvements in large single-line text files

From: Assaf Gordon
Subject: Re: Improvements in large single-line text files
Date: Mon, 15 Nov 2021 12:31:46 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.14.0


On 2021-11-15 9:06 a.m., Yoshio HASEGAWA wrote:
> Recently I've encountered an error message regarding "buffer length".
> With some research, I came to the conclusion that we can remove
> this limitation regarding large files (over 2 GB).

Thanks for the detailed report and patch.

To be more specific:
The problem is not in "general" editing of large files (>2GB),
but in loading such files as a single line, and then running a regular expression on that buffer.

> I was working on removing newline characters in TSV files
> other than the file's actual line separators.
>    linuxlite@linuxlite:~/SomeTest$ sed -z 's/"\x0A"/\x01/g' test.txt \
>    | sed -z 's/[\x0A\x0D]//g' | sed 's/\x01/\x0A/g' > test_out.txt
>    sed: regex input buffer length larger than INT_MAX
> # Environment: sed version 4.7 (x86_64 / Ubuntu 20.04.3 LTS)

As a side note, you might be able to get better performance by combining
sed with tr(1): have sed append a '\x01' marker after "real" line breaks
in the TSV:

  cat test.txt \
     | sed 's/"$/"\x01/' | tr -d '\r\n' | tr '\001' '\n' > test_out.txt

Though in general, multi-line TSV/CSV files are finicky enough that it's
likely better to use a more sophisticated loader than plain text
processing with sed.

> I read about modifications made in a similar case in the past.
> I also had a few concerns from the perspective of a POSIX-conforming
> application, so I made the changes to the code as attached.
> I have also tested string substitution against a file slightly larger
> than 4 GB.

>    sh-4.4# truncate -s 2G input
>    sh-4.4# printf 'aaaaa' >> input
>    sh-4.4# truncate -s +2G input
>    sh-4.4# printf 'aaaaa\n' >> input
>    sh-4.4# sed/sed 's/a/b/g' input > output
>    sh-4.4# head -c `expr 2 \* 1024 \* 1024 \* 1024 + 5` output | tail -c 5 > rpl1
>    sh-4.4# tail -c 6 output | head -c 5 >> rpl1
>    sh-4.4# od -tx1c rpl1
>    0000000  62  62  62  62  62  62  62  62  62  62
>              b   b   b   b   b   b   b   b   b   b
> # Environment: sed version 4.8 (modified, x86_64 / docker container based on

> Detailed background on the changes follows.
>
> 1. type of the return value from re_search
> When compiled on 64-bit systems with _REGEX_LARGE_OFFSETS (seemingly
> the default), the return value type becomes ssize_t, which is 8 bytes
> long, as defined in the code below.

> lib/regex.h L553
>    extern regoff_t re_search (struct re_pattern_buffer *__buffer, ...
> lib/regex.h L480
>    /* Type for byte offsets within the string.  POSIX mandates this.  */
>    (omitted comments...)
>    typedef ssize_t regoff_t;
>    (omitted comments...)
>    typedef int regoff_t;

> So I think we should receive this value in a type wider than int;
> otherwise we receive it as a negative number, as reported in bug30520.
> I changed the type of the variable "ret" in match_regex (sed/regexp.c)
> to regoff_t, and also changed the related checks to use a new constant,
> "REG_MAX".

> A similar implementation (using regoff_t) is found in grep's sources.
> It also shows us that searches longer than INT_MAX have already been
> proven to some extent.
>
> src/dfasearch.c L355
>    regoff_t start;

> 2. the behavior when re_search returns a value < -1
> I added a return-value check in match_regex, as grep does.
> (I couldn't find a way to produce that situation, though.)
>
> src/dfasearch.c L492
>    if (start < -1)
>      xalloc_die ();

> I hope these modifications will make sed a lot more robust
> in the situations I mentioned at the beginning.
> What do you think?

This will require some careful testing, as we don't want to introduce
any regressions (e.g. on 32-bit systems), so it will take more time to evaluate.

 - assaf
