Improvements in large single-line text files

sed-devel

From:	Yoshio HASEGAWA
Subject:	Improvements in large single-line text files
Date:	Tue, 16 Nov 2021 01:06:55 +0900 (JST)

Hello all,

Recently I encountered an error message regarding "buffer length".
After some research, I concluded that we can remove
this limitation for large files (over 2 GB).

I was removing newline characters embedded in the fields of TSV files,
that is, newlines other than the file's actual line separators.
  linuxlite@linuxlite:~/SomeTest$ sed -z 's/"\x0A"/\x01/g' test.txt \
    | sed -z 's/[\x0A\x0D]//g' | sed 's/\x01/\x0A/g' > test_out.txt
  sed: regex input buffer length larger than INT_MAX
# Environment: sed version 4.7 (in x86_64 / Ubuntu 20.04.3 LTS)

I read about a fix for a similar case in the past:
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=30520
I then noticed a few points from the perspective of a POSIX-conforming application,
so I changed the code as shown in the attached patch.
I have also tested string substitution on a file slightly larger than 4 GB.

  sh-4.4# truncate -s 2G input
  sh-4.4# printf 'aaaaa' >> input
  sh-4.4# truncate -s +2G input
  sh-4.4# printf 'aaaaa\n' >> input
  sh-4.4#
  sh-4.4# sed/sed 's/a/b/g' input > output
  sh-4.4# head -c `expr 2 \* 1024 \* 1024 \* 1024 + 5` output | tail -c 5 > rpl1
  sh-4.4# tail -c 6 output | head -c 5 >> rpl1
  sh-4.4# od -tx1c rpl1
  0000000  62  62  62  62  62  62  62  62  62  62
            b   b   b   b   b   b   b   b   b   b
  0000012
# Environment: sed version 4.8 (modified; x86_64 / docker container based on centos:8)
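
As a rough illustration of why this test file exercises offsets beyond INT_MAX, the byte positions produced by the truncate/printf steps above can be worked out as follows. This is only a sketch: the helper names (first_run_offset, second_run_offset, total_size, fits_in_int) are made up for this note and are not part of sed or of the patch.

```c
#include <limits.h>
#include <stdint.h>

/* Size of each sparse region created by `truncate -s 2G` and
   `truncate -s +2G` (2 GiB).  Hypothetical name, for illustration. */
#define HOLE_BYTES (INT64_C(2) * 1024 * 1024 * 1024)

/* Offset of the first 'aaaaa' run: right after the first 2 GiB region. */
static int64_t first_run_offset(void)
{
  return HOLE_BYTES;
}

/* Offset of the second 'aaaaa' run: first region + 5 bytes + second region. */
static int64_t second_run_offset(void)
{
  return HOLE_BYTES + 5 + HOLE_BYTES;
}

/* Total file size: the second run starts here, followed by "aaaaa\n". */
static int64_t total_size(void)
{
  return second_run_offset() + 6;
}

/* Nonzero if a byte offset is representable in a plain int. */
static int fits_in_int(int64_t offset)
{
  return offset >= INT_MIN && offset <= INT_MAX;
}
```

With a 32-bit int, neither run starts at an offset that fits in an int, which is why the head/tail commands above extract bytes at 2 GiB and at 4 GiB + 5. The only newline is the last byte, so the file is a single line of a little over 4 GiB.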

Detailed background on the changes follows.

1. Type of the return value of re_search
When compiled on 64-bit systems with _REGEX_LARGE_OFFSETS defined (seemingly the default),
the return type is ssize_t, which is 8 bytes long,
as defined in the code below.

lib/regex.h L553
  extern regoff_t re_search (struct re_pattern_buffer *__buffer,
lib/regex.h L480
  /* Type for byte offsets within the string.  POSIX mandates this.  */
  #ifdef _REGEX_LARGE_OFFSETS
  (omitted comments...)
  typedef ssize_t regoff_t;
  #else
  (omitted comments...)
  typedef int regoff_t;
  #endif

So I think we should store this value in a type wider than int;
otherwise it is received as a negative number, as reported in bug#30520.
I changed the type of the variable "ret" in match_regex (sed/regexp.c) to regoff_t,
and also changed the related checks to use a new constant, "REG_MAX".
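
The narrowing problem can be sketched in isolation like this. It is not code from the patch: wide_off_t stands in for regoff_t under _REGEX_LARGE_OFFSETS, WIDE_OFF_MAX is a hypothetical stand-in for the kind of limit REG_MAX provides (the patch's actual definition is not reproduced here), and the function names are invented for illustration.

```c
#include <limits.h>
#include <stdint.h>

/* Stand-in for regoff_t when it is ssize_t on a 64-bit system.
   int64_t keeps this sketch independent of <regex.h>. */
typedef int64_t wide_off_t;

/* Hypothetical upper limit for the wide offset type (assumption). */
#define WIDE_OFF_MAX INT64_MAX

/* Mimics storing a wide offset in an int variable.  For out-of-range
   values the result is implementation-defined; GCC and Clang on x86_64
   wrap modulo 2^32, so offsets just above INT_MAX come out negative. */
static int store_in_int(wide_off_t off)
{
  return (int) off;
}

/* Nonzero if the offset survives a round trip through int unchanged. */
static int survives_int(wide_off_t off)
{
  return (wide_off_t) store_in_int(off) == off;
}

/* Nonzero if a buffer length is within the given non-negative limit. */
static int length_fits(uint64_t len, int64_t limit)
{
  return len <= (uint64_t) limit;
}
```

For example, an offset of INT_MAX + 6 (the position of the second run in a file like the test input above, minus 2 GiB) does not survive storage in an int, whereas keeping it in the wide type and checking lengths against the wide limit accepts the 4 GiB input.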

A similar implementation (using regoff_t) is found in grep's source code.
It also shows that searches over inputs longer than INT_MAX have
already been proven to some extent.

src/dfasearch.c L355
  regoff_t start;

2. Behavior when re_search returns a value < -1
I added a return-value check in match_regex, as grep does.
(I couldn't find a way to reproduce this situation, though.)

src/dfasearch.c L492
  if (start < -1)
    xalloc_die ();
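
For reference, GNU re_search returns the match start (>= 0) on success, -1 when nothing matches, and -2 on an internal error. A minimal sketch of how a caller might separate the three cases is below; the enum and function names are hypothetical and this is not the code in the attached patch.

```c
/* Possible outcomes of a re_search-style call (hypothetical names). */
enum search_outcome
{
  SEARCH_MATCH,     /* start >= 0: offset of the match */
  SEARCH_NO_MATCH,  /* start == -1 */
  SEARCH_ERROR      /* start < -1: internal error, e.g. out of memory */
};

/* Classify a return value held in a wide signed type.  The parameter is
   long long here so the sketch does not depend on <regex.h>; in real
   code it would be regoff_t. */
static enum search_outcome classify_search_result(long long start)
{
  if (start < -1)
    return SEARCH_ERROR;
  if (start == -1)
    return SEARCH_NO_MATCH;
  return SEARCH_MATCH;
}
```

This check is only reliable if the value has not been narrowed to int first: a valid match offset just above INT_MAX that wraps to a negative int would otherwise be misclassified as an error.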

I hope these modifications make sed considerably more robust
in the situations I mentioned at the beginning.
What do you think?

Best Regards,

Yoshio

sed-removing-int-max-limit.patch
Description: Text Data
