Improvements in large single-line text files
From: Yoshio HASEGAWA
Subject: Improvements in large single-line text files
Date: Tue, 16 Nov 2021 01:06:55 +0900 (JST)
Hello all,
Recently I encountered an error message about "buffer length".
After some research, I came to the conclusion that we can remove
this limitation on large inputs (over 2 GB).
I was working on removing newline characters in TSV files
that are not the actual line separators of the file.
linuxlite@linuxlite:~/SomeTest$ sed -z 's/"\x0A"/\x01/g' test.txt \
  | sed -z 's/[\x0A\x0D]//g' | sed 's/\x01/\x0A/g' > test_out.txt
sed: regex input buffer length larger than INT_MAX
# Environment: sed version 4.7 (in x86_64 / Ubuntu 20.04.3 LTS)
I read about the changes made for a similar case in the past:
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=30520
I then noticed a few things from the perspective of a POSIX-conforming application,
so I made the attached changes to the code.
I have also tested string substitution against a file slightly larger than 4GB.
sh-4.4# truncate -s 2G input
sh-4.4# printf 'aaaaa' >> input
sh-4.4# truncate -s +2G input
sh-4.4# printf 'aaaaa\n' >> input
sh-4.4#
sh-4.4# sed/sed 's/a/b/g' input > output
sh-4.4# head -c `expr 2 \* 1024 \* 1024 \* 1024 + 5` output | tail -c 5 > rpl1
sh-4.4# tail -c 6 output | head -c 5 >> rpl1
sh-4.4# od -tx1c rpl1
0000000 62 62 62 62 62 62 62 62 62 62
b b b b b b b b b b
0000012
# Environment: sed version 4.8 (modified. x86_64 / docker container based on
centos:8)
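The layout of that test input can be checked with a little arithmetic. The following is only a sketch: the macro and function names are made up for illustration, and it assumes that "2G" means 2 GiB (as truncate interprets it) and that int is 32 bits wide.

```c
#include <limits.h>
#include <stdint.h>

/* 2 GiB, the size passed to "truncate -s 2G" (illustrative macro name). */
#define TWO_GIB (INT64_C(2) * 1024 * 1024 * 1024)

/* Byte offset of the first run of "aaaaa": it starts right after the
   initial 2 GiB hole. */
static int64_t first_run_offset (void)
{
  return TWO_GIB;
}

/* Total input size: 2 GiB hole + "aaaaa" + 2 GiB hole + "aaaaa\n". */
static int64_t total_size (void)
{
  return TWO_GIB + 5 + TWO_GIB + 6;
}

/* Nonzero if n cannot be represented as a non-negative int. */
static int exceeds_int_max (int64_t n)
{
  return n > INT_MAX;
}
```

Under those assumptions the first substitution already happens at offset INT_MAX + 1, and since the only newline is the last byte, the whole file is one line of a little over 4 GiB.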
Detailed background on the changes follows.
1. type of return value from re_search
When compiled on a 64-bit system with _REGEX_LARGE_OFFSETS (seemingly the default),
the return type becomes ssize_t, which is 8 bytes long,
as defined in the code below.
lib/regex.h L553
extern regoff_t re_search (struct re_pattern_buffer *__buffer,
lib/regex.h L480
/* Type for byte offsets within the string. POSIX mandates this. */
#ifdef _REGEX_LARGE_OFFSETS
(omitted comments...)
typedef ssize_t regoff_t;
#else
(omitted comments...)
typedef int regoff_t;
#endif
So I think we should receive this value in a type wider than int;
otherwise the value is received as a negative number, as reported in bug 30520.
I changed the type of the variable "ret" in match_regex (sed/regexp.c) to regoff_t,
and also changed the related checks to use a new constant, "REG_MAX".
A similar implementation (using regoff_t) is found in grep's source code.
It also shows that searches over inputs longer than INT_MAX have
already been proven to some extent.
src/dfasearch.c L355
regoff_t start;
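
To illustrate item 1, here is a minimal sketch of what happens when a 64-bit offset is stored in an int. It is not the code from sed or from the patch: wide_off_t is a hypothetical stand-in for regoff_t, and the function names are invented. Converting an out-of-range value to int is implementation-defined in C; the sketch assumes the wrap-around behavior of GCC and Clang with a 32-bit int.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical stand-in for regoff_t on a 64-bit build. */
typedef int64_t wide_off_t;

/* Store a wide offset in an int, as an int-typed "ret" would.
   For out-of-range values the result is implementation-defined;
   GCC and Clang reduce the value modulo 2^32, so offsets just
   above INT_MAX come out negative. */
static int narrow_to_int (wide_off_t off)
{
  return (int) off;
}

/* Nonzero if the offset is representable in an int without loss. */
static int fits_in_int (wide_off_t off)
{
  return off >= INT_MIN && off <= INT_MAX;
}
```

With these assumptions, a match offset of INT_MAX + 6 does not fit in an int and narrows to a negative value, which a caller would misread as a failure code. Keeping the variable in the wide type avoids the conversion entirely.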
2. the behavior when re_search returns a value less than -1
I added a return value check in match_regex, as grep does.
(I couldn't find a way to reproduce this situation, though.)
src/dfasearch.c L492
if (start < -1)
xalloc_die ();
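
The three cases a caller has to distinguish can be sketched as follows. This is an illustration only, not the patch itself: the enum, the typedef, and the function name are hypothetical. It relies on the GNU regex convention that re_search returns the match position on success, -1 when nothing matches, and -2 on an internal error.

```c
#include <stdint.h>

/* Hypothetical stand-in for regoff_t on a 64-bit build. */
typedef int64_t wide_off_t;

/* Illustrative outcome labels; not names used by sed or grep. */
enum search_outcome { SEARCH_MATCH, SEARCH_NO_MATCH, SEARCH_ERROR };

/* Classify a re_search-style return value. Anything below -1 is
   treated as an error, mirroring grep's "start < -1" check. */
static enum search_outcome classify_search_result (wide_off_t start)
{
  if (start < -1)
    return SEARCH_ERROR;
  if (start == -1)
    return SEARCH_NO_MATCH;
  return SEARCH_MATCH;
}
```

The classification is only reliable when the value is held in the wide type. Once an offset has been narrowed to int, a legitimate large match position can itself look like a value below -1.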
I hope these modifications will make sed considerably more robust
in the situations I mentioned at the beginning.
What do you think?
Best Regards,
Yoshio
sed-removing-int-max-limit.patch
Description: Text Data