Re: sort: memory exhausted with 50GB file
From: Leo Butler
Subject: Re: sort: memory exhausted with 50GB file
Date: Sat, 26 Jan 2008 15:05:30 +0000 (GMT)
< Paul Eggert <address@hidden> wrote:
< ...
< > Hmm, it sounds like your input data has some very long lines, then.
< > That would explain at least part of your problem, then. 'sort' needs
< > to keep at least two lines in main memory to compare them: if single
< > input lines are many gigabytes long, then 'sort' must consume many
< > gigabytes of memory, regardless of what parameter you specify with '-S'.
<
< You can run this to find the maximum line length:
<
< wc --max-line-length your-data
Ok, first, let me thank Jim, Bob and Paul.
Here is the problem in a nutshell:
wc is counting with long ints, and the first line of this 50GB file is a string
of \0 whose length appears to be negative when counted with long ints. (Details
below).
I believe that this must be an error in the header file where 'uintmax_t' is
defined.
I do not know if one can consider this behaviour as a bug in sort, but
it seems to me that sort might issue a warning if it encounters 'n>0'
consecutive null characters in a file.
---
I have squeezed out the null characters with tr and am attempting
to sort the transformed file. This has shrunk the file from 50GB to 7GB, so I
anticipate no problems. I will report back.
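For anyone without tr to hand, the same clean-up can be sketched in C. This is a minimal sketch only: the exact tr invocation is not shown above, so it assumes the intent was to delete every NUL byte (the effect of tr -d '\000'), and the name strip_nuls is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper, not taken from tr or sort.
 * Copies n bytes from in to out, skipping NUL bytes, and returns the
 * number of bytes written. out must have room for n bytes; passing
 * in == out is allowed, because the write index never passes the read
 * index. */
static size_t strip_nuls(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t w = 0;

    for (size_t r = 0; r < n; r++) {
        if (in[r] != '\0')
            out[w++] = in[r];
    }
    return w;
}
```

Applied to each buffer read from the input, with the returned number of bytes written to the output, this would act as a streaming filter.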
---
Leo Butler.
Details:
-------
In my original post I mentioned I did count the max line length:
$ /usr/bin/wc -L /data/espace/k_400_a.out
107
Here is the abridged output of a routine that counts the occurrences of all ASCII
characters:
$ ./census /data/espace/k_400_a.out
Ascii char            Count
----------            -----
\0 Null character     -1363090872
(snip)
The longest line was identified at about line 65x10^6, with 108 chars incl. \n.
Ouch! Look at that count of \0. The routine was counting with long ints, so I
recompiled it with unsigned longs, and got
Ascii char            Count
----------            -----
\0 Null character     2931876424
(snip)
Longest line 2931876444 chars at line 1
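For reference, a census of this kind could keep its tallies in fixed 64-bit counters, so that a 50GB file cannot wrap them whatever the width of long on the machine. This is a minimal sketch and not the census program itself: the struct layout and the name census_stream are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the census routine; not the original code. */
struct census {
    uint64_t count[256];    /* occurrences of each byte value */
    uint64_t longest_len;   /* length of the longest line, including '\n' */
    uint64_t longest_line;  /* 1-based number of that line */
};

/* Read fp to EOF, tallying every byte and tracking the longest line. */
static void census_stream(FILE *fp, struct census *c)
{
    uint64_t line = 1, len = 0;
    int ch;

    for (int i = 0; i < 256; i++)
        c->count[i] = 0;
    c->longest_len = 0;
    c->longest_line = 0;

    while ((ch = getc(fp)) != EOF) {
        c->count[ch]++;         /* getc returns 0..255 here */
        len++;
        if (ch == '\n') {
            if (len > c->longest_len) {
                c->longest_len = len;
                c->longest_line = line;
            }
            line++;
            len = 0;
        }
    }
    /* A final line without a trailing newline still counts. */
    if (len > c->longest_len) {
        c->longest_len = len;
        c->longest_line = line;
    }
}
```

The counts can be printed portably with the PRIu64 macro from <inttypes.h>.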
The two counts of \0 are congruent mod 2^32 (that is, ULONG_MAX + 1 for a
32-bit long). Apparently, the first line contained roughly 42GB worth of null
characters. I have no bleeding idea how this crept in.
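The arithmetic behind the two counts can be checked with a short sketch. It assumes that long is 32 bits wide on this machine, which the numbers suggest, and it assumes the counter wrapped 10 times, a figure inferred only from the rough 42GB estimate. Both function names are hypothetical.

```c
#include <stdint.h>

/* Reinterpret a wrapped signed 32-bit count as unsigned.
 * The conversion is defined modulo 2^32, so a negative value gains 2^32. */
static uint32_t as_unsigned32(int32_t wrapped)
{
    return (uint32_t)wrapped;
}

/* Rebuild a full count from its 32-bit residue and an assumed number of
 * complete wraps of the counter. */
static uint64_t unwrap_count(uint32_t residue, unsigned wraps)
{
    return (uint64_t)residue + ((uint64_t)wraps << 32);
}
```

Here as_unsigned32(-1363090872) gives 2931876424, matching the recompiled census, and unwrap_count(2931876424u, 10) gives 45881549384 bytes, about 42.7 GiB.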
LB.