From: George Zarkadas
Subject: Ver. 3.1.4 & 3.1.3 Windows ports: Chopped record count at large files
Date: Sun, 27 Aug 2006 13:19:08 +0300

gawk reports a record count much smaller than the actual one when processing
a large (~ 275 MB) text file.
This behavior occurs in:
version 3.1.4, xmlgawk Windows port
(downloaded from
http://lml.ls.fi.upm.es/~mcollado/xmlgawk/xmlgawk-3.1.4_20040920_mingw.zip)
version 3.1.3, gnuwin32 Windows port
(downloaded from http://sourceforge.net/projects/gnuwin32/ )
but not in version 3.0.4 (MinGW Windows port), which gives the correct
results (as verified by independent checks).
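For anyone wishing to reproduce such an independent check without gawk, the following is a minimal sketch in Python. It is not one of the attached scripts. It assumes that every BibTeX record starts on its own line with "@", an entry type, and an opening "{" or "("; the function name count_bibtex_records and the file name used below are hypothetical. The file is read in binary mode, so no newline translation or character decoding is applied to it.

```python
import re

# Assumption: a BibTeX record begins on its own line with '@', an entry
# type such as 'article', and an opening '{' or '('.
ENTRY_START = re.compile(rb"^[ \t]*@[A-Za-z]+[ \t]*[{(]")


def count_bibtex_records(path):
    """Count lines that look like the start of a BibTeX record.

    The file is opened in binary mode and read line by line, so memory
    use stays small even for a file of several hundred megabytes.
    """
    count = 0
    with open(path, "rb") as handle:
        for line in handle:
            if ENTRY_START.match(line):
                count += 1
    return count
```

Calling count_bibtex_records("dblp.bib") (a hypothetical file name) returns a count that can be compared with the number each gawk version reports.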
As a consequence, and consistent with the above, gawk fails to extract a
subset of records that are located near the end of the file.
Attached are:
1. Results (as copied and pasted from the command line) from (a) running the
count scripts and (b) extracting the subset [files: count_results.txt and
subset_results.txt]
2. The awk scripts in question
Additional information
-- The file on which the scripts operated contains bibliographic records
in BibTeX format (converted from the XML file that is supplied by the DBLP
project, as downloaded from http://www.vldb.org/ )
-- The scripts were run on two machines with identical results.
Configurations:
OS: Windows XP SP2 (EL) in both
CPU: Pentium M 1.7 GHz | Pentium 4 HT 3.0 GHz
RAM: 1 GB | 2 GB
HDD: 80 GB | 400 GB
-- A bug report has also been submitted to the gnuwin32 project (I could not
find any contact information for the 3.1.4 port). However, I have the feeling
that this behavior is not specific to the Windows ports; hence this bug report.
Kind Regards
George Zarkadas
PS: The original file upon which the scripts acted is not included because
of its size (~55 MB zipped) but will be happily supplied if requested.
count_results.txt
Description: Text document
count_dblp_bib2.awk
Description: Binary data
count_dblp_bib.awk
Description: Binary data
subset_results.txt
Description: Text document
get_vldb_subset.awk
Description: Binary data
get_vldb_subset2.awk
Description: Binary data