Re: [lmi] Unknown fields in table input text files

From: Greg Chicares
Subject: Re: [lmi] Unknown fields in table input text files
Date: Sun, 21 Feb 2016 12:38:51 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Icedove/38.5.0

On 2016-02-20 15:12, Vadim Zeitlin wrote:
> On Sat, 20 Feb 2016 14:33:01 +0000 Greg Chicares <address@hidden> wrote:
> GC> Yes, to the naive reader, it would appear that this file has a novel
> GC> "Editor" record; but in fact it does not. Therefore...
>  Sorry, I'm afraid I didn't explain the problem well at all and wasted your
> time with all these investigations. Let me try to explain it once again:
>  There are 2 formats for the tables. One of them is the binary format used
> in the .dat files. The other one is the text format used for the output of
> --extract table_tool option and the input of its --merge option. There is
> no problem whatsoever with the binary format, the question of this thread
> is about how to handle unknown words followed by colons in the text files
> *only*.
>  E.g. the round-trip test done by table_tool --verify option reads all
> tables from the binary database, converts each of them to text and then
> parses this text as a table and compares that it obtains exactly the same
> table as the original one.

Where you say it "parses this text as a table", do you mean exactly the
same thing as converting the text back into binary format...so that we
have only two concepts, text and binary?

Or are you thinking of the "table" that you compare as the platonic
ideal, which has two incidental projections, one onto a text format, and
the other onto a binary format...neither of which is the real "table"?

I was thinking of it the first way, so that '--verify' converts
  binary format --> text format --> binary format
and the round-trip test is satisfied if the binary input and output
are bit-for-bit identical. If that's the condition tested by '--verify',
then the content of a binary "Comment" (VT) record like
  <VT> [bytecount] "\nContributor: foo\nNonexistent: bar"
must not be parsed as markup for a "Contributor" (EOT) record and
a "Nonexistent" record. If they were, they couldn't complete the round
trip back to binary format.
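
To make that failure concrete, here is a minimal sketch in Python. The
record layout (one type byte, a two-byte little-endian length, then the
content), the two type codes, and every function name are assumptions
made up for illustration; this is not the actual SOA layout or lmi code.

```python
import struct

# Hypothetical record-type codes, for illustration only.
CONTRIBUTOR = 4   # EOT
COMMENT = 11      # VT
NAMES = {CONTRIBUTOR: "Contributor", COMMENT: "Comments"}
CODES = {name: code for code, name in NAMES.items()}

def to_binary(records):
    """Serialize (type, content) pairs: type byte, 2-byte length, content."""
    out = b""
    for rtype, content in records:
        data = content.encode("latin-1")
        out += struct.pack("<BH", rtype, len(data)) + data
    return out

def from_binary(blob):
    """Inverse of to_binary(): the length prefix delimits each record."""
    records, pos = [], 0
    while pos < len(blob):
        rtype, n = struct.unpack_from("<BH", blob, pos)
        pos += 3
        records.append((rtype, blob[pos:pos + n].decode("latin-1")))
        pos += n
    return records

def to_text(records):
    """Write each record as 'Name: content' on its own line(s)."""
    return "".join(f"{NAMES[t]}: {c}\n" for t, c in records)

def naive_from_text(text):
    """Start a new record at every line that begins with a known name.

    Assumes the first line begins with a known name.
    """
    records = []
    for line in text.splitlines():
        name, sep, rest = line.partition(": ")
        if sep and name in CODES:
            records.append((CODES[name], rest))
        else:
            rtype, content = records[-1]
            records[-1] = (rtype, content + "\n" + line)
    return records

original = to_binary([(COMMENT, "\nContributor: foo\nNonexistent: bar")])
text = to_text(from_binary(original))
round_trip = to_binary(naive_from_text(text))
print(round_trip == original)
```

In this sketch the single "Comments" record comes back as two records,
because the embedded "Contributor: foo" line is taken for markup, so the
final comparison fails.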

> The code doing the parsing tries to be rather
> strict, as previously discussed, so it complains if it sees something that
> looks like a field at the beginning of a line but isn't actually a known
> field. We would, presumably, like to prevent this from happening.

Yes, and I think the only way to prevent unrecognized record types is
to accept only recognized record types. In the motivating case:

Comments: These are supposed to represent the expected mortality of pensioners 
the generation born in 1950, updated through 1990-92 census results.
This is from the diskette available with
"The Second Actuarial Study of Mortality in Europe"
Editor: A.S.MacDonald

"Comments:" tags a record; "Editor:" does not, so it must be mere
content in a "Comments" record.

>  And the question is whether we should:
> 1. Just silently ignore all unknown fields ("ignore" means considering them
>    to be part of the value of the previous line).
> 2. Give an error about them (as the code used to behave).
> 3. Give an error about them except if they're in a (hopefully short) list
>    of known non-fields (as the latest version of the code does).

You're asking what to do about "unknown fields" like "Editor:" above.
I'm saying they aren't fields, which implies (1).
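
The three choices can be sketched side by side in Python. Everything
here is hypothetical: the field list is a small illustrative subset, and
the pattern for "looks like a field" is only my guess at what the parser
treats as field-like.

```python
import re

# Illustrative subset of real record names; the true list is longer.
KNOWN_FIELDS = {"Table name", "Contributor", "Comments"}

# A line "looks like a field" if it starts with words followed by a colon.
FIELD_LIKE = re.compile(r"^([A-Za-z][A-Za-z0-9 ]*):\s?(.*)$")

def parse(text, policy=1, non_fields=frozenset()):
    """Parse text into [name, value] pairs.

    policy 1: fold unknown field-like lines into the previous value.
    policy 2: raise an error on any unknown field-like line.
    policy 3: like 2, but tolerate names listed in non_fields.
    """
    fields = []
    for line in text.splitlines():
        m = FIELD_LIKE.match(line)
        name = m.group(1) if m else None
        if name in KNOWN_FIELDS:
            fields.append([name, m.group(2)])
            continue
        if name is not None and policy == 2:
            raise ValueError(f"unknown field '{name}'")
        if name is not None and policy == 3 and name not in non_fields:
            raise ValueError(f"unknown field '{name}'")
        if not fields:
            raise ValueError("content before the first field")
        fields[-1][1] += "\n" + line  # mere content of the previous field
    return fields

sample = (
    "Comments: This is from the diskette available with\n"
    '"The Second Actuarial Study of Mortality in Europe"\n'
    "Editor: A.S.MacDonald"
)
result = parse(sample, policy=1)
```

Under policy 1 the sample yields a single "Comments" field whose value
ends with the "Editor: A.S.MacDonald" line; policy 2 rejects it, and
policy 3 accepts it only if "Editor" is passed in non_fields.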

However, there's a problem. We don't necessarily know the full set of
record-types, i.e., the true "fields", because the SOA may have expanded
the set after publishing that code. The full set can be found only by
examining each binary-format record in every file. If there is a record
type that isn't yet in our list, we can't ascertain that by parsing the
text format, where such a yet-unknown true field is indistinguishable
from "Editor:" above.
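
Such an examination could be as simple as the following Python sketch.
The layout it assumes (one type byte, a two-byte little-endian length,
then the content) and the function names are hypothetical, chosen only
to show the idea; the real format would have to be substituted.

```python
import struct

def record_types(blob):
    """Collect the type code of every record in a binary blob.

    Content is skipped by its length prefix, so bytes inside the
    content that happen to equal a type code are never misread.
    """
    seen = set()
    pos = 0
    while pos + 3 <= len(blob):
        rtype, n = struct.unpack_from("<BH", blob, pos)
        seen.add(rtype)
        pos += 3 + n
    return seen

def unknown_types(blob, known):
    """Return the type codes present in blob but absent from known."""
    return record_types(blob) - set(known)

def make_record(rtype, data):
    """Build one record for the demonstration below."""
    return struct.pack("<BH", rtype, len(data)) + data

# Three records; the first one's content contains bytes 4 and 11.
blob = (
    make_record(11, b"text with \x04 and \x0b inside")
    + make_record(4, b"foo")
    + make_record(99, b"something new")
)
print(unknown_types(blob, known={4, 11}))
```

Run over every file, the union of the results would tell us whether our
list of record types is complete.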

> GC> > GC> (2) Use a regex like /[A-Za-z0-9]* *[A-Za-z0-9]*:/ on the assumption that
> GC> > GC> header names consist of one or two words followed by a colon. Deem any
> GC> > GC> colon that occurs later in the line to be content rather than markup.
> GC> 
> GC> This cannot work. A "Contributor" specified as
> GC>   "\nSource of data:\nTable number:\nContributor:"
> GC> cannot be parsed this way.
>  Sorry, I don't understand this at all. If we have a line starting with
> "Contributor:" in the text input, it will be parsed as contributor. Notice
> that if it is followed immediately by "\n", an error will be given about
> missing contributor value.

The problematic example above can only be an error, which necessarily
prevents the commutativity we hope to find. It is reasonable to hope
that no pathological example like this will occur, but we can't be
sure of that until we test every file.
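
For illustration, here is what the regex quoted above does with that
example, as a small Python sketch. I take every separator in the example
to be a newline, and I assume the text format writes the record as
"Contributor: " followed by its content; both are my assumptions.

```python
import re

# The regex from the earlier message, applied at the start of each line.
HEADER = re.compile(r"[A-Za-z0-9]* *[A-Za-z0-9]*:")

def looks_like_header(line):
    """True if the line begins with one or two words and a colon."""
    return HEADER.match(line) is not None

# Content of a single hypothetical "Contributor" record.
content = "\nSource of data:\nTable number:\nContributor:"
text = "Contributor: " + content

flags = [(line, looks_like_header(line)) for line in text.splitlines()]
for line, flag in flags:
    print(repr(line), flag)
```

Of the four lines, only "Source of data:" (three words) escapes the
pattern; "Table number:" and the trailing "Contributor:" both look like
headers even though all four lines belong to one record.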

> GC> >  Yes, I definitely need to do this to avoid at least the obvious false
> GC> > positives. The trouble with "Editor:" and "WARNING:" is that they're not
> GC> > really obvious, are they.
> GC> 
> GC> Actually, we must not do this. And "Editor:" and "WARNING:" are not
> GC> record titles and do not begin new records. Records are indicated
> GC> by prefixed bytes like EOT and VT.
>  Yes, in binary format. I'm only speaking about _reading_ (not writing)
> text files.

If commutativity holds, then reading a text file must produce the
same effect (the same platonic ideal--the same data in the same data
structure) as reading the corresponding binary file.

> GC> (Therefore, record content must not include those bytes.)
>  Sorry for being pedantic, but this is not really correct, the fields in
> these files are prefixed by their type and length, i.e. the strings can
> contain any bytes, including NUL.

Okay. And, as the example with German text indicates, the contents
are not necessarily ASCII. But if the contents of a field would be
parsed as type and length bytes, then we may have a pathological example.

> GC> >  Would we include "WARNING" in this whitelist?
> GC> 
> GC> No. It's not a record type.
>  Again, I think this answers some other question from the one I had asked
> because there are no records in the text format.

Then "\nWARNING: ..." cannot begin the text format of a real record.

>  FWIW I did include "WARNING" in the list of known not-fields together with
> "Editor" for now just to let qx_ann validate successfully. I can remove it
> from there, of course, or even drop the idea of such whitelist entirely.
> But we'd need some other solution then and this one seems the best to me so
> far.

The whitelist is indispensable. Without a full whitelist, when we
encounter "\nfoo: " while parsing the text format, we cannot say
whether it's the text image of a binary record, or merely part of
the content of some record.
