[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [lmi] Unknown fields in table input text files

From: Vadim Zeitlin
Subject: Re: [lmi] Unknown fields in table input text files
Date: Sun, 21 Feb 2016 16:42:14 +0100

On Sun, 21 Feb 2016 15:09:04 +0000 Greg Chicares <address@hidden> wrote:

GC> On 2016-02-21 13:01, Vadim Zeitlin wrote:
GC> > 
GC> >  It could work like this and I do have such checks in the unit tests too,
GC> > but what really happens for them is
GC> > 
GC> >   binary --> platonic --> text --> platonic --> binary
GC> That's what we need...but...
GC> > The table_tool --verify option avoids the last step because it didn't seem
GC> > very useful and would require creating a temporary file (currently there 
GC> > no support for creating binary representation in memory and I don't see 
GC> > real need to add it), so it does just
GC> > 
GC> >   binary --> platonic --> text --> platonic
GC> > 
GC> > and compares the two C++ objects.
GC> Demonstrating commutativity requires validating all five steps:
GC>   binary --> platonic --> text --> platonic --> binary

 Yes, but the last step has unit tests and I'm relatively confident that
it's not going to fail, i.e. if we successfully read a table from binary
format representation, we will serialize it to exactly the same
representation. I don't think adding the extra tests for this last step are
worth the trouble (especially because, as mentioned before, we don't have a
way to serialize to an in-memory buffer right now, so we'd have to add it
just for this).

GC> The points I keep making that seem confusing relate to the
GC> conversions toward the end of the chain. I'm giving testcases
GC> that may not translate identically (and therefore losslessly)
GC> back to the same binary file we started with.

 What I want to make clear is that the step at which such losses are
most likely to happen is the "text --> platonic" one and the rest of them
(e.g. due to unknown record types) will happen in the "binary --> platonic"
step. It seems extremely unlikely that the conversion in other direction
doesn't give the expected result.

GC> > GC> If that's the condition tested by '--verify',
GC> > GC> then the content of a binary "Comment" (VT) record like
GC> > GC>   <VT> [bytecount] "\nContributor: foo\nNonexistent: bar"
GC> > GC> must not be parsed as markup for a "Contributor" (EOT) record and
GC> > GC> a "Nonexistent" record.
GC> > 
GC> >  Yes, of course, and it is not parsed like this or even "parsed" in any 
GC> > at all. As I said, there is no problem with reading the data in binary
GC> > format, the only questions are in "text --> platonic" step.
GC> Yes, or in the subsequent "[text -->] platonic --> binary" step.

 Sorry, either I misunderstand the above or I disagree. But, to be clear,
no, I don't have any _questions_ about the "platonic --> binary" step,
there is no possible ambiguity in it and while bugs are always possible, it
is covered by the full round trip tests on qx_cso and qx_idx in the unit
test, so I'm relatively confident in it.

GC> I agree. Introducing a typo in a header name when editing the text
GC> file means that the modified text file cannot be merged back into
GC> the binary file as intended. In this introduced-typo case, what
GC> was originally a "Contributor" record would become part of the
GC> content of a "Comments" record. The program you're writing in
GC> principle cannot diagnose this error. It must merge the text as a
GC> "Comments" record. It might issue a warning, which would catch this
GC> user error and would also flag the "Editor:" contents as a possible
GC> mistake.

 OK, so I'll use warning() for such warnings. But the question remains:
should I have a whitelist to avoid warnings for known non-typos? I.e. right
now you would be getting warnings for "Editor:" which is not as bad as
getting an error for it, but still annoying. Is it worth keeping the
(already implemented) whitelist to avoid these warnings?

GC> That's why we need to run the commutativity test against all the
GC> files we have, and inspect the failures. Only that way can we be
GC> sure we have the full whitelist.

 I still don't understand this :-( Whitelist applies to the conversions
from text only. There is no need for the full commutativity test, including
conversion to binary, to test for it. It doesn't come into play at all when
converting to binary format.

GC> Phase I: The program prints a warning, and the round-trip conversion
GC> may fail. Upon manual inspection of warnings and failures, we may
GC> discover undocumented fields. I want to emphasize that I have in fact
GC> discovered at least one undocumented record type.

 In the binary format? This is a very important new fact. But it still
doesn't help with my confusion above because such undocumented record type
must have resulted in an error when reading the database, i.e. at the very
first step. And to fix this we need to add support for this record type,
changing the whitelist is neither necessary nor sufficient for this.

GC> It is never required to inhibit reasonable warnings.

 This seems to answer my question above: if you don't consider warnings,
even about things such as "WARNING:" in the comments, a problem, then we
don't need a whitelist at all. I am not sure if it's a good idea because
the warnings risk being common and hence annoying, but so be it.

GC> It is never permissible to terminate with lawful input.

 The question here is, of course, about what exactly constitutes lawful
input. E.g. the checks on the input file format are rather strict right
now, for example the age must be specified as "  1" (always 3 positions)
and there must be exactly the given number of digits (as specified by the
"Number of decimal places") in all the values. We could easily be more
forgiving and only warn about using "1" for the age and "0.12" for the
value instead of "0.12000" when the number of decimal places is 5. I had
understood that we didn't want to do this, but now I suspect I could have
been wrong about this. Was I?

 If we really need to try to do our best with the input we're given,
instead of only accepting the input in the strictly defined format, then
quite a few changes to the code will be needed... 

GC> Where you write "Give an error" I assume, perhaps incorrectly, that
GC> you mean "...and terminate the program". Thus, I read (2) and (3)
GC> as leading to termination, which is not wanted. A "warning" is
GC> desirable, after which the program proceeds to do the best it can.

 An error means an exception and an exception can be caught and handled, of
course. In table_tool this is not currently done and any exception indeed
results in program termination.

GC> >  Also, I am still not speaking at all about binary files here. There may
GC> > be a problem with unknown record types in them too,
GC> That is a problem that I have actually encountered in the past.
GC> Unknown record types must cause a commutativity test to fail.

 They will, at the very first step. And here, again, "fail" means failing
to create the table entirely.

GC> Print an informational message and continue processing.

 I'll do this as soon as you can confirm that you really don't want to have
a whitelist to avoid warnings about known non-typos.


reply via email to

[Prev in Thread] Current Thread [Next in Thread]