lmi
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [lmi] numeric_io_traits problems, with a patch


From: Vadim Zeitlin
Subject: Re: [lmi] numeric_io_traits problems, with a patch
Date: Mon, 26 Feb 2018 21:40:30 +0100

On Mon, 26 Feb 2018 19:14:09 +0000 Greg Chicares <address@hidden> wrote:

GC> First of all, the '0' case is too complicated: its if() clause can
GC> never fail (if the precondition holds).

 Yes, I hadn't realized that the string had to contain the decimal point
here, sorry. Things can be simpler with this assumption, although it does
look rather dangerous to me to not check for it, especially as failure to
satisfy it can result in difficult to debug memory corruption.

GC>   loop:
GC>     switch(*ri)
GC>         {
GC>         case '0': ++ri; goto loop;
GC>         case '.': ++ri;
GC>         default : ;
GC>         }
GC> 
GC> So it's a simple state machine:

 Sorry, but I'm honestly amazed that you decide to view it like this. It is
correct, of course, but only in the sense that any loop is a state machine.
For me the above is just a loop written in assembly masquerading as C++.

GC>   enter in state A; leave only in state C; possible transitions:
GC>     A --> B --> C
GC>     A --> C

 So it's not even a DFA (A has 2 transitions), which makes the use of FSM
formalism here all the more surprising.

GC> With my remove-the-if change above, does it pass with all those
GC> diagnostic facilities enabled, even with the extra unit tests
GC> below added?
GC> 
GC> +  BOOST_TEST_EQUAL(      "0", simplify_floating_point(     "0.0"));
GC> +  BOOST_TEST_EQUAL(     "-0", simplify_floating_point(    "-0.0"));
GC> +  // THE FOLLOWING TESTS SHOULDN'T PASS; THEY JUST SHOULDN'T SEGFAULT
GC> +    // Tests such as these violate the unasserted precondition that
GC> +    // at least one digit precede the decimal point.
GC> +  BOOST_TEST_EQUAL(      "0", simplify_floating_point(      ".0"));
GC> +  BOOST_TEST_EQUAL(     "-0", simplify_floating_point(     "-.0"));
GC> +    // Try harder to provoke an error:
GC> +  BOOST_TEST_EQUAL(      "0", simplify_floating_point(       "0"));
GC> +  BOOST_TEST_EQUAL(      "0", simplify_floating_point(        ""));

 Yes, they do "pass", i.e. the new tests fail, but there are no errors
reported by address sanitizer nor exceptions.


GC> >  So the real fix would seem to be to add a "break" statement here, but I'm
GC> 
GC> Or remove the 'if' as above, and document how the FSM works.
GC> 
GC> > really tempted to rewrite this code to avoid the very confusing, IMO, use
GC> > of goto with switch, so I've created https://github.com/vadz/lmi/pull/70
GC> > instead. If you disagree with this PR, please let me know and I'll add a
GC> > "break" instead (or a fallthrough attribute if I'm wrong in my analysis
GC> > above).
GC> 
GC> I'm not enthusiastic about PR 70. It seems to use the same logic,
GC> and it just has different barriers to comprehension:
GC>  - original: the switch-and-goto construct
GC>  - PR 70: '++ri' in loop-control-line and also in if-statement,
GC>      and a goto in the guise of 'break'

 I know that saying that something is obvious doesn't make it true, but for
me the version of the code in PR 70 is obviously and immediately readable:
you can see that it's a loop (and you can immediately see that it stops
because the iterator is always incremented in the loop statement) and you
can see that it stops prematurely as soon as we find something different
from '0'. And by "can see" I mean that it's immediately apparent.

 I simply don't believe that anybody can honestly say the same about the
current code in the LHS of https://github.com/vadz/lmi/pull/70/files
Yes, it's not too complicated to understand what it does, but do you really
immediately see what's the loop exit condition is? Or, even, why does it
ever exit at all?

 We can discuss the complexity limit of what can and can't be called
"obvious", of course, but do we really need to discuss whether the RHS is
_more_ obvious than the LHS?


GC> I tried to find a way to write this with std::string functions,
GC> but it looked unnatural every way I could think of. (I wanted
GC> to call something like
GC>   find_last_of(find('.'), end(), ".0")
GC> but string::find() returns an offset, not an iterator; writing
GC> it with std::find() instead, and treating the string as a normal
GC> container, just seemed too weird.)

 With std::string methods I'd write it as

        auto const period = s.find('.');
        LMI_ASSERT( period != std::string::npos );
        if ( s.find_first_not_of("0", period + 1) == std::string::npos )
            {
            s.erase(period);
            }

I'm not sure I like it neither, but I don't think it can be made simpler
than that and it's still reasonably clear.

GC> Then this idea occurred to me:
GC> 
GC> #include "miscellany.hpp"               // rtrim()
GC> inline std::string simplify_floating_point(std::string const& s)
GC> {
GC>     std::string z(s);
GC>     rtrim(z, "0");
GC>     rtrim(z, ".");
GC>     return z;
GC> }

 This is the best version from readability point of view. It's immediately
obvious and pretty short. The drawbacks here are the performance (which
probably doesn't matter, except perhaps it does?) and lack of precondition
checking.

GC> If we want to rewrite it so that it's simple, transparent, and
GC> obviously correct, then I think that's the best way (noting that
GC> its correctness depends on rtrim() being correct). But, now
GC> that I understand the FSM above, the thought of calling rtrim()
GC> makes me feel unclean. We're just doing a
GC>   REPNZ SCASB
GC> with DF pointing backwards, and then backing up over the period,
GC> and we should be able to write that in C, within L1 cache,
GC> without a bunch of object copies and memory reallocations.

 Sure. Let's write it using SSE intrinsics and it will be blazingly fast
(especially if you simplify 8 or 16 floating point numbers in parallel). Do
we care?

 Also note that the backwards loop is never going to be as fast as the
direct one anyhow, so the version using std::string methods above risks
being faster than a hand-written assembly version (of course, it helps that
it doesn't perform any allocations) -- if you manage to measure the
difference between them for the strings of only up to ~20 characters long
at all.

GC> Is there a canonically beautiful way to write FSMs in C or C++,
GC> without unnatural-looking control constructs or gigantic
GC> third-party libraries?

 I don't know about canonical, but the best idea for writing FSMs I know of
is to use std::variant<> for representing the different states as this
makes them compile-time safe. It's hardly lightweight or appropriate for
something as simple as this however.

 Of course, I am still firmly convinced that FSMs are completely
inappropriate for this simple algorithm in any way, shape or form anyhow.
It is naturally formulated in terms of looping over the string and I see
absolutely no reason to implement it in any other way than by writing a
loop. The only question for me is whether it's worth looping backwards or
not: IME reverse iterators are very error-prone and so I'd avoid their use
by writing the loop in the normal direction, but I can see the argument in
favour of using them here too.

 So my criterion for choosing the implementation of this function would be
determined by this simple flowchart:

                          Do we care about
                            performance?
                        /                 \
                       /                   \
                      /                     \
                Yes  /                       \ No
                    /                         \
                   \/                         \/

         Are we afraid of               Use the version with
        reverse iterators?                    rtrim()
              /    \
             /      \
       Yes  /        \  No
           /          \
          /            \
         \/            \/

     Use std::string   Use PR 70
         methods


[...different problem...]
GC> Now it's failing again. I wouldn't want to
GC> 
GC>   if(some_compiler) goto turn_a_blind_eye_yet_again;
GC>     BOOST_TEST_EQUAL(15, floating_point_decimals(0.4503599627370497));
GC>   turn_a_blind_eye_yet_again:
GC> 
GC> without trying really hard to get to the root of the problem.
GC> Might valgrind have its own less-accurate log10l()?

 OK, I'll try really hard to find it then. Considering how long this email
already is, I'll discuss my results (or lack thereof) in a separate one
later -- sorry for bundling these 2 issues together, but I never thought
that PR 70 might be so controversial.

 Regards,
VZ


reply via email to

[Prev in Thread] Current Thread [Next in Thread]