From: Greg Chicares
Subject: [lmi] Measuring MD5 overhead [Was: master 9c510ad 16/22: Measure elapsed time for MD5 data-file validation]
Date: Mon, 6 Apr 2020 15:34:56 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.5.0

On 2020-03-30 15:34, Vadim Zeitlin wrote:
> On Mon, 30 Mar 2020 13:26:42 +0000 Greg Chicares <address@hidden> wrote:
[...]
> GC> Summary:
> GC> 
> GC>   clock
> GC>    time  program used with "maximal" dataset
> GC>   ----- ------------------------------------
> GC>   0.194 wine lmi_md5sum.exe [i686]
> GC>   0.190 wine lmi_md5sum.exe [x86_64]
> GC>   0.200 wine md5sum.exe [i686 binary from the following URL]
> GC>     https://github.com/vadz/lmi/releases/download/new-cygwin-makefiles/md5sum.exe
> GC>   0.085 debian native md5sum
> 
>  FWIW, under native MSW 7 eraseme_lmi_md5sum.exe in the archive you
> provided is slightly but noticeably faster than Cygwin md5sum.exe: the zsh
> command line
> 
>       % time (repeat 10 $MD5SUM *.xyz | md5sum)
> 
> takes 1.2s with MD5SUM=./eraseme_lmi_md5sum.exe compared to 1.7s with
> MD5SUM=md5sum. Surprisingly, even though it's a faster machine with an SSD,
> it's still much slower than under a slower Linux machine, where it takes
> 0.61s
> with both standard md5sum and lmi_md5sum (built with -O2, it's much slower
> without optimizations, of course). This might say more about the
> comparative speed of Linux and MSW IO subsystems than anything else
> however.

Here's what I see on my reasonably fast debian machine with an SSD:

/opt/lmi/data[0]$MD5SUM=./eraseme_lmi_md5sum.exe               
/opt/lmi/data[0]$time (repeat 10 wine $MD5SUM *.xyz >/dev/null)
( repeat 10; do; wine $MD5SUM *.xyz > /dev/null; done; )  1.08s user 0.63s system 86% cpu 1.966 total

/opt/lmi/data[0]$MD5SUM=md5sum                                 
/opt/lmi/data[0]$time (repeat 10 $MD5SUM *.xyz >/dev/null) 
( repeat 10; do; $MD5SUM *.xyz > /dev/null; done; )  0.71s user 0.14s system 100% cpu 0.849 total

The first measurement runs our own 32-bit msw binary under 'wine'.
The second runs debian's 64-bit native binary. The first thus takes
a little more than twice as long; to what should we attribute that?
This additional pair of measurements, for a tiny dataset...

/opt/lmi/data[0]$MD5SUM=./eraseme_lmi_md5sum.exe
/opt/lmi/data[0]$time (repeat 10 wine $MD5SUM expiry >/dev/null)
( repeat 10; do; wine $MD5SUM expiry > /dev/null; done; )  0.19s user 0.42s system 83% cpu 0.729 total
/opt/lmi/data[0]$MD5SUM=md5sum                                  
/opt/lmi/data[0]$time (repeat 10 $MD5SUM expiry >/dev/null) 
( repeat 10; do; $MD5SUM expiry > /dev/null; done; )  0.01s user 0.01s system 101% cpu 0.020 total

...suggests that two-thirds of the difference is 'wine' startup time:

  1.966 - 0.849 = 1.117
  0.729 - 0.020 = 0.709

  0.709 / 1.117 = 63%

which I incur whenever I run 'wine /path/to/any_program.exe', but
which end users, on native msw, never incurred. And of course the
md5sum validation is now internal, so we no longer need to shell out
to an external program.
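
For reference, computing a digest in-process takes only a few lines.
Here's a minimal sketch, assuming the md5_stream() interface from the
GNU md5 code that lmi bundles (the exact header name is an assumption);
it digests a whole file and formats the sixteen-byte result as hex:

  #include "md5.hpp" // lmi's bundled GNU md5 code; provides md5_stream()

  #include <cstdio>
  #include <stdexcept>
  #include <string>

  // Digest one file in-process--no external md5sum program needed.
  std::string md5_hex(std::string const& filename)
  {
      std::FILE* f = std::fopen(filename.c_str(), "rb");
      if(!f) throw std::runtime_error(filename + ": cannot open");
      unsigned char digest[16];
      int const rc = md5_stream(f, digest); // writes 16 bytes on success
      std::fclose(f);
      if(rc) throw std::runtime_error(filename + ": md5_stream failed");
      char hex[33];
      for(int i = 0; i < 16; ++i)
          std::snprintf(hex + 2 * i, 3, "%02x", static_cast<unsigned>(digest[i]));
      return std::string(hex, 32);
  }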

The immediate practical question is this: Now that md5sum validation
is internal and faster, is it fast enough that we should revalidate
all data files every time a report such as a PDF file is generated?
That would resolve this shortcoming, documented in 'authenticity.cpp':

  // TODO ?? Known security hole: data files can be modified after they
  // have been validated.

and I'd like to know how much relative performance would be impaired.
Thus, 'git show 77626be8dc06':

    first use one of these commands, then the other
      wine ./lmi_wx_shared --data_path=/opt/lmi/data
      wine ./lmi_wx_shared --data_path=/opt/lmi/data --pyx=measure_md5
    to run some scenario like
      File | New | Illustration
      OK
      File | Print to PDF
    and compare the elapsed time shown on the statusbar, to see the cost
    of reauthenticating before generating each PDF.
    
    On my machine, running under 'wine', I see:
      341 msec without '--pyx=measure_md5'
      405 msec  with   '--pyx=measure_md5'
    and (405-341)/341 is about a twenty-percent penalty.

On the other hand, Kim sees no noticeable penalty, running lmi
under native msw on a typical underpowered corporate laptop
(with a typical dataset, which might be one-third the size of
the '*.xyz' one used above):

  # first trial
Without '--pyx=measure_md5', Output: 451 milliseconds
With '--pyx=measure_md5', Output: 448 milliseconds

  # second trial
Without '--pyx=measure_md5', Output: 452 milliseconds
With '--pyx=measure_md5', Output: 438 milliseconds

Could I ask you to do the same (using native msw and the '*.xyz'
dataset) and report your results here? If your results roughly
agree with Kim's, then we should probably just resolve the
"TODO ??" issue above by inhibiting the md5sum validation cache
and revalidating before producing every report.

>  I thought we could decrease the time further by running several processes
> in parallel, but this doesn't help -- it looks like the overhead of
> launching a process is too high for such small tasks, even under Linux, and
> even if I use just 8 (== number of cores) processes in total. I'd like to
> explore using several threads for executing this in parallel inside a
> single process, normally this should result in a noticeable gain and this
> is supposed to be simple to do in modern C++, in theory (but things have an
> annoying tendency to work somewhat differently in practice, so we'll see).

It's worth looking for a way to make validation faster, because
that might
 - reduce the revalidation penalty, making the decision above easier; and
 - perhaps even reduce lmi's startup time (regardless of how the decision
   above is made, always validating all data files at least once, at
   startup, has some nonzero cost).

I would guess that threading won't help much, because the time it takes
to (re)validate a file is probably dominated by file I/O. But go ahead
and try that if you like, because my guess may be wrong.
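
If threads do turn out to help, the simplest shape is probably one
task per file. A minimal sketch--not lmi code; it reuses the md5_hex()
helper sketched earlier and plain std::async:

  #include <cstddef>
  #include <future>
  #include <string>
  #include <utility>
  #include <vector>

  // Hash many files concurrently, one task per file. Tasks share
  // nothing, so no synchronization is needed; get() propagates any
  // exception thrown inside a task.
  std::vector<std::pair<std::string,std::string>>
  md5_all(std::vector<std::string> const& files)
  {
      std::vector<std::future<std::string>> tasks;
      for(auto const& f : files)
          tasks.push_back(std::async(std::launch::async, [f] {return md5_hex(f);}));
      std::vector<std::pair<std::string,std::string>> digests;
      for(std::size_t i = 0; i < files.size(); ++i)
          digests.emplace_back(files[i], tasks[i].get());
      return digests;
  }

If that shows no gain over a serial loop, that would tend to confirm
that the work is I/O-bound.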

Here's another idea: reduce the amount of file I/O by redesigning the
revalidation code. Running an illustration requires accessing certain files:

 - 'tables.{dat,ndx}', which are in a binary format that end users cannot
   readily modify--so it seems adequate to validate those only at startup;

 - 'whatever_product.{database,funds,policy,rounding,strata}', which most
   end users can modify using the product editor; and

 - '*.mst', which anyone can modify using a text editor...which is why we
   actually distribute "ROT256" versions named '*.xst' (it's inversion
   rather than rotation, but there's no standard short name for that,
   though we might say "bytewise 256s' complement"; see the sketch after
   this list). That obfuscation is inconvenient for Kim and me; and the
   original '*.mst' contents don't contain any trade secrets, so we don't
   care if anyone can read them--we just don't want anyone to modify them.
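
For concreteness, the transform itself is a one-liner. Here's a sketch
that takes the "bytewise 256s' complement" description literally--an
assumption; the actual mst-to-xst conversion may differ in detail. The
map is an involution, so one function both obfuscates '*.mst' and
recovers it from '*.xst':

  #include <cstddef>

  // Replace each byte b with (256 - b) % 256, per the "bytewise 256s'
  // complement" description above. Applying this twice restores the
  // original, so the same routine serves for both directions.
  // (Assumption: lmi's real conversion tool may differ.)
  void rot256(unsigned char* data, std::size_t length)
  {
      for(std::size_t i = 0; i < length; ++i)
          data[i] = static_cast<unsigned char>(256 - data[i]);
  }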

For MD5 revalidation to provide security, we must perform it whenever data
flows through some chokepoint, which can be any of these:

(1) Each invocation of Authenticity::Assay() (if we inhibit its caching).
This brute-force approach (revalidate every file that could possibly be
used) was the only simple option when we were invoking an external md5sum
program, but it's needlessly slow.

(2)(a) Each XML file-read operation that goes through libxml2, if we use
some compression algorithm (which provides sufficient opacity to inhibit
casual users). But this works only for XML files, and we found practical
difficulties with both libz and liblzma when we tried using them.

(2)(b) Each MST file-read operation that goes through "ROT256". Thus,
(2)(a) and (2)(b) together cover all the files we need to care about.

(3) Use 'cache_file_reads.hpp' for all the data files listed above.
We already use it for '*.database', for reasons of performance. Using
it for all data files would presumably make lmi faster (a pure win,
in and of itself), and it would also introduce a convenient chokepoint
that we could use for (re)validation. Its documentation says:
  /// For each filename, the cache stores one instance, which is
  /// replaced by reloading the file if its write time has changed.
It would be too harsh to prohibit all changes to all of these files
(then the product editor wouldn't be usable), but we could figure out
what to do in each case, e.g., prohibit using modified '*.mst' files
without '--ash_nazg', or '*.{database,policy,...}' without '--mellon'.
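
To make (3) concrete, here's a rough sketch of the shape I have in
mind--hypothetical names throughout, and the real 'cache_file_reads.hpp'
differs in detail. One cached instance per filename, reloaded only when
the write time changes, with MD5 validation performed only when a
(re)load actually happens:

  #include <filesystem>
  #include <map>
  #include <memory>
  #include <stdexcept>
  #include <string>

  // Hypothetical stand-ins for the real machinery:
  bool is_valid_md5(std::string const& filename);        // consults 'validated.md5'
  template<typename T> T load_file(std::string const&);  // parses one data file

  template<typename T>
  class validating_cache
  {
      struct entry
      {
          std::filesystem::file_time_type write_time;
          std::shared_ptr<T const>        data;
      };
      std::map<std::string,entry> cache_;

    public:
      std::shared_ptr<T const> retrieve(std::string const& filename)
      {
          auto const t = std::filesystem::last_write_time(filename);
          auto const i = cache_.find(filename);
          if(i != cache_.end() && t == i->second.write_time)
              return i->second.data;  // unchanged: no file I/O, no MD5
          // First use, or timestamp changed: validate, then (re)load.
          if(!is_valid_md5(filename))
              throw std::runtime_error(filename + ": failed MD5 validation");
          entry e{t, std::make_shared<T const>(load_file<T>(filename))};
          cache_[filename] = e;
          return e.data;
      }
  };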

What do you think of (3)? I see numerous advantages. It could replace
{libz,liblzma} for '*.{database,policy,...}' with an alternative that
(unlike those compression libraries) could work well in practice.
Any file not yet in the cache could be MD5-validated only on first use
(and the 'validated.md5' file itself could be cached). Otherwise, no
MD5 recalculation would ever be needed (we could certainly presume
that a file's contents have changed iff its timestamp has). Startup
time for lmi could be reduced substantially, because files that are
never accessed would never be validated. This would take some extra
work, but I imagine it might interest you.

Of course, we could combine (3) with (2)({a,b}): instead of using
libxml2's attractive (but imperfectly integrated) decompression, we
could perform decompression ourselves at the (3) chokepoint.

