Re: [lmi] Measuring MD5 overhead

From:	Greg Chicares
Subject:	Re: [lmi] Measuring MD5 overhead
Date:	Tue, 7 Apr 2020 15:16:20 +0000
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.5.0

On 2020-04-06 18:19, Vadim Zeitlin wrote:
> On Mon, 6 Apr 2020 15:34:56 +0000 Greg Chicares <address@hidden> wrote:

[Snipped discussion: the observed slowdown that results from revalidating
all data files before each PDF file is generated varies tremendously:
  20% (Greg)
  negligible (Kim)
  100% (Vadim)]
(Your number is especially high because you're using a set of data files
that's several times as large.) It seems safe to conclude that some end
users will observe a noticeable slowdown. Thus, the solution to this issue:

  // TODO ?? Known security hole: data files can be modified after they
  // have been validated.

should not be to remove the caching in Authenticity::Assay() that causes
files to be validated only at startup rather than when each PDF report
is generated.

> GC> Here's another idea: reduce the amount of file I/O by redesigning the
> GC> revalidation code. Running an illustration requires accessing certain
> GC> files:
> GC> 
> GC>  - 'tables.{dat,ndx}', which are in a binary format that end users cannot
> GC>    readily modify--so it seems adequate to validate those only at startup;
> 
>  Do we need to do it in order to immediately detect any tampering or
> corruption? Or was this just the simplest way to do it and we wouldn't lose
> anything if we postpone validating them until their first use?

The original code was just the simplest way to use an external program
for MD5 validation. I.e., we didn't want to parse out individual
lines in 'validated.md5' and then shell out to some MD5 '.exe' to
validate files piecemeal at various times.
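
For concreteness, parsing out individual lines might look roughly like
the sketch below. It assumes that 'validated.md5' uses the ordinary
md5sum output format (32 lowercase hex digits, a space, a mode character,
then the file name); parse_md5_manifest() is a hypothetical name, not
anything in lmi.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper, not lmi code: parse md5sum-style text into a
// map from file name to hex digest. Assumes lowercase hex digits.
std::map<std::string, std::string> parse_md5_manifest(std::istream& is)
{
    std::map<std::string, std::string> digests;
    std::string line;
    while(std::getline(is, line))
        {
        if(line.empty())
            continue;
        // Layout: 32-digit digest, one space, a mode character
        // (' ' for text or '*' for binary), then the file name.
        if(line.size() < 35 || ' ' != line[32])
            throw std::runtime_error("Malformed line: " + line);
        std::string const digest = line.substr(0, 32);
        assert(32 == digest.size());
        if(std::string::npos != digest.find_first_not_of("0123456789abcdef"))
            throw std::runtime_error("Malformed digest: " + line);
        digests[line.substr(34)] = digest;
        }
    return digests;
}
```

With such a map in memory, any single file could be checked against its
expected digest without rereading the whole manifest.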

As for the actuarial tables in 'tables.*': End users would find it
infeasible to modify those files, but they could copy over older
versions--so these files do need to be validated. And lmi can't
really do anything without accessing these files, so validating
them on startup is good enough--I see no good reason to delay that
until they're first used.

> GC>  - 'whatever_product.{database,funds,policy,rounding,strata}', which most
> GC>    end users can modify using the product editor;
> 
>  Sorry, but it's my day of stupid questions today: if they can be modified,
> how does it work with the existing validation schema? AFAICS, wrap_fardel
> target includes all these files in validated.md5, so changing them would
> result in a validation failure during the next run. What am I missing?

Today, end users can do this:
 - start lmi (thus validating all files)
 - use the product editor (or a text editor) to modify some product
 - produce PDF illustrations using that modified product
because files are validated only at startup. That's the reason for
this comment:

  // TODO ?? Known security hole: data files can be modified after they
  // have been validated.

A validation error would subsequently occur, true; but it would occur
only after exiting and restarting lmi. Thus, we address the validation
with a very large hammer, with which we don't strike at the ideal time.

> GC> For MD5 revalidation to provide security, we must perform it whenever data
> GC> flows through some chokepoint, which can be any of these:
> GC> 
> GC> (1) Each invocation of Authenticity::Assay() (if we inhibit its caching).
> GC> This brute-force approach (revalidate every file that could possibly be
> GC> used) was the only simple option when we were invoking an external md5sum
> GC> program, but it's needlessly slow.
> GC> 
> GC> (2)(a) Each XML file-read operation that goes through libxml2, if we use
> GC> some compression algorithm (which provides sufficient opacity to inhibit
> GC> casual users). But this works only for XML files, and we found practical
> GC> difficulties with both libz and liblzma when we tried using them.
> 
>  I don't even remember what these problems were, but I definitely agree
> that keeping the files in their original text form would be preferable.

IIRC, they were all build-time problems: libz was hard to build at all
because it's so poorly autotoolized; liblzma itself built readily, but
libxml2's makefiles didn't readily detect it.

Also IIRC, libxml2 is a chokepoint for product files only, but '.mst'
files are read by a different means.
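
A chokepoint that doesn't depend on libxml2 could be sketched as a reader
that checks the bytes it has just read, so that nothing can change between
validation and use. Everything below is hypothetical: validating_reader is
not lmi's Authenticity class, and the digest function is passed in as a
parameter, so the sketch assumes nothing about how MD5 is computed.

```cpp
#include <cassert>
#include <fstream>
#include <functional>
#include <iterator>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical names throughout; this is an illustration, not lmi code.
using digest_function = std::function<std::string(std::string const&)>;

class validating_reader
{
  public:
    validating_reader
        (std::map<std::string, std::string> expected
        ,digest_function                    digest
        )
        :expected_ {std::move(expected)}
        ,digest_   {std::move(digest)}
    {
        assert(digest_);
    }

    // Read the whole file, then validate exactly the bytes that were
    // read; throw if the file is unlisted, unreadable, or altered.
    std::string read(std::string const& filename) const
    {
        auto const i = expected_.find(filename);
        if(expected_.end() == i)
            throw std::runtime_error("Not in manifest: " + filename);
        std::ifstream ifs(filename, std::ios_base::binary);
        if(!ifs)
            throw std::runtime_error("Unable to read: " + filename);
        std::string contents
            ((std::istreambuf_iterator<char>(ifs))
            ,std::istreambuf_iterator<char>()
            );
        if(digest_(contents) != i->second)
            throw std::runtime_error("Validation failed: " + filename);
        return contents;
    }

  private:
    std::map<std::string, std::string> expected_;
    digest_function                    digest_;
};
```

The cost is one digest computation per read, which is why combining this
with a cache (so that unchanged files aren't reread) seems attractive.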

>  I didn't even know about file_cache class existence until today, but it
> does look like a neat solution to the problem. I'll have to think a bit
> more about what would be the best way to integrate it with validation, but
> it certainly should be doable.

That code entered lmi here:

commit 3085b77c86dce36685a646d387d920f577c04abe
Author: Gregory W. Chicares <address@hidden>
Date:   2016-07-28T23:21:57+00:00

    Add a reusable caching class for data expensively read from disk (VS)

    See:
      http://lists.nongnu.org/archive/html/lmi/2016-07/msg00046.html

It's awesome, but I can't claim any credit for it.
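
To illustrate how validation might be tied to such a cache, here is a
rough sketch. It is not the code from that commit: revalidating_cache and
its loader parameter are hypothetical names, and it assumes that comparing
file write times is an acceptable way to detect modification (it would
miss a change that preserves the timestamp).

```cpp
#include <cassert>
#include <chrono>
#include <filesystem>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical sketch; lmi's real caching class may differ in every detail.
template<typename T>
class revalidating_cache
{
  public:
    using loader = std::function<std::shared_ptr<T>(std::string const&)>;

    explicit revalidating_cache(loader load)
        :load_ {std::move(load)}
    {
        assert(load_);
    }

    // Return the cached object unless the file's write time has changed
    // since it was loaded; otherwise call the loader again. The loader
    // is where validation would go, so it runs once per file version.
    std::shared_ptr<T> retrieve_or_reload(std::string const& filename)
    {
        auto const write_time = std::filesystem::last_write_time(filename);
        auto const i = cache_.find(filename);
        if(cache_.end() != i && write_time == i->second.write_time)
            return i->second.data;
        record r {load_(filename), write_time};
        cache_.insert_or_assign(filename, r);
        return r.data;
    }

  private:
    struct record
    {
        std::shared_ptr<T>              data;
        std::filesystem::file_time_type write_time;
    };

    std::map<std::string, record> cache_;
    loader                        load_;
};
```

In this arrangement an unchanged file costs only a timestamp query per
use, while a file modified after startup is reloaded, and therefore
revalidated, before it's used again.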
