lmi
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[lmi] [PATCH] CensusView::Update() optimizations


From: Vadim Zeitlin
Subject: [lmi] [PATCH] CensusView::Update() optimizations
Date: Sat, 28 Feb 2015 02:04:05 +0100
Date: Fri, 27 Feb 2015 15:05:34 +0100

 Hello,

 This is another long message because I've been writing it on and off for a
couple of months, but it can be summarized like this:

1. Please apply the patch 1 to speed up column updating by a factor of ~4
   in the variable width mode.

2. Please apply the patches 2-5 to speed this up by a factor of ~2 in the
   fixed width mode in addition (i.e. variable width is ~8 times faster).
   These patches should be completely safe to apply.

3. Please apply the patch 6 if we can assume that all Input cells are
   homogeneous, i.e. have the same members, as I believe they do. This
   brings the gains to ~10 times for the fixed width and ~25 times for
   the variable width case.

4. It is still not clear if Update() is fast enough even after all these
   changes to be called after every cell change or not. This depends on the
   typical census size and speed of the machines on which lmi is used.

 Full details and explanations follow below.


On Sun, 02 Nov 2014 01:40:21 +0000 Greg Chicares <address@hidden> wrote:

GC> On 2014-11-01 23:57Z, Vadim Zeitlin wrote:
GC> > On Sat, 01 Nov 2014 22:05:38 +0000 Greg Chicares <address@hidden> wrote:
GC> > 
GC> > GC> For a given column in the census manager, suppose that all cells
GC> > GC> have the same value, except for one differing cell. If we edit
GC> > GC> that differing cell and remove that difference, then all cells
GC> > GC> are now the same, and the column is no longer to be shown. When
GC> > GC> is the display to be thus updated? IOW, when should the function
GC> > GC> that updates the display, CensusView::Update(), be called?
GC> > 
GC> >  I think the best solution for this would involve doing the work done by
GC> > update_class_names() on a separate thread and then requesting that the 
main
GC> > thread updates the columns only if really necessary -- which should be a
GC> > relatively rare occurrence.
GC> 
GC> That idea has two parts:
GC> 
GC> (1) Threads: I'm reluctant to go there at this time, fearing a host
GC> of problems.
GC> 
GC> (2) Update columns only if necessary: yes, I'd like to do that.

 The first attached patch does exactly this and, as expected, it does have
a significant effect when switching to the "Varying column widths" mode.
With MSVC 12 release build I observed a 45% gain for the first switch and
90% gain for the subsequent ones (again, this is not surprising: before,
the column widths cache was cleared every time as the columns were
recreated, now that it is not cleared any more the subsequent column widths
recalculations reuse it and so are significantly faster).

 Unfortunately the gains are less impressive with gcc 3.4.5: there we gain
only ~40% for the first switch and ~75% for the subsequent ones. I suspect
this is due to poorer quality of optimization in this ancient version of
gcc, as the code executed inside lmi itself seems to take a greater
proportion of time compared to the time spent in Windows functions
themselves (which is the same for MSVC and gcc, of course), so one can hope
that upgrading to a newer gcc version might improve things for it.

 Nevertheless, this patch doesn't have any drawbacks and brings the time to
switch to "Varying widths" mode down from ~5.5s to ~3.4s for the first time
and ~1.4s for the subsequent ones, i.e. is pretty noticeable for the user
and so should be applied.

[As an aside, all the benchmarking were done by taking the second smallest
 measurement out of 20 runs. The rationale for taking this number instead
 of the average one is that on a not quite idle machine the time can be
 only increased by the other processes executing on it and so the smallest
 time is more representative of the intrinsic speed of the code. And the
 reason for not taking the smallest one just to exclude any suspicious
 outliers (but in practice the smallest and second smallest one were always
 very close or even identical anyhow).]


GC> > GC> But there has to be a way to trigger Update() aside from the
GC> > GC> triggers documented above, so I asked myself what a reasonably
GC> > GC> ingenious user would try. The answer seemed obvious:
GC> > GC>   
http://svn.savannah.nongnu.org/viewvc?view=rev&root=lmi&revision=6016
GC> > 
GC> >  This is indeed ingenious, but I am pretty sure fully automatic solution 
is
GC> > achievable as well. Definitely so if you're prepared to use a background
GC> > thread and perhaps even without it.
GC> > 
GC> >  Please let me know if it's worth investigating this further,
GC> 
GC> Yes, please.

 I spent quite some on profiling this with MSVC-specific tools, as they
are rather more convenient than gcov and definitely much faster to use, and
then optimizing it and checking the results under both MSVC and gcc and
I give the brief summary of this long process below. If you're interested
in all the exact numbers, I made them available on

https://docs.google.com/spreadsheets/d/12bXWDT3SO9hMOLkFjeRRhqLi8gsGXWBUlzR0-ZpmnCM/

The three sets of benchmarks correspond to the MSVC 12 release build of lmi
running natively, the same binary running inside a VM and the binary built
with gcc 3.4.5 also running inside the VM. The "delta" columns correspond
to the gain compared to the previous row, the baseline Update() time in the
fixed widths case and the baseline Update() time in the variable widths
case respectively. The MSVC VM numbers also have an extra column measuring
the slowdown compared to the native version, i.e. the VM overhead. As you
can see, it varies a lot: from just 102% (i.e. almost no slowdown at all,
maybe because VMware paravirtualizes the Windows text measuring APIs where
most of the time is spent in this line?) for the variable width case
baseline to ~170% in the other cases (I have no real explanation for this).
Finally, the gcc numbers have an extra column measuring the slowdown
compared to the MSVC build running inside the same VM. Assuming that more
recent g++ versions optimize code comparably to MSVC (and I believe this is
roughly speaking the case), this could be seen as an estimation of the
potential performance gains from upgrading to gcc4/gcc5.


 Anyhow, here is the promised "brief" summary: to begin with, I measured
the time of Update() execution using the same methodology as above (i.e.
second smallest of 20 runs), which went into the two first rows of the
spreadsheet above. Then I profiled it and saw that its execution time was
completely dominated by update_visible_columns(), which took 98% of its
running time, and 99.95% (this is a measurement, not an exaggeration) of
its own running time was spent in column_value_varies_across_cells().

 The patch from above helps a little with Update() speed even in the fixed
widths case, but the gain is small, of order of 6-7%. Looking at the
function, there was one obvious micro-optimization to do in it, which was
to move an invariant value out of the loop, as this is clearly not done by
neither MSVC (which gains 23% outside of VM and 16% inside it after
applying this patch) nor gcc (which gains 14%), so the second patch is
definitely worth applying too.

 Unfortunately even with both of these patches applied, Update() still took
355ms in the native MSVC build on my relatively fast machine and ~1100ms
for the gcc build inside the VM which is probably much closer to the
average client machine performance than my workstation. So even though we
gained between 20 and 30%, the function is still much too slow to be called
implicitly in response to the user action, e.g. after deleting a cell, as
such operations are supposed to not take more than 100ms, per the usual
rule of thumb.

 So I had no choice but to start looking outside of the CensusView code
and, notably, at MemberSymbolTable and any_member<> itself. My first change
here is the third patch and is not very interesting from performance
perspective, but in addition to speeding up the code a little, it also
makes it simpler, so I think it should still be applied. The patch just
removes a redundant field, which was stored in the any_member<> itself as
well as in its content_ field.

 The next optimization concerns MemberSymbolTable: it uses std::map<> which
is the best general purpose container for the mix of insert and lookup
operations but not at all the best choice for the use pattern of this class
in which the items are inserted in it only once and then looked up many
times. The theoretically best container for such use is a sorted vector and
this is what the patch #5 does (it relies on the previously posted patch
defining a working swap() for any_member, which I include as the patch #4
in this series for completeness). I was pretty upset with the effect of
this patch with MSVC where the gain is just 2%, but gcc gains almost 30%
with it. This is probably more due to the poor efficiency of its std::map
implementation than to anything else but it's still good news and brings
the cumulative gain to almost 50% compared to the baseline. But Update() is
still too slow.

 As an aside, I also experimented with C++11 std::unordered_map<> (which I
could only test in the MSVC build, of course, as gcc 3.4 doesn't have it).
Unlike the patch #5, this was a trivial change, I basically just replaced
std::map with std::unordered_map and added an #include line without
changing anything else -- but this was enough to gain more than 20%. It
turns out that the sorted vector is still better, see below, but this does
show that unordered_map<> is a much better choice than map<> for such use
pattern.

 I also have to mention that before writing my own sorted vector, I tried
using boost::container::flat_map<> which also resulted in ~20% gain
compared with std::map. Unfortunately, flat_map is not in Boost 1.47 used
by lmi and even though I could reuse its (header-only) implementation, it
relies on the map elements being either copy-assignable or move-assignable
and any_member<> is not copy-assignable as discussed at length in the other
thread, so I had to use C++11 move assignment operator to make lmi work
with it. And because of this reliance on C++11 support I couldn't use this
class, finally. But the annoying thing is that it is so much faster than my
own sorted vector and I have no idea why. I thought that this could be due
to a better memory access locality as flat_map stores the keys and the
values together while I store the keys in the (already existing) separate
vector, so I tried doing the same thing and also storing the pairs of
(name, any_member) in the vector instead of just the latter, but this only
made my version a very tiny bit slower. I also saw that flat_map used its
own binary search implementation instead of std::lower_bound() so I checked
if I could be paying a penalty for using the standard function, but
replacing it with the hand written code gained me absolutely nothing. So
this remains a mystery and I'll try to return to this later because this
20% discrepancy between the very similar implementations definitely must
have some explanation. But for now I'm out of ideas...

 And we're getting to the end of the story because with the vector-based
implementation of MemberSymbolTable I could finally apply the optimization
which I couldn't use before but which is clearly very helpful: the loop in
column_value_varies_across_cells() currently searches for the member with
the given name on each loop iteration, but all cells in Input have the same
structure (please correct me if I'm wrong, this assumption is absolutely
vital) and so we can compute the member offset only once, outside the loop,
now and just reuse it later. This speeds things up by 70-80%, i.e. we get
up to 5-fold improvement. And the best thing is that this time the
improvement is greatest for the gcc build and brings it almost to the same
speed as MSVC. I have no idea why does gcc profit from this so much more
than MSVC but then I'm quite convinced that if you don't obtain some
completely unexpected results when doing benchmarks, you must be doing
something wrong -- so at least it's reassuring from this point of view.


 The conclusion is that the final version is more than 10 times faster than
the original one in the fixed width case and more than 25 times faster in
the variable width one, when using gcc. Unfortunately this may still not be
quite enough to start calling Update() in response to the user actions as
it is still not really under the 100ms barrier and in interactive testing
you easily feel the sluggishness with the big census.

 I don't know if the ~4000 rows in this census are the (realistic) upper
limit or more of an average. In the latter case we definitely still can't
afford making unconditional Update() calls but in the former it might just
pass. Optimizing Update() further is going to be difficult, the execution
time is now spread pretty evenly among various simple functions, notably
all sorts of comparison operators which risk to be difficult to optimize,
e.g. tn_range<> one takes almost 10% of Update() time but it's just a
simple comparison. My best hope probably remains to find out what does
flat_map do differently to achieve so much better performance.

 Please let me know if you have any questions about this work or the
patches and I hope they can be applied because I think lmi users should
appreciate them.

 Thanks in advance,
VZ

Attachment: 0001-Only-recreate-census-view-columns-if-really-necessar.patch
Description: Text document

Attachment: 0002-Micro-optimization-in-CensusView-column_value_varies.patch
Description: Text document

Attachment: 0003-Remove-unused-any_member-object_-field.patch
Description: Text document

Attachment: 0004-Define-global-swap-for-any_member.patch
Description: Text document

Attachment: 0005-Use-a-sorted-vector-instead-of-std-map-in-MemberSymb.patch
Description: Text document

Attachment: 0006-Allow-computing-MemberSymbolTable-member-access-inde.patch
Description: Text document


reply via email to

[Prev in Thread] Current Thread [Next in Thread]