Re: [DISCUSSION] FILE* slowness


From: apache2
Subject: Re: [DISCUSSION] FILE* slowness
Date: Sun, 17 Jul 2022 11:06:15 +0200
User-agent: Mutt/1.9.3 (2018-01-21)

OK, so here is some more analysis.

I started out with the "stupid" implementation that mapped
(ios_dev_file_pread) and (ios_dev_file_pwrite) directly to
(pread) and (pwrite).
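
In essence (a sketch; the exact iod prototype and the IOD_* error codes
are from memory and may differ from what is actually in the tree):

  #include <unistd.h>    /* pread */
  #include <stdint.h>

  static int
  ios_dev_file_pread (void *iod, void *buf, size_t count, uint64_t offset)
  {
    struct ios_dev_file *fio = iod;   /* handle carrying the fd */
    ssize_t nread = pread (fio->fd, buf, count, (off_t) offset);

    return (nread >= 0 && (size_t) nread == count) ? IOD_OK : IOD_ERROR;
  }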

This was, not totally unexpectedly, slower than our current
FILE*-based code with the ftello patch applied, presumably because
`lseek(SEEK_CUR)` is faster than `pread64`.

Then I added a buffer. That made things look a little bit better.
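
The scheme is roughly the following (a sketch with illustrative names;
the attached patch is the authoritative version, and this sketch assumes
count never exceeds the buffer size and glosses over EOF/short reads):

  #include <string.h>   /* memcpy */

  #define IOD_BUF_SIZE 4096   /* or 8192; see the numbers below */

  struct ios_dev_file
  {
    int fd;
    uint8_t buf[IOD_BUF_SIZE];
    uint64_t buf_offset;   /* file offset of buf[0] */
    size_t buf_valid;      /* valid bytes in buf; 0 = empty/invalidated */
  };

  static int
  ios_dev_file_pread (void *iod, void *dst, size_t count, uint64_t offset)
  {
    struct ios_dev_file *fio = iod;

    /* Serve from the buffer when the requested range is fully cached.  */
    if (fio->buf_valid > 0
        && offset >= fio->buf_offset
        && offset + count <= fio->buf_offset + fio->buf_valid)
      {
        memcpy (dst, fio->buf + (offset - fio->buf_offset), count);
        return IOD_OK;
      }

    /* Miss: refill the buffer starting at the requested offset.  */
    ssize_t nread = pread (fio->fd, fio->buf, IOD_BUF_SIZE, (off_t) offset);
    if (nread < 0 || (size_t) nread < count)
      return IOD_ERROR;

    fio->buf_offset = offset;
    fio->buf_valid = (size_t) nread;
    memcpy (dst, fio->buf, count);
    return IOD_OK;
  }

With this, a run of small successive reads costs one pread64 syscall
per IOD_BUF_SIZE bytes instead of one syscall per read.
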
Here are the results of `time` on my benchmark, as executed on my
shitty old laptop with a bunch of other stuff running in the
background, i.e. YMMV.
I use a work-in-progress `zfs.pk` (available in the Pokology repo) that
reads a lot of uint32s to compute the Fletcher4 checksum used in the
`ZFS send` file format.
I thought this checksum pickle would be a good benchmark, both because
most of the heavy lifting consists of contiguous reads and because it
is likely to catch bugs in the implementation (by way of complaining
about incorrect checksums).
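
For context, Fletcher4 as used by ZFS keeps four 64-bit accumulators
that are updated once per 32-bit word, so checksumming has to touch
every word of the stream; a sketch (word order/endianness elided):

  #include <stdint.h>
  #include <stddef.h>

  struct fletcher4 { uint64_t a, b, c, d; };

  static void
  fletcher4_update (struct fletcher4 *f, const uint32_t *words,
                    size_t nwords)
  {
    for (size_t i = 0; i < nwords; i++)
      {
        f->a += words[i];   /* running sum of the words */
        f->b += f->a;       /* running sum of sums */
        f->c += f->b;       /* ... third order */
        f->d += f->c;       /* ... fourth order */
      }
  }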

Here is bash's `time` builtin, three successive runs per configuration:

FILE* (ftello patch):
      real      1m11.913s
      user      1m1.259s
      sys       0m0.525s
      --
      real      1m13.129s 
      user      1m1.132s
      sys       0m0.458s
      --
      real      1m9.563s
      user      1m0.466s
      sys       0m0.514s
pread (4096 byte buffer):
      real      1m5.867s
      user      0m56.473s
      sys       0m0.477s
      --
      real      0m57.814s
      user      0m52.453s
      sys       0m0.472s
      --
      real      0m59.327s
      user      0m53.165s
      sys       0m0.468s
pread (8192 byte buffer):
      real      1m6.722s
      user      0m56.645s
      sys       0m0.345s
      --
      real      1m3.726s
      user      0m55.380s
      sys       0m0.391s
      --
      real      1m0.855s
      user      0m54.288s
      sys       0m0.435s


Attached you'll find three perf reports (first 50 lines each); these
are based on current master:

  commit 273c468495f71b1f08e7a7169928ef40e8a3c184 (origin/master, origin/HEAD, gnu/master, master)
  Author: Jose E. Marchesi <jose.marchesi@oracle.com>
  Date:   Thu Jul 14 19:00:39 2022 +0200

    remove etc/pk-mi-json-schema.json

- perf.FILE.txt is the current master version
- perf.pread_4k.txt is for the attached pread patch with 4k buffers
- perf.pread_8k.txt is for the attached pread patch with 8k buffers

In perf.FILE.txt we see the following relevant entries:
     3.74%         10851  poke     libc-2.33.so        [.] __ftello64
     3.55%         10301  poke     libc-2.33.so        [.] _IO_fread
     3.43%          9941  poke     libc-2.33.so        [.] _IO_ferror
     0.61%          1777  poke     libc-2.33.so        [.] _IO_file_seekoff@@GLIBC_2.2.5
     0.69%          2006  poke     libpoke.so.0.0.0    [.] ios_dev_file_pread
     0.60%          1750  poke     libc-2.33.so        [.] _IO_file_xsgetn
     0.49%          1410  poke     libpoke.so.0.0.0    [.] ios_dev_file_get_flags
     0.40%          1160  poke     libc-2.33.so        [.] _IO_seekoff_unlocked

This sums up to approximately 13.51% of the runtime. I did not filter the 
callgraph for calls from ios_dev_file_pread, but I believe it's representative 
enough.

In perf.pread_4k.txt we instead see:
     1.84%          4591  poke     libpoke.so.0.0.0    [.] ios_dev_file_pread
     0.48%          1199  poke     libpoke.so.0.0.0    [.] ios_dev_file_get_flags
And in perf.pread_8k.txt:
     1.85%          4738  poke     libpoke.so.0.0.0    [.] ios_dev_file_pread
     0.53%          1351  poke     libpoke.so.0.0.0    [.] ios_dev_file_get_flags

So we are now spending approximately 2% of the time dealing with file
reading instead of 13.5%.

An interesting side note is that we make a lot of calls to
ios_dev_file_get_flags; if we do not expect the flags to change (I
don't think the API allows this?), we could cache them so libpoke
doesn't have to call that function for each and every read. Something
to look into in the future.
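
Purely as a sketch (the ios struct layout and hook names here are
illustrative, not libpoke's actual ones): snapshot the flags once at
open time and return the cached copy afterwards.

  struct ios
  {
    /* ... existing members ...  */
    uint64_t cached_flags;   /* snapshot taken at open time, e.g.
                                io->cached_flags = dev_if->get_flags (dev) */
  };

  static inline uint64_t
  ios_flags (struct ios *io)
  {
    /* No per-read indirect call into the device backend.  */
    return io->cached_flags;
  }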

We also see that the relative cost of (the time spent in) most of the
other expensive functions has gone up, comparing perf.FILE.txt to
perf.pread_4k.txt:
-    13.14%         38114  poke     libc-2.33.so        [.] __memmove_avx_unaligned_erms_rtm
+    15.30%         38181  poke     libc-2.33.so        [.] __memmove_avx_unaligned_erms_rtm
-    10.86%         31485  poke     libpoke.so.0.0.0    [.] pvm_execute_or_initialize
+    12.37%         30873  poke     libpoke.so.0.0.0    [.] pvm_execute_or_initialize
-     7.94%         23035  poke     libc-2.33.so        [.] __memset_avx2_erms_rtm
+     9.29%         23187  poke     libc-2.33.so        [.] __memset_avx2_erms_rtm
-     6.44%         18683  poke     libgc.so.1.4.3      [.] GC_malloc_kind
+     7.83%         19541  poke     libgc.so.1.4.3      [.] GC_malloc_kind
-     7.01%         20329  poke     libgc.so.1.4.3      [.] GC_mark_from
+     6.07%         15151  poke     libgc.so.1.4.3      [.] GC_mark_from

Seen in the context of execution taking about 20% less wall-clock time,
this indicates that we are onto a beneficial optimization: the same
absolute amount of work in those functions now accounts for a larger
share of a smaller total.

Comparing the two buffer sizes we learn that my benchmark was probably
not good enough; the statistics are inconclusive and the differences do
not look statistically significant to me.

It does look like 8k might be slightly faster (it shows consistently
lower sys times), but in my runs pread_4k won on wall-clock time, and I
wouldn't be willing to bet much on either being superior without better
data, including more varied benchmarks. I wonder if we have something
like that to catch performance regressions?
One way to obtain better data would be to increase the sampling rate of
`perf`, or to switch to tracing instead.

I looked at the access patterns by tracing the pread patch with strace
and the following awk script:

  strace --string-limit=0 -f -e pread64 ./run poke \
    -c 'load "../pokology/pickles/zfs.pk";' -c 'open("my.zfs-send")' \
    -c "var x=(ZFS_send @ 0 : 0#b);" -c "x.checksum" \
    |& tr -d ',)' \
    | awk 'BEGIN{last=0; succ=0;}
           /pread64/ && /pid/ {
             if ($6 == last) {succ++;} else {succ=0;}
             last=$6+$8;   # offset of the next contiguous read
             print substr($3,9)"\t"$5"\t"$6"\t"$8"\t"succ;
           }
           END{print last}'
This counts the number of successive reads and can be used to get an
idea of how we're doing. If our general access pattern were scattered
and all over the map we'd see "0" in the last column for most rows; but
in my trace the successive-read counter keeps climbing, i.e. the reads
are largely contiguous.

Moving forward, I propose the attached patch to replace the default
FILE*-based iod subsystem. The way Poke currently utilizes this
subsystem is in the majority of cases to read 1 byte at a given offset,
at least in my benchmark.

Looking at zfs.pk it looks like it *should* read in increments of 4
bytes, but I don't know if endianness (set to big-endian in zfs.pk,
while my processor is little-endian) somehow messes that up.
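
One hypothesis, purely illustrative and not actual PVM code: if a
big-endian value is assembled byte by byte on a little-endian host, a
4-byte field would show up at the device layer as four 1-byte reads at
successive offsets, something like:

  /* Hypothetical byte-at-a-time assembly of a big-endian uint32.  */
  static uint32_t
  read_be_u32_bytewise (void *iod, uint64_t offset)
  {
    uint32_t v = 0;
    for (int i = 0; i < 4; i++)
      {
        uint8_t byte = 0;
        /* Each iteration is a separate 1-byte hit on the device.  */
        ios_dev_file_pread (iod, &byte, 1, offset + i);
        v = (v << 8) | byte;   /* most significant byte first */
      }
    return v;
  }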

I tried to figure out if I could make it read things in
larger batches, but as far as I can tell the reads come from
inside PVM, and I don't know enough about that part of Poke to
make intelligent suggestions other than "maybe we could see if we
could read more data at a time".

Lastly, the code in the attached patch does not do write caching;
instead of updating the buffer when writing, it invalidates the read
cache. I have not benchmarked writing, but we should come up with a
benchmark for that so we can work on the pwrite side of things.
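
In other words, roughly (same illustrative struct as in the read sketch
above):

  static int
  ios_dev_file_pwrite (void *iod, const void *src, size_t count,
                       uint64_t offset)
  {
    struct ios_dev_file *fio = iod;
    ssize_t nwritten = pwrite (fio->fd, src, count, (off_t) offset);

    if (nwritten < 0 || (size_t) nwritten != count)
      return IOD_ERROR;

    /* No write caching: just drop the read buffer so a later read
       cannot serve stale data.  Invalidating only on overlap would
       also work; unconditional is the simplest correct option.  */
    fio->buf_valid = 0;
    return IOD_OK;
  }
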
There are also a number of outstanding problems / discussion points
marked with TODO in the attached patch; it would be nice if we could
discuss them too before merging. :-)

~ap

Attachment: perf.FILE.txt
Description: Text document

Attachment: perf.pread_4k.txt
Description: Text document

Attachment: perf.pread_8k.txt
Description: Text document

Attachment: 0001-pread64-based-buffered-I-O.patch
Description: Text Data

