From: apache2
Subject: Re: [DISCUSSION] FILE* slowness
Date: Sun, 17 Jul 2022 11:06:15 +0200
User-agent: Mutt/1.9.3 (2018-01-21)
OK, so here is some more analysis.

I started out with the "stupid" implementation that mapped ios_dev_file_pread and ios_dev_file_pwrite directly to pread and pwrite. This was, not totally unexpectedly, slower than our current FILE*-based code with the ftello patch applied, presumably because `fcntl(SEEK_CUR)` is faster than `pread64`. Then I added a buffer. That made things look a little bit better.

Here are the results of `time` on my benchmark, as executed on my shitty old laptop with a bunch of other stuff running in the background, i.e. YMMV. I use a work-in-progress `zfs.pk` (available in the Pokology repo) that reads a lot of uint32s to compute the Fletcher4 checksum used in the `ZFS send` file format. I thought this checksum pickle would be a good benchmark, both because most of the heavy lifting consists of contiguous reads, and because it would be more likely to catch bugs in the implementation (by way of complaining about incorrect checksums).

Output of bash's `time` builtin, three successive runs each:

FILE* (ftello patch):

  real 1m11.913s   user 1m1.259s    sys 0m0.525s
  real 1m13.129s   user 1m1.132s    sys 0m0.458s
  real 1m9.563s    user 1m0.466s    sys 0m0.514s

pread (4096 byte buffer):

  real 1m5.867s    user 0m56.473s   sys 0m0.477s
  real 0m57.814s   user 0m52.453s   sys 0m0.472s
  real 0m59.327s   user 0m53.165s   sys 0m0.468s

pread (8192 byte buffer):

  real 1m6.722s    user 0m56.645s   sys 0m0.345s
  real 1m3.726s    user 0m55.380s   sys 0m0.391s
  real 1m0.855s    user 0m54.288s   sys 0m0.435s

Attached you'll find three perf reports (first 50 lines). These are based on current master:

  commit 273c468495f71b1f08e7a7169928ef40e8a3c184 (origin/master, origin/HEAD, gnu/master, master)
  Author: Jose E. Marchesi <jose.marchesi@oracle.com>
  Date:   Thu Jul 14 19:00:39 2022 +0200

      remove etc/pk-mi-json-schema.json

- perf.FILE.txt is the current master version
- perf.pread_4k.txt is for the attached pread patch with 4k buffers
- perf.pread_8k.txt is for the attached pread patch with 8k buffers

In perf.FILE.txt we see the following relevant entries:

  3.74%  10851  poke  libc-2.33.so      [.] __ftello64
  3.55%  10301  poke  libc-2.33.so      [.] _IO_fread
  3.43%   9941  poke  libc-2.33.so      [.] _IO_ferror
  0.61%   1777  poke  libc-2.33.so      [.] _IO_file_seekoff@@GLIBC_2.2.5
  0.69%   2006  poke  libpoke.so.0.0.0  [.] ios_dev_file_pread
  0.60%   1750  poke  libc-2.33.so      [.] _IO_file_xsgetn
  0.49%   1410  poke  libpoke.so.0.0.0  [.] ios_dev_file_get_flags
  0.40%   1160  poke  libc-2.33.so      [.] _IO_seekoff_unlocked

This sums up to approximately 13.5% of the runtime. I did not filter the callgraph for calls from ios_dev_file_pread, but I believe it's representative enough.

In perf.pread_4k.txt we instead see:

  1.84%   4591  poke  libpoke.so.0.0.0  [.] ios_dev_file_pread
  0.48%   1199  poke  libpoke.so.0.0.0  [.] ios_dev_file_get_flags

And in perf.pread_8k.txt:

  1.85%   4738  poke  libpoke.so.0.0.0  [.] ios_dev_file_pread
  0.53%   1351  poke  libpoke.so.0.0.0  [.] ios_dev_file_get_flags

So we are now spending approximately 2% of the time dealing with file reading instead of 13.5%.

An interesting side note is that we have a lot of calls to ios_dev_file_get_flags; if we do not expect the flags to change (I don't think the API allows this?), we could cache them so libpoke doesn't have to call that function for each and every read. Something to look into in the future.

We also see that the relative cost of (the time spent in) most of the other expensive functions has gone up comparing perf.FILE.txt (-) to perf.pread_4k.txt (+):

  - 13.14%  38114  poke  libc-2.33.so      [.] __memmove_avx_unaligned_erms_rtm
  + 15.30%  38181  poke  libc-2.33.so      [.] __memmove_avx_unaligned_erms_rtm
  - 10.86%  31485  poke  libpoke.so.0.0.0  [.] pvm_execute_or_initialize
  + 12.37%  30873  poke  libpoke.so.0.0.0  [.] pvm_execute_or_initialize
  -  7.94%  23035  poke  libc-2.33.so      [.] __memset_avx2_erms_rtm
  +  9.29%  23187  poke  libc-2.33.so      [.] __memset_avx2_erms_rtm
  -  6.44%  18683  poke  libgc.so.1.4.3    [.] GC_malloc_kind
  +  7.83%  19541  poke  libgc.so.1.4.3    [.] GC_malloc_kind
  -  7.01%  20329  poke  libgc.so.1.4.3    [.] GC_mark_from
  +  6.07%  15151  poke  libgc.so.1.4.3    [.] GC_mark_from

This, seen in the context of the fact that execution took about 20% less wall clock time, indicates that we are onto a beneficial optimization.

Comparing the two buffer sizes, we learn that my benchmark was probably not good enough; the statistics are inconclusive and the differences do not look statistically significant to me. It does look like 8k might be slightly faster, but in my example the 4k buffer won, and I wouldn't be willing to bet much on either being superior without better data, including more varied benchmarks. I wonder if we have something like that to catch performance regressions? One way to obtain better data would be to increase the sampling rate of `perf`, or to switch to tracing instead.

I looked at the access patterns by tracing the pread patch with strace and the following awk script:

  strace --string-limit=0 -f -e pread64 ./run poke \
    -c 'load "../pokology/pickles/zfs.pk";' -c 'open("my.zfs-send")' \
    -c "var x=(ZFS_send @ 0 : 0#b);" -c "x.checksum" \
    |& tr -d ',)' \
    | awk 'BEGIN{last=0; succ=0;}
           /pread64/ && /pid/ {
             if ($6 == last) {succ++;} else {succ=0;}
             last=$6+$8;
             print substr($3,9)"\t"$5"\t"$6"\t"$8"\t"succ;
           }
           END{print last}'

This counts the number of successive reads and can be used to get an idea of how we're doing. If our general access pattern was scattered and all over the map, we'd see "0" in the last column for most of the lines.

Moving forward, I propose the attached patch to replace the default FILE*-based iod subsystem.
The way Poke currently utilizes this subsystem is, in many (the majority of) cases, to read 1 byte from a given offset, at least in my benchmark. Looking at zfs.pk it looks like it *should* read in increments of 4 bytes, but I don't know if endianness (in zfs.pk set to big-endian; my processor is little-endian) somehow messes that up. I tried to figure out whether I could make it read things in larger batches, but as far as I can tell the reads come from inside the PVM, and I don't know enough about that part of Poke to make intelligent suggestions other than "maybe we could see if we could read more data at a time".

Lastly, the attached code does not do write caching; instead of updating the buffer when writing, it invalidates the read cache. I have not benchmarked writing, but we should come up with a benchmark for that so we can work on the pwrite side of things.

There are also a number of outstanding problems / discussion points marked with TODO in the attached patch; it would be nice if we could discuss them too before merging. :-)

~ap
Attachments:
- perf.FILE.txt (Text document)
- perf.pread_4k.txt (Text document)
- perf.pread_8k.txt (Text document)
- 0001-pread64-based-buffered-I-O.patch (Text Data)