First of all thanks for making this great app!
Recently I've been working on a project that involves generating a lot of images, processing them externally and then computing some statistics from the resulting images in Octave. The images are 4096x4096 16-bit uncompressed TIFFs, so around 100MB each. I noticed that a big chunk of the total time is spent just reading and writing the files, so I started looking into how I could optimize that. Here are my findings.
Currently it's faster to convert the TIFFs to raw RGB images and read those manually with fread (~1.7x faster for a 16-bit image and 3x faster for an 8-bit image on my machine):
system(["gm convert " fname " " fname ".rgb"]);
fid = fopen([fname ".rgb"], "r", "ieee-be");  # big-endian, matching gm's raw output
im_fast = permute(reshape(fread(fid, [3 Inf], "*uint16")', 4096, 4096, 3), [2 1 3]);
fclose(fid);
Thinking that hack was a bit silly, I decided to look at the source code of Octave itself. Some profiling showed that a big part of the read time is spent rescaling values from GraphicsMagick's internal representation to the actual bit depth (I'm mostly investigating the "TrueColorType" part of read_images). This is currently done by first converting everything to double, dividing, rounding and converting back to unsigned int.

In almost all cases (except for 32-bit file depth, where for some reason Octave wants to return a normalized double result) these operations can be done by bit shifting. A simple bit shift improves performance by 3.4x for 16-bit and 3x for 8-bit images. It gets even faster, around 5x for both, if there is an if-branch for each shift amount (0, 8 or 24), so that the shift value is a constant inside the per-pixel for-loops. I wonder how/if this code could be written in a nice and concise way, so as not to have 3 copies of the same code.

I also don't know how the 32-bit case should be handled, but it really looks like a corner case to me. I guess most users don't have GraphicsMagick compiled with quantum-depth 32, and floating-point images are quite rare. Having users do the conversion from uint32 to float on their own might not be such a bad idea, but the thought of breaking existing code sounds bad.
In the case of imwrite, things are much nicer. Here I'm mostly investigating the "TrueColorType" part of encode_uint_image. The big time-waster is the construction and destruction of a Magick::Color object for each and every pixel, which internally calls new and delete. It turns out the output values can be written directly to the output vector, without an intermediary Color object; that alone improves performance by more than 2.5x. It gets even faster if integer operations are used (4x with multiplications), but there are two cases: when the Octave type is narrower than the quantum depth and when it is wider. In the first case we need to multiply by a constant that depends on the width of the template type and the quantum depth; in the second we need to shift by an amount that depends on those depths, just as was done for imread. Unfortunately I don't have a lot of experience with templates, so I don't know how this can be done without duplicating code.
Just a final note: I stored and read the images to and from a tmpfs filesystem, so the speedups might be a lot smaller on a HDD. In case anyone wonders (I know I did) whether division and rounding can really be replaced by bit shifting, here's some proof for one of the cases:
v = uint8(0:2^8-1);                          # every possible 8-bit value
v_mag = uint16(v)*uint16((2^16-1)/(2^8-1));  # scale to 16 bits as GraphicsMagick does (factor 257)
v_rec = bitshift(v_mag, -8);                 # recover by shifting instead of dividing and rounding
assert(isequal(v_rec, uint16(v)))            # exact round trip
What do you think about all this? Is it worth the effort? If so, I could try cleaning my code up, extending it to all the cases (RGB, grayscale, with/without alpha, etc.) and sending you a changeset. I'm not exactly sure what the process is for getting involved.
All the best,