getImageData and putImageData could be much faster

`NAN_METHOD(Context2d::GetImageData)` and `PutImageData` could be much faster (at least a factor of 2, or much more with SIMD) by loading/storing whole pixels as `uint32_t`.  Use integer rotates and similar bit operations to do the format conversions.  Use an endian-conversion function that can inline (to a single `bswap` instruction on x86, for example), since compilers don't do a good job with single-byte loads/stores.

---

gcc doesn't merge the [single-byte loads and stores](https://github.com/Automattic/node-canvas/blob/master/src/CanvasRenderingContext2d.cc#L712) into 32-bit stores (even on x86 where unaligned loads/stores never fault and are fast), so that's a major bottleneck.  (If you look at the asm, it's not pretty).

Transforming ARGB to/from RGBA can be done with a single 32-bit rotate by 8 bits.  See [this SO post](http://stackoverflow.com/questions/776508/best-practices-for-circular-shift-rotate-operations-in-c) for a portable + safe rotate idiom that reliably compiles to a single rotate instruction on most compilers.
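A rough sketch of that rotate, following the masked-count idiom from the linked SO post. The function names are mine (hypothetical, not anything in node-canvas), and it treats each pixel as a single 32-bit value with alpha in either the top or the bottom byte:

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Rotate left. Masking the count keeps n == 0 well-defined
// (a plain x >> (32 - n) would shift by 32, which is undefined behaviour).
static inline uint32_t rotl32(uint32_t x, unsigned n) {
  const unsigned mask = CHAR_BIT * sizeof(x) - 1;  // 31 for a 32-bit type
  assert(n <= mask && "rotate by type width or more");
  n &= mask;
  return (x << n) | (x >> ((-n) & mask));
}

// Rotate right, same idiom.
static inline uint32_t rotr32(uint32_t x, unsigned n) {
  const unsigned mask = CHAR_BIT * sizeof(x) - 1;
  assert(n <= mask && "rotate by type width or more");
  n &= mask;
  return (x >> n) | (x << ((-n) & mask));
}

// 0xAARRGGBB -> 0xRRGGBBAA and back, one rotate each way.
static inline uint32_t argb_to_rgba(uint32_t argb) { return rotl32(argb, 8); }
static inline uint32_t rgba_to_argb(uint32_t rgba) { return rotr32(rgba, 8); }
```

With a constant count of 8, the masking and the assert should fold away at compile time, leaving just the rotate.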

To handle the portable-to-native endian conversion, use something like [`uint32_t ntohl(uint32_t)`](https://linux.die.net/man/3/htonl) from `<arpa/inet.h>` if available.  (On any decent implementation these functions will inline: on x86 gcc, for example, `ntohl` inlines to a single [`bswap` instruction](https://hjlebbink.github.io/x86doc/html/BSWAP.html).)  I'm not sure about Windows, or whether we'd need an `#ifdef` to get a different endian-conversion function / intrinsic.
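If an `#ifdef` does turn out to be necessary, it might look something like this. This is only a sketch: `bswap32` and `be32_to_host` are hypothetical names, it leans on `_byteswap_ulong` (MSVC, `<stdlib.h>`) and `__builtin_bswap32` (GCC/Clang), and I haven't checked the generated code on every compiler.

```cpp
#include <cstdint>
#include <cstring>
#if defined(_MSC_VER)
#include <stdlib.h>  // _byteswap_ulong
#endif

// Unconditional 32-bit byte swap using whatever the compiler can inline.
static inline uint32_t bswap32(uint32_t x) {
#if defined(_MSC_VER)
  return _byteswap_ulong(x);
#elif defined(__GNUC__) || defined(__clang__)
  return __builtin_bswap32(x);
#else
  // Plain shift/mask fallback for other compilers.
  return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
         ((x << 8) & 0x00ff0000u) | (x << 24);
#endif
}

// Runtime endianness probe; compilers typically fold this to a constant.
static inline bool host_is_little_endian() {
  const uint16_t probe = 1;
  unsigned char first;
  std::memcpy(&first, &probe, 1);
  return first == 1;
}

// Same job as ntohl: big-endian (byte-order) value -> native value.
static inline uint32_t be32_to_host(uint32_t x) {
  return host_is_little_endian() ? bswap32(x) : x;
}
```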

In summary, the get and put image data functions should both be doing 32-bit loads and stores, plus an endian conversion and a rotate.  The current GetImageData should run at about 1 byte per clock cycle on recent Intel CPUs (as compiled by gcc 6.3.1 for x86-64), for ranges of pixels that have alpha == 0 or 255.  It may be even slower on AMD Bulldozer-family CPUs, since they have fewer integer execution units.
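Put together, the inner loops could look roughly like the following. This is a sketch under a few assumptions: POSIX `ntohl`/`htonl` are available, the canvas side stores native-endian `0xAARRGGBB` words, the ImageData side is R,G,B,A bytes in memory, and only the easy case (alpha 0 or 255, so no premultiply work) is handled. The function names are made up for illustration.

```cpp
#include <arpa/inet.h>  // ntohl / htonl (POSIX; assumed available here)
#include <cstddef>
#include <cstdint>
#include <cstring>

// ImageData bytes (R,G,B,A) -> native 0xAARRGGBB words.
static void rgba_bytes_to_argb32(const uint8_t* src, uint32_t* dst,
                                 size_t npixels) {
  for (size_t i = 0; i < npixels; ++i) {
    uint32_t v;
    std::memcpy(&v, src + 4 * i, sizeof v);  // one 32-bit (unaligned-safe) load
    v = ntohl(v);                            // 0xRRGGBBAA on any host
    dst[i] = (v >> 8) | (v << 24);           // rotate right by 8 -> 0xAARRGGBB
  }
}

// Native 0xAARRGGBB words -> ImageData bytes (R,G,B,A).
static void argb32_to_rgba_bytes(const uint32_t* src, uint8_t* dst,
                                 size_t npixels) {
  for (size_t i = 0; i < npixels; ++i) {
    uint32_t v = src[i];
    v = (v << 8) | (v >> 24);                // rotate left by 8 -> 0xRRGGBBAA
    v = htonl(v);                            // back to R,G,B,A byte order
    std::memcpy(dst + 4 * i, &v, sizeof v);  // one 32-bit store
  }
}
```

The `memcpy` calls are there to keep the unaligned access well-defined; with a fixed 4-byte size, compilers normally turn them into plain loads and stores.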

A 32-bit-at-a-time bswap+rotate version should be at least twice as fast.  It probably can't manage 1 pixel per clock since it still has to check alpha.  (Although you can do that check without unpacking the alpha with a shift, by testing `pix >= (0xffUL << 24) || pix < (1UL << 24)` in a format where we know alpha is in the most-significant 8 bits.)
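As a small sketch, that alpha test can be wrapped like this (helper names are hypothetical). The second function is a variation of my own, not something from the existing code: adding `1 << 24` wraps alpha 255 around to 0 and moves alpha 0 to 1, so a single unsigned compare covers both cases.

```cpp
#include <cstdint>

// True when the alpha byte (top 8 bits of the pixel) is 0 or 255,
// i.e. the pixel needs no premultiply / un-premultiply arithmetic.
static inline bool alpha_is_trivial(uint32_t pix) {
  return pix >= (UINT32_C(0xff) << 24) || pix < (UINT32_C(1) << 24);
}

// Same predicate with one compare, relying on unsigned wrap-around.
static inline bool alpha_is_trivial_wrap(uint32_t pix) {
  return static_cast<uint32_t>(pix + UINT32_C(0x01000000)) <
         UINT32_C(0x02000000);
}
```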

The floating-point stuff for alpha looks ridiculously slow.  Probably an integer multiply and shift would be a win.  (Or integer division by 255, which compilers can implement with multiply and shift.)  At the very least, GetImageData should do `float alpha = 255.0f / (float)a` and then three multiplies, instead of the other way around and three divides.  (Multiply is much faster than divide).
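To make that concrete, here's a sketch of both directions. None of this is the current node-canvas code: the `Rgb` struct and function names are hypothetical, and rounding to nearest is my own choice (the existing code may truncate instead).

```cpp
#include <cassert>
#include <cstdint>

struct Rgb { uint8_t r, g, b; };

// Scale one premultiplied channel back up, rounding and clamping to 255.
static inline uint8_t scale_channel(uint8_t c, float scale) {
  const float v = c * scale + 0.5f;
  return v >= 255.0f ? 255 : static_cast<uint8_t>(v);
}

// Un-premultiply: one divide to get the scale, then three multiplies.
// Caller handles a == 0 separately (the result is just transparent black).
static inline Rgb unpremultiply(uint8_t r, uint8_t g, uint8_t b, uint8_t a) {
  assert(a != 0);
  const float scale = 255.0f / static_cast<float>(a);
  return Rgb{scale_channel(r, scale), scale_channel(g, scale),
             scale_channel(b, scale)};
}

// Premultiply in integers. Division by the constant 255 is something
// compilers can implement with a multiply and shift.
static inline uint8_t premultiply_channel(uint8_t c, uint8_t a) {
  return static_cast<uint8_t>((static_cast<unsigned>(c) * a + 127u) / 255u);
}
```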

I guess I should just write the code and submit a pull request.

---

Of course, the alpha stuff could be done much faster with SIMD (floating-point or just unpack to 16-bit integer like most SIMD graphics code does).  You don't even need to do runtime CPU detection to get a big speedup on x86-64, because SSE2 is baseline.

SSSE3 would be extremely useful though (for a byte-shuffle), and could speed up the easy-alpha case by maybe another factor of 4 (16 bytes at a time with a fallback if any of the 4 pixels have a non-easy alpha).  Without pshufb (`_mm_shuffle_epi8`), you can use SSE2 bit-shifts and boolean (AND/OR/XOR) ops to exchange R and B for 4 pixels in a 16-byte vector.  (Which is what a little-endian platform needs).
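For reference, the SSE2-only R/B exchange could be sketched like this. It assumes R,G,B,A bytes in memory on a little-endian x86 target, so R sits in the low byte of each 32-bit lane and B in bits 16-23. The names are hypothetical, and the scalar branch is only there so the sketch still builds on targets without SSE2.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2 intrinsics
#define HAVE_SSE2_SWAP 1
#endif

#ifdef HAVE_SSE2_SWAP
// Exchange byte 0 and byte 2 of every 32-bit lane using only AND/OR/shifts.
static inline __m128i swap_r_b_sse2(__m128i v) {
  const __m128i rb_mask = _mm_set1_epi32(0x00ff00ff);
  const __m128i rb = _mm_and_si128(v, rb_mask);     // keep R and B
  const __m128i ga = _mm_andnot_si128(rb_mask, v);  // keep G and A
  const __m128i rb_swapped =
      _mm_or_si128(_mm_slli_epi32(rb, 16), _mm_srli_epi32(rb, 16));
  return _mm_or_si128(ga, rb_swapped);
}
#endif

// Convert 4 pixels (16 bytes) at a time; src and dst may be unaligned.
static void swap_r_b_4px(const uint8_t* src, uint8_t* dst) {
#ifdef HAVE_SSE2_SWAP
  const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
  _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), swap_r_b_sse2(v));
#else
  for (size_t i = 0; i < 16; i += 4) {
    const uint8_t r = src[i], b = src[i + 2];  // read first: in-place safe
    dst[i] = b;
    dst[i + 1] = src[i + 1];
    dst[i + 2] = r;
    dst[i + 3] = src[i + 3];
  }
#endif
}
```

This only covers the easy-alpha case; a real loop would still need the check and fallback for vectors containing a pixel with non-trivial alpha.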
