Skip to content

getImageData and putImageData could be much faster #909

Open
@pcordes

Description

@pcordes

NAN_METHOD(Context2d::GetImageData) and PutImageData could be much faster (at least a factor of 2, or much more with SIMD) by loading/storeing whole pixels as uint32_t. Use integer rotates and stuff to do the format conversions. Use an endian-conversion function that can inline (to a single bswap instruction for x86, for example), since compilers don't do a good job with single-byte loads/stores.


gcc doesn't merge the single-byte loads and stores into 32-bit stores (even on x86 where unaligned loads/stores never fault and are fast), so that's a major bottleneck. (If you look at the asm, it's not pretty).

Transforming ARGB to/from RGBA can be done with a single 32-bit rotate by 8 bits. See this SO post for a portable + safe rotate idiom that reliably compiles to a single rotate instruction on most compilers.

To handle the portable-to-native endian conversion, use something like uint32_t ntohl(uint32_t) from <arpa/inet.h> if available. (On any decent implementation, these functions will inline. e.g. on x86 gcc, it inlines to a single bswap instruction). I'm not sure about Windows, or if we'd need an #ifdef to get a different endian-conversion function / intrinsic.

In summary, the get and put image data functions should both be doing 32-bit loads and stores, and doing an endian-conversion + a rotate. The current GetImageData should run at about 1 byte per clock cycle on recent Intel CPUs (as compiled by gcc6.3.1 for x86-64), for ranges of pixels that have alpha == 0 or 255. Maybe even slower on AMD Bulldozer-family CPUs, since they have fewer integer execution units.

A 32-bit-at-a-time bswap+rotate version should be at least twice as fast. It probably can't manage 1 pixel per clock since it still has to check alpha. (Although you can do that without unpacking the alpha with a shift, by checking if pix >= (0xffUL << 24) || pix < (1UL << 24) in a format where we know alpha is in the MSB 8 bits.)

The floating-point stuff for alpha looks ridiculously slow. Probably an integer multiply and shift would be a win. (Or integer division by 255, which compilers can implement with multiply and shift.) At the very least, GetImageData should do float alpha = 255.0f / (float)a and then three multiplies, instead of the other way around and three divides. (Multiply is much faster than divide).

I guess I should just write the code and submit a pull request.


Of course, the alpha stuff could be done much faster with SIMD (floating-point or just unpack to 16-bit integer like most SIMD graphics code does). You don't even need to do runtime CPU detection to get a big speedup on x86-64, because SSE2 is baseline.

SSSE3 would be extremely useful though (for a byte-shuffle), and could speed up the easy-alpha case by maybe another factor of 4 (16 bytes at a time with a fallback if any of the 4 pixels have a non-easy alpha). Without pshufb (_mm_shuffle_epi8), you can use SSE2 bit-shifts and boolean (AND/OR/XOR) ops to exchange R and B for 4 pixels in a 16-byte vector. (Which is what a little-endian platform needs).

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions