Improve Serpent performance beyond SSE2. #1520

MarekKnapek · 2025-04-06T00:08:06Z

Hi, I looked through VeraCrypt's Serpent source code and found that it already implements an optimization based on the paper "Speeding up Serpent" by Dag Arne Osvik.

The optimization uses the SSE2 instruction set, which provides 128-bit registers, while Osvik's formulation works on 32-bit integers. The SSE2 code therefore loads 4 blocks at once (each block is 128 bits) and shuffles them so that the first SSE2 register holds the first quarter of each block, the second register holds the second quarter of each block, and so on. It then performs the same bitwise operations as in Osvik's paper, but on 128-bit registers instead of 32-bit integers. Finally, a reverse shuffle puts the words back into block order. The result is that 4 128-bit blocks are encrypted (or decrypted) in the time it would otherwise take to process a single block.
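To make the shuffle concrete, here is a minimal portable C sketch of the data layout only. It is not taken from the VeraCrypt sources: the names `blocks_to_lanes` and `lanes_to_blocks` are hypothetical, each `reg[w]` row stands in for one SSE2 register, and little-endian word loading is an assumption of this sketch. Real SSE2 code would do the same transpose with unpack instructions rather than scalar loops.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; not VeraCrypt identifiers. */
enum { BLOCKS = 4, WORDS = 4 };

/* Read one 32-bit word, assuming little-endian byte order. */
static uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Write one 32-bit word in little-endian byte order. */
static void store_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* Forward shuffle: reg[w][b] receives word w of block b,
 * so row reg[w] plays the role of one 128-bit register. */
static void blocks_to_lanes(const uint8_t in[BLOCKS * 16],
                            uint32_t reg[WORDS][BLOCKS])
{
    assert(in != NULL && reg != NULL);
    for (int b = 0; b < BLOCKS; ++b)
        for (int w = 0; w < WORDS; ++w)
            reg[w][b] = load_le32(in + 16 * b + 4 * w);
}

/* Reverse shuffle: write the four rows back as four consecutive blocks. */
static void lanes_to_blocks(uint32_t reg[WORDS][BLOCKS],
                            uint8_t out[BLOCKS * 16])
{
    assert(reg != NULL && out != NULL);
    for (int b = 0; b < BLOCKS; ++b)
        for (int w = 0; w < WORDS; ++w)
            store_le32(out + 16 * b + 4 * w, reg[w][b]);
}
```

Between the two calls, every round operation would be applied to whole rows, which is what lets one instruction act on all four blocks.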

This could easily be extended from SSE2 (128-bit) to AVX2 (256-bit) or AVX512 (512-bit). Instead of loading 4 blocks at once, load 8 or 16. The shuffle would move the first quarter of each block into the first AVX2 or AVX512 register, the second quarter of each block into the second wide register, and so on. The same bitwise operations as in Osvik's paper would then be performed on the wide registers instead of on 32-bit integers. A reverse shuffle finishes the job, writing the 4 wide registers back to 8 or 16 blocks in main memory.
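The following portable C sketch illustrates why the extension is mostly mechanical. The round code is written once against a handful of "wide" operations, and only the lane count changes. `LANES`, `wide_t`, and the `wide_*` helpers are hypothetical names for this sketch, and the scalar loops stand in for real intrinsics. Serpent's linear transformation is used as the sample round step.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical knob: 4 mirrors SSE2, 8 AVX2, 16 AVX512. */
#ifndef LANES
#define LANES 8
#endif

/* One "wide register": the same 32-bit word taken from LANES blocks. */
typedef struct { uint32_t lane[LANES]; } wide_t;

static uint32_t rotl32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x << n) | (x >> (32 - n));
}

static wide_t wide_xor(wide_t a, wide_t b)
{
    for (int i = 0; i < LANES; ++i) a.lane[i] ^= b.lane[i];
    return a;
}

static wide_t wide_shl(wide_t a, unsigned n)
{
    assert(n < 32);
    for (int i = 0; i < LANES; ++i) a.lane[i] <<= n;
    return a;
}

static wide_t wide_rotl(wide_t a, unsigned n)
{
    for (int i = 0; i < LANES; ++i) a.lane[i] = rotl32(a.lane[i], n);
    return a;
}

/* Serpent linear transformation on one block, as a scalar reference. */
static void serpent_lt_scalar(uint32_t x[4])
{
    x[0] = rotl32(x[0], 13);
    x[2] = rotl32(x[2], 3);
    x[1] ^= x[0] ^ x[2];
    x[3] ^= x[2] ^ (x[0] << 3);
    x[1] = rotl32(x[1], 1);
    x[3] = rotl32(x[3], 7);
    x[0] ^= x[1] ^ x[3];
    x[2] ^= x[3] ^ (x[1] << 7);
    x[0] = rotl32(x[0], 5);
    x[2] = rotl32(x[2], 22);
}

/* The same steps on LANES blocks at once; x[i] holds word i of every block. */
static void serpent_lt_wide(wide_t x[4])
{
    x[0] = wide_rotl(x[0], 13);
    x[2] = wide_rotl(x[2], 3);
    x[1] = wide_xor(x[1], wide_xor(x[0], x[2]));
    x[3] = wide_xor(x[3], wide_xor(x[2], wide_shl(x[0], 3)));
    x[1] = wide_rotl(x[1], 1);
    x[3] = wide_rotl(x[3], 7);
    x[0] = wide_xor(x[0], wide_xor(x[1], x[3]));
    x[2] = wide_xor(x[2], wide_xor(x[3], wide_shl(x[1], 7)));
    x[0] = wide_rotl(x[0], 5);
    x[2] = wide_rotl(x[2], 22);
}
```

In an AVX2 build, `wide_xor` and `wide_shl` would presumably map to `_mm256_xor_si256` and `_mm256_slli_epi32`, with the rotate built from two shifts and an OR. AVX512 additionally offers a native 32-bit rotate (`_mm512_rol_epi32`).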

I have already implemented this optimization in my own cryptographic project and measured roughly a 2x speedup when going from SSE2 to AVX2. I don't own an AVX512-capable computer, but I borrowed one for a while and did not notice any significant speedup when going from AVX2 to AVX512. This might be because the test ran on a "weak" but otherwise massively parallel Intel processor that splits each 512-bit operation into two separate 256-bit operations.

Anyway, I believe this suggestion could significantly improve the performance of the Serpent cipher: roughly 2x, or even more on a recent enough Intel or AMD processor.

MarekKnapek · 2025-04-06T01:11:49Z

The general (non-SIMD) case could also be improved by using 64-bit integers to encrypt two blocks at a time.
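One detail to watch with the 64-bit approach: XOR, AND, and OR work on two packed 32-bit words unchanged, but the shifts and rotations in Serpent's linear transformation must not leak bits from one half into the other. A minimal sketch, using hypothetical helper names (`pack2`, `shl32x2`, `rotl32x2`) and assuming block A sits in the low half and block B in the high half:

```c
#include <assert.h>
#include <stdint.h>

/* Pack the same word of two blocks: block A low, block B high. */
static uint64_t pack2(uint32_t a, uint32_t b)
{
    return (uint64_t)a | ((uint64_t)b << 32);
}

static uint32_t lo32(uint64_t v) { return (uint32_t)v; }
static uint32_t hi32(uint64_t v) { return (uint32_t)(v >> 32); }

/* Replicate a 32-bit mask into both halves of a 64-bit word. */
static uint64_t both(uint32_t m)
{
    return (uint64_t)m | ((uint64_t)m << 32);
}

/* Shift each 32-bit half left by n, dropping bits that cross the boundary. */
static uint64_t shl32x2(uint64_t v, unsigned n)
{
    assert(n > 0 && n < 32);
    return (v << n) & both((uint32_t)(UINT32_C(0xFFFFFFFF) << n));
}

/* Rotate each 32-bit half left by n independently. */
static uint64_t rotl32x2(uint64_t v, unsigned n)
{
    assert(n > 0 && n < 32);
    /* Bits n..31 of each half come from the left shift. */
    uint64_t left = (v << n) & both((uint32_t)(UINT32_C(0xFFFFFFFF) << n));
    /* Bits 0..n-1 of each half come from the right shift. */
    uint64_t right = (v >> (32 - n)) & both(UINT32_C(0xFFFFFFFF) >> (32 - n));
    return left | right;
}
```

The masking adds a few instructions per rotate, so the gain over plain 32-bit code would need to be measured rather than assumed.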

ARM NEON could also be used, but I don't know NEON well enough.
