Improve Serpent performance beyond SSE2. #1520

Open
MarekKnapek opened this issue Apr 6, 2025 · 1 comment

@MarekKnapek

Hi, I looked through VeraCrypt's source code for the Serpent cipher and found that it already implements an optimization based on the paper "Speeding up Serpent" by Dag Arne Osvik.

The optimization uses the SSE2 instruction set, which has 128-bit registers, while the Osvik optimization works on 32-bit integers. So the SSE2 code loads 4 blocks at once (each block is 128 bits) and shuffles them so that the first SSE2 register holds the first quarter (32 bits) of each block, the second SSE2 register holds the second quarter of each block, and so on. Then the same bitwise operations as in the Osvik paper are performed, but on 128-bit registers instead of 32-bit integers. Finally, a reverse shuffle is performed. And ta-da, we have just encrypted (or decrypted) four 128-bit blocks instead of a single 128-bit block, by using 128-bit registers instead of 32-bit integers.
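To make the shuffle concrete, here is a minimal sketch in C with SSE2 intrinsics. It is not VeraCrypt's code: the function names are made up for illustration, and only the rotate/shift/XOR pattern of Serpent's linear transformation is shown as a stand-in for a full round (S-boxes and key mixing are omitted). It assumes an x86 target, where SSE lanes and `uint32_t` words share the same little-endian layout.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Rotate each 32-bit lane left by n (1..31). */
#define ROL_X4(x, n) \
    _mm_or_si128(_mm_slli_epi32((x), (n)), _mm_srli_epi32((x), 32 - (n)))

/* In-place 4x4 transpose of 32-bit words; applying it twice is a no-op,
   so the same routine serves as the forward and the reverse shuffle. */
static void transpose4x4(__m128i r[4])
{
    __m128i t0 = _mm_unpacklo_epi32(r[0], r[1]);
    __m128i t1 = _mm_unpacklo_epi32(r[2], r[3]);
    __m128i t2 = _mm_unpackhi_epi32(r[0], r[1]);
    __m128i t3 = _mm_unpackhi_epi32(r[2], r[3]);
    r[0] = _mm_unpacklo_epi64(t0, t1); /* word 0 of blocks 0..3 */
    r[1] = _mm_unpackhi_epi64(t0, t1); /* word 1 of blocks 0..3 */
    r[2] = _mm_unpacklo_epi64(t2, t3); /* word 2 of blocks 0..3 */
    r[3] = _mm_unpackhi_epi64(t2, t3); /* word 3 of blocks 0..3 */
}

/* Rotate/shift/XOR mixing applied to four blocks at once. */
static void linear_mix_x4(__m128i w[4])
{
    w[0] = ROL_X4(w[0], 13);
    w[2] = ROL_X4(w[2], 3);
    w[1] = _mm_xor_si128(w[1], _mm_xor_si128(w[0], w[2]));
    w[3] = _mm_xor_si128(w[3], _mm_xor_si128(w[2], _mm_slli_epi32(w[0], 3)));
    w[1] = ROL_X4(w[1], 1);
    w[3] = ROL_X4(w[3], 7);
    w[0] = _mm_xor_si128(w[0], _mm_xor_si128(w[1], w[3]));
    w[2] = _mm_xor_si128(w[2], _mm_xor_si128(w[3], _mm_slli_epi32(w[1], 7)));
    w[0] = ROL_X4(w[0], 5);
    w[2] = ROL_X4(w[2], 22);
}

static uint32_t rol32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* The same mixing on one block, using plain 32-bit integers. */
static void linear_mix_x1(uint32_t w[4])
{
    w[0] = rol32(w[0], 13);
    w[2] = rol32(w[2], 3);
    w[1] ^= w[0] ^ w[2];
    w[3] ^= w[2] ^ (w[0] << 3);
    w[1] = rol32(w[1], 1);
    w[3] = rol32(w[3], 7);
    w[0] ^= w[1] ^ w[3];
    w[2] ^= w[3] ^ (w[1] << 7);
    w[0] = rol32(w[0], 5);
    w[2] = rol32(w[2], 22);
}

/* Load 4 blocks, shuffle, mix, reverse shuffle, store. */
void mix4_sse2(const uint8_t in[64], uint8_t out[64])
{
    __m128i w[4];
    for (int i = 0; i < 4; i++)
        w[i] = _mm_loadu_si128((const __m128i *)(in + 16 * i));
    transpose4x4(w);
    linear_mix_x4(w);
    transpose4x4(w);
    for (int i = 0; i < 4; i++)
        _mm_storeu_si128((__m128i *)(out + 16 * i), w[i]);
}

/* Self-check: the 4-way path must agree with the scalar path. */
int check_sse2_against_scalar(void)
{
    uint8_t in[64], out[64], ref[64];
    for (int i = 0; i < 64; i++)
        in[i] = (uint8_t)(i * 37 + 11);
    mix4_sse2(in, out);
    for (int b = 0; b < 4; b++) {
        uint32_t w[4];
        memcpy(w, in + 16 * b, 16);
        linear_mix_x1(w);
        memcpy(ref + 16 * b, w, 16);
    }
    assert(memcmp(out, ref, 64) == 0);
    return 1;
}
```

The point of the sketch is that `linear_mix_x4` is a line-by-line copy of `linear_mix_x1` with every integer operation replaced by its lane-wise counterpart; only the two transposes are extra work.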

This could easily be extended from SSE2 (128-bit) to AVX2 (256-bit) or AVX-512 (512-bit). Instead of loading 4 blocks at once (each block is 128 bits), load 8 or 16 blocks at once. The shuffle would move the first quarter of each block into the first AVX2 or AVX-512 register, the second quarter of each block into the second wide register, and so on. Then the same bitwise operations as in the Osvik paper would be performed, but on wide registers instead of 32-bit integers. A reverse shuffle finishes the job, writing the 4 wide registers back to 8 or 16 blocks in main memory.
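The widening can be illustrated without committing to a particular instruction set. The sketch below is plain C, not AVX2 intrinsics: `LANES` is a hypothetical compile-time knob (8 would correspond to one AVX2 register per word, 16 to AVX-512), and all names are invented for this example. A real AVX2 version would use intrinsics for the shuffle, and would need some care there because the AVX2 unpack instructions operate within each 128-bit half of the register.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical knob: 4 mimics SSE2, 8 AVX2, 16 AVX-512. */
#define LANES 8

/* w[i][b] is 32-bit word i of block b, so each row w[i]
   corresponds to one wide register. */
typedef struct { uint32_t w[4][LANES]; } sliced_blocks;

/* Forward shuffle: LANES consecutive 16-byte blocks -> 4 rows. */
static void slice_blocks(const uint8_t *in, sliced_blocks *s)
{
    for (int b = 0; b < LANES; b++)
        for (int i = 0; i < 4; i++)
            memcpy(&s->w[i][b], in + 16 * b + 4 * i, 4);
}

/* Reverse shuffle: 4 rows -> LANES consecutive 16-byte blocks. */
static void unslice_blocks(const sliced_blocks *s, uint8_t *out)
{
    for (int b = 0; b < LANES; b++)
        for (int i = 0; i < 4; i++)
            memcpy(out + 16 * b + 4 * i, &s->w[i][b], 4);
}

/* One rotate/XOR step applied to every lane; the loop body is
   identical for every width, only LANES changes. */
static void mix_step(sliced_blocks *s)
{
    for (int b = 0; b < LANES; b++) {
        uint32_t x0 = s->w[0][b], x2 = s->w[2][b];
        x0 = (x0 << 13) | (x0 >> 19);
        x2 = (x2 << 3) | (x2 >> 29);
        s->w[0][b] = x0;
        s->w[2][b] = x2;
        s->w[1][b] ^= x0 ^ x2;
    }
}

/* Self-check of the layout, the round trip, and one mixed lane. */
int check_sliced_layout(void)
{
    uint8_t in[16 * LANES], out[16 * LANES];
    sliced_blocks s;
    uint32_t x[4];
    const int last = LANES - 1;

    for (int i = 0; i < 16 * LANES; i++)
        in[i] = (uint8_t)(i * 29 + 7);

    slice_blocks(in, &s);
    memcpy(x, in + 16 * last, 16);          /* words of the last block */
    assert(s.w[2][last] == x[2]);

    unslice_blocks(&s, out);
    assert(memcmp(in, out, sizeof in) == 0); /* shuffle is reversible */

    mix_step(&s);
    x[0] = (x[0] << 13) | (x[0] >> 19);
    x[2] = (x[2] << 3) | (x[2] >> 29);
    x[1] ^= x[0] ^ x[2];
    assert(s.w[0][last] == x[0] && s.w[1][last] == x[1]);
    return 1;
}
```

Changing `LANES` is the only edit needed to go from 4 to 8 or 16 blocks per pass here, which mirrors the claim above that the bitwise operations stay the same and only the register width and the shuffle change.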

I have already implemented this optimization in my own cryptographic project and measured roughly a 2x speedup going from SSE2 to AVX2. I don't have an AVX-512 capable computer; I borrowed one for a while and didn't notice any significant speedup going from AVX2 to AVX-512. This might be because the test ran on a "weak" but otherwise massively parallel Intel processor that splits each 512-bit operation into two separate 256-bit operations.

Anyway, I believe this suggestion could significantly improve the performance of the Serpent cipher: roughly 2x, or even more on a recent enough Intel or AMD processor.

@MarekKnapek

The general case could also be improved by using 64-bit integers to encrypt two blocks at a time.
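One way this could look, as a sketch under my own assumptions rather than a tested design: pack the same 32-bit word of two blocks into the low and high halves of a `uint64_t`. XOR, AND, and OR then work unchanged, but shifts and rotations need a mask so that bits do not leak from one half into the other. The helper names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Replicate a 32-bit mask into both halves of a 64-bit word. */
#define REP32(m) ((uint64_t)(uint32_t)(m) * 0x0000000100000001ULL)

/* Low half: word from block A. High half: same word from block B. */
static uint64_t pack2(uint32_t a, uint32_t b)
{
    return (uint64_t)a | ((uint64_t)b << 32);
}

/* Shift both halves left by n (1..31); the mask clears the bits
   that spilled from the low half into the high half. */
static uint64_t shl32x2(uint64_t x, unsigned n)
{
    return (x << n) & REP32(0xFFFFFFFFu << n);
}

/* Shift both halves right by n (1..31); the mask clears the bits
   that spilled from the high half into the low half. */
static uint64_t shr32x2(uint64_t x, unsigned n)
{
    return (x >> n) & REP32(0xFFFFFFFFu >> n);
}

/* Rotate both halves left by n (1..31), independently. */
static uint64_t rol32x2(uint64_t x, unsigned n)
{
    return shl32x2(x, n) | shr32x2(x, 32 - n);
}

static uint32_t rol32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* Self-check: packed operations must match two scalar operations. */
int check_packed_against_scalar(void)
{
    const uint32_t a = 0x01234567u, b = 0x89ABCDEFu;
    const uint64_t p = pack2(a, b);

    for (unsigned n = 1; n < 32; n++) {
        uint64_t r = rol32x2(p, n);
        uint64_t s = shl32x2(p, n);
        assert((uint32_t)r == rol32(a, n));
        assert((uint32_t)(r >> 32) == rol32(b, n));
        assert((uint32_t)s == (uint32_t)(a << n));
        assert((uint32_t)(s >> 32) == (uint32_t)(b << n));
    }
    /* XOR needs no fix-up at all. */
    assert((p ^ pack2(b, a)) == pack2(a ^ b, b ^ a));
    return 1;
}
```

The masking makes each rotation more expensive than a native 32-bit rotate, so whether this is a net win over the plain 32-bit code would have to be measured.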

ARM NEON could also be used, but I don't know NEON well enough.
