-
-
Notifications
You must be signed in to change notification settings - Fork 7k
Support FIPS enabled machines with MD5 hashing #15299
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
Signed-off-by: Matthew Vine <[email protected]>
Signed-off-by: Matthew Vine <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable to me (cc security expert @russellb :)). You could consider making an "insecure_md5" function that provides a common wrapper for hashlib.md5(..., usedforsecurity=False)
I can do that. Let me know if you want this. |
excited to see traction on this. Thanks @mgoin and @MattTheCuber |
Thank you for the report and PR! I definitely want to get this fixed. I'm a little torn on the change. Hashing here does have security implications. See this advisory where I changed this code recently: It was previously easy to predict KV cache conflicts when using vLLM with Python 3.12. It's much more difficult now after the change since we now always start with a random base number that's different each time your run vLLM. It's a fair argument that the use of MD5 makes it more feasible to calculate and abuse hash conflicts, though it doesn't seem very feasible unless you could find out the random number used as the base of our hash calculations. I think I'm on the side of just accepting the proposed change since I think it's "secure enough" that exploiting md5 hash collisions seems not practical with our current code. I didn't want to approve the change without giving that context first, though. If you all are still comfortable with the change with the understanding that there IS a slight security implication here, then I'm OK with it. Switching to SHA1 isn't trivial since it takes up a bit more memory than MD5 hashes and is slightly more expensive to calculate. We'd really need to measure the impact of the change carefully. |
Changing to SHA1 is definitely the right solution under this context. SHA1 is FIPS certified and is the best hashing standard when needing to balance security and speed. I can send you performance evaluations later (if you want) since I don't have them handy, but SHA1 is significantly faster than MD5 hashing. I can make that swap tomorrow if you would like? I just don't know if anything externally uses the hashes and would break. If you can confirm nothing does, I don't see why to not swap. Unless you are making hundreds of thousands of hashes, I don't think memory will be a problem. |
Thanks for the response. I had a feeling consensus wouldn't stay on MD5 with that context. I'll share this PR with some others that maintain this part of the code. |
It looks like SHA-1 is getting retired from FIPS? Maybe we should look at skipping SHA-1 for something else? https://csrc.nist.gov/news/2022/nist-transitioning-away-from-sha-1-for-all-apps |
OK I need to back up. I greatly apologize, but I didn't actually read the patch and just assumed what code this was talking about and I was completely wrong. I thought this was about how we do hashing for prefix caching, but that's using Python's built in hash() function, not MD5. Please forget everything I said in the last few messages, it is not relevant here. I'm fine with this change, though switching to another algorithm should also be fine, I think. I'll go ahead and approve as-is though. |
Oh man, I didn't realize that. I guess switching to SHA2 would be the best choice? I will test performance tomorrow on SHA2 vs MD5. I know SHA1 is much faster but I don't know about SHA2. Although, unless you are doing hundreds of thousands of hashes in a short time, the slowdown will likely be insignificant. |
Thanks again for raising this. Let me know if you want to talk about any other issues that affect security compliance! |
Signed-off-by: Matthew Vine <[email protected]>
Signed-off-by: Matthew Vine <[email protected]> Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Matthew Vine <[email protected]> Signed-off-by: xinyuxiao <[email protected]>
Signed-off-by: Matthew Vine <[email protected]> Signed-off-by: Louis Ulmer <[email protected]>
Signed-off-by: Matthew Vine <[email protected]>
FIPS enabled machines prohibit MD5 hashing for security reasons. Using
hashlib.md5
in Python will throw the errorValueError: [digital envelope routines] unsupported
. There are typically 2 resolutions to this problem: use theusedforsecrutiy=False
flag or use SHA1 hashing. Theusedforsecurity
flag is easier to implement and was introduced in Python 3.9, which is the minimum Python version for this library, so that is the method I chose.