feat: Support CUDA graphs for EAGLE3 #3176
Putting this out to get early feedback and ideas. I'm not happy with the design right now. Specifically, passing state between the target and draft models gets complicated once CUDA graphs enter the picture.
Mostly nits.
Had some discussion offline with @hlu1 about how to make it cleaner:
I think (2) should definitely be done now. (1) is a pretty big refactor and would probably have to be done in follow-ups.
/bot run --disable-fail-fast
PR_Github #2196 [ run ] triggered by Bot
PR_Github #2196 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #2212 [ run ] triggered by Bot
PR_Github #2212 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #2348 [ run ] triggered by Bot
PR_Github #2348 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #2357 [ run ] triggered by Bot
/bot run --disable-fail-fast
PR_Github #2363 [ run ] triggered by Bot
PR_Github #2357 [ run ] completed with state
Approve to unblock.
PR_Github #2363 [ run ] completed with state
Signed-off-by: Mike Iovine <[email protected]>
/bot run --disable-fail-fast
Running CI one more time before merging, as it has been a while since my last rebase.
PR_Github #2497 [ run ] triggered by Bot
PR_Github #2497 [ run ] completed with state
/bot skip --comment "Pipeline passed before last rebase"
PR_Github #2510 [ skip ] triggered by Bot
PR_Github #2510 [ skip ] completed with state
Support CUDA graphs for EAGLE-3 speculative decoding. This PR also fixes loading EAGLE-3 weights for Llama 3 70B models (previously only tested with 8B).
CUDA graphs significantly improve performance. However, substantial work remains to eliminate host overhead.