Interest in Project 14 - Accelerating Inference of NNCF-Compressed LLMs with Triton #29520

arkhamHack · 2025-03-17T15:45:58Z

arkhamHack
Mar 17, 2025

Hi @alexsu52 @AlexanderDokuchaev,
I am Avigyan Sinha a junior ML developer and I graduated last year. This project looks really interesting, and I’d love to understand more about the scope and technical expectations.
I have worked with Cuda kernels and LLM pipelines and architecture. I hope to contribute well to this project and would love some guidance.
A few things I’d like to clarify:

Will the focus be primarily on server-grade GPUs (A100/H100), or will there also be optimizations for consumer GPUs like the RTX 30/40 series? Also will there be any infra constraints on us?
Custom Kernel Design – Are we expected to write Triton kernels from scratch, or will the project focus on adapting existing ones (GemLite, TinyGEMM, Marlin, AutoGPTQ)?
Quantization Strategy – Is there a priority on dynamic INT8 group quantization for better runtime efficiency, or will the focus be on static formats like INT4, NF4, FP4?
Would this project involve torch.compile custom ops to ensure seamless PyTorch compatibility?
Are there specific LLMs or datasets we should benchmark against, or is that left open?

This lines up well with my interest in LLM optimization and performance engineering, and I’d love to contribute to making compressed model inference faster across different hardware platforms. Looking forward to your thoughts!

Best,
Avigyan Sinha

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Interest in Project 14 - Accelerating Inference of NNCF-Compressed LLMs with Triton #29520

{{title}}

Replies: 0 comments

Select a reply

Interest in Project 14 - Accelerating Inference of NNCF-Compressed LLMs with Triton #29520

arkhamHack Mar 17, 2025

Replies: 0 comments

arkhamHack
Mar 17, 2025