
feat: FP8 Rowwise quantization support for Cohere models #3127


Open · aikitoria wants to merge 2 commits into main

Conversation

@aikitoria (Author)

This adds FP8 support for the LayerNorm kernel in the same way as was done for the RmsNorm kernel, which then allows us to use FP8 Rowwise quantization with the Cohere models.

For previous discussion, see #2912
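
To make the intent concrete, here is a minimal conceptual sketch of what "rowwise" FP8 quantization means in this context: after normalization, each token row gets its own scale derived from the row's absolute maximum before the values are cast to FP8. This is illustrative only (plain C++, invented names, no CUDA), not the kernel code in this PR.

#include <algorithm>
#include <cmath>
#include <vector>

// Conceptual sketch of per-token ("rowwise") FP8 quantization after LayerNorm.
// A real fused kernel would do this per row inside the normalization kernel and
// cast to __nv_fp8_e4m3; here plain floats stand in for the FP8 values.
void quantizeRowwise(std::vector<float> const& normed, int tokens, int hidden_dim,
    std::vector<float>& scale_per_token, std::vector<float>& scaled)
{
    constexpr float kFp8E4m3Max = 448.f; // largest finite FP8 E4M3 value
    scale_per_token.assign(tokens, 0.f);
    scaled.assign(normed.size(), 0.f);
    for (int t = 0; t < tokens; ++t)
    {
        // The per-row absolute maximum determines this token's scale.
        float maxAbs = 1e-6f;
        for (int h = 0; h < hidden_dim; ++h)
        {
            maxAbs = std::max(maxAbs, std::fabs(normed[t * hidden_dim + h]));
        }
        float const scale = maxAbs / kFp8E4m3Max; // dequantize with fp8_value * scale
        scale_per_token[t] = scale;
        for (int h = 0; h < hidden_dim; ++h)
        {
            // A real kernel would cast this value to __nv_fp8_e4m3 here.
            scaled[t * hidden_dim + h] = normed[t * hidden_dim + h] / scale;
        }
    }
}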

@juney-nvidia (Collaborator)

/bot run

juney-nvidia requested review from ming-wei and QiJune on March 28, 2025 00:06
@juney-nvidia (Collaborator)

@QiJune @ming-wei please help review this MR.

@juney-nvidia (Collaborator)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #672 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #672 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #565 completed with status: 'FAILURE'

@aikitoria (Author)

It looks like the CI failed, but the links go to some internal domains, so I can't see what the error is. I have some ideas about what it might be... I probably need to update other usages of the LayerNorm quantization plugin to handle the new parameters.

@juney-nvidia (Collaborator) commented Mar 29, 2025


@aikitoria your code failed to pass the pre-commit check.

Currently the pre-commit check failure is not copied back to the public CI results to be viewable, and we are working to improve this with this MR:

For now I just manually copy the error message to provide quick feedback:
[image: pre-commit error message]

You can also refer here to do the pre-commit check locally in your own dev environment.

June

@aikitoria (Author)

Oh I see, I will fix the formatting for both PRs

@ming-wei (Collaborator) left a comment


Thank you for the contribution!

I've left a few comments, but the PR looks overall good.

@juney-nvidia It'd be good if we can find someone familiar with quantization support. I personally don't have hands-on quantization experience, so I might miss something.

int tokens, int hidden_dim, float const* scale_orig_quant_per_tensor, float* scale_orig_quant_per_token,
int8_t* normed_output_quant, bool use_shmem)
int tokens, int hidden_dim, float const* clampPtr, float const* scale_orig_quant_per_tensor,
float* scale_orig_quant_per_token, float* sum_per_token, QuantT* normed_output_quant, bool hasFp8MinScaling)

  • hasFp8MinScaling -> has_fp8_min_scaling
  • clampPtr -> clamp_ptr

to keep the coding style consistent with other params?
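
For illustration, a declaration-only sketch with the suggested snake_case names (the function name here is hypothetical; the parameter list mirrors the one quoted above):

// Hypothetical declaration showing the suggested renames; everything else is
// unchanged relative to the signature quoted above.
template <typename QuantT>
void invokeLayerNormQuantization(int tokens, int hidden_dim, float const* clamp_ptr,
    float const* scale_orig_quant_per_tensor, float* scale_orig_quant_per_token, float* sum_per_token,
    QuantT* normed_output_quant, bool has_fp8_min_scaling);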

float* scale_orig_quant_per_token, int8_t* normed_output_quant, const dim3 grid, const dim3 block,
const size_t shmem_size, cudaStream_t stream)
float const eps, int tokens, int hidden_dim, float const* clampPtr, float const* scale_orig_quant_per_tensor,
float* scale_orig_quant_per_token, float* sum_per_token, QuantT* normed_output_quant, bool const hasFp8MinScaling,

ditto, naming convention issue. Please also check other occurrences of clampPtr/hasFp8MinScaling.

}
// Dynamic scaling if enabled
return (inOut[pos].type == nvinfer1::DataType::kFLOAT) && (inOut[pos].format == TensorFormat::kLINEAR);
else if (pos == 5 + int(mClampValEnabled))

I feel like it'd be clearer if we treated pos as the position relative to the input/output starting index:

if (pos < nbInputs) {
  // pos is 0-based input pos.
  if (pos < 3) {
    ...
  }
  ...
} else {
  pos -= nbInputs;
  // pos is 0-based output pos.
  if (pos == 0) {
    // Quantized output
    ...
  }
  ...
}
...

}

@@ -185,14 +270,25 @@ int LayernormQuantizationPlugin::enqueue(nvinfer1::PluginTensorDesc const* input
nvinfer1::DataType LayernormQuantizationPlugin::getOutputDataType(
int index, nvinfer1::DataType const* inputTypes, int nbInputs) const noexcept
{
assert((mDynActScaling && index < 2) || (!mDynActScaling && index == 0));
assert(index <= 2);

Shouldn't this check depend on the value of mDynActScaling and mSumPerToken?

Besides, I'd use index < 3 instead of index <= 2 if there are 3 outputs in total.
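
A minimal sketch of the kind of check being suggested (illustrative free function; the real plugin would use its mDynActScaling and mSumPerToken members, and its actual output layout may differ):

#include <cassert>

// One quantized output always exists; the per-token scale and per-token sum
// outputs are optional, so the valid output index range depends on both flags.
inline void checkOutputIndex(int index, bool dynActScaling, bool sumPerToken)
{
    assert(index >= 0 && index < 1 + int(dynActScaling) + int(sumPerToken));
}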

juney-nvidia requested review from Tracin and wm2012011492 and removed the request for QiJune on March 31, 2025 06:13
@juney-nvidia (Collaborator)

> Thank you for the contribution!
>
> I've left a few comments, but the PR looks overall good.
>
> @juney-nvidia It'd be good if we can find someone familiar with quantization support. I personally don't have hands-on quantization experience, so I might miss something.

Sure, I just added @Tracin to the code reviewer loop.

Thanks
June

bool mClampValEnabled;
// The quantization mode.
tensorrt_llm::common::QuantMode mQuantMode;
// Should we output the sum of channels per-token? (Used by QServe GEMM)

Should we output sum in this scenario? @bobboli


To my knowledge, the per-token sum is only used by QServe's per-channel w4a8 GEMM. FP8 rowwise should not need this.

byshiue self-requested a review on March 31, 2025 07:22
@wm2012011492 (Collaborator) commented Mar 31, 2025

Hi @aikitoria, would you mind adding a functional unit test like tests/unittest/trt/quantization/test_smooth_quant_layer_norm.py? It would also be better to add an example usage in examples/commandr/README.md. Thanks.

default=False,
help="Enable FP8 per-token per-channel quantization")
parser.add_argument(
"--use_meta_fp8_rowwise_recipe",

Hi @aikitoria, use_meta_fp8_rowwise_recipe cannot be enabled here because the Cohere model only has input_layernorm, which is directly followed by the MLP layer. If use_meta_fp8_rowwise_recipe is enabled, the input_layernorm will be excluded from quantization and will produce only one output, while the following Fp8RowwiseFusedGatedMLP requires two tensors (quantized_input and scale). So it's better to delete the argument.

quant_config.quant_algo = QuantAlgo.FP8
elif args.use_fp8_rowwise:
quant_config.quant_algo = QuantAlgo.FP8_PER_CHANNEL_PER_TOKEN
quant_config.use_meta_recipe = args.use_meta_fp8_rowwise_recipe

Please also delete this line to avoid confusion. Thanks

@ming-wei (Collaborator) commented Apr 8, 2025

@aikitoria any update on this?

@aikitoria (Author) commented Apr 8, 2025

Sorry, I have been busy at work. I will come back to this this week!

Edit: Still haven't been able to get to it
