
Implement BLEU score evaluation for NLP tests #6537


Open — wants to merge 9 commits into base: main

Conversation

peterwald (Member)

@peterwald peterwald commented Jun 23, 2025

Add a new library for algorithmic NLP scoring evaluators, named Microsoft.Extensions.AI.Evaluation.NLP.

The first such evaluator implemented is BLEU. This implementation defaults to four even weights for the n-gram comparisons (orders 1–4) and uses smoothing method 4 from Chen and Cherry (2014) for sentence-level BLEU scores. These are the same defaults chosen by the Azure Python evaluation SDK.
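For reference, the computation described above — uniform weights over n-gram orders 1–4, with zero-match precisions smoothed per Chen and Cherry's method 4 — can be sketched in Python roughly as follows. This is an illustrative sketch, not the PR's C# code; the `k = 5` smoothing constant is the NLTK default and is assumed here.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, hypothesis, max_n=4, k=5):
    """Sentence-level BLEU with uniform weights and Chen & Cherry (2014)
    smoothing method 4 applied to zero-match precisions (sketch only)."""
    hyp_len, ref_len = len(hypothesis), len(reference)
    if hyp_len == 0:
        return 0.0
    precisions = []
    invcnt = 1
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        # Modified precision: clip each n-gram count by the reference count.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        if matches == 0 and hyp_len > 1:
            # Smoothing method 4: a pseudo-count that shrinks for shorter
            # hypotheses and halves for each successive zero precision.
            matches = 1.0 / (2 ** invcnt * k / math.log(hyp_len))
            invcnt += 1
        precisions.append(matches / total)
    if min(precisions) == 0:
        return 0.0
    # Uniform weights 1/max_n; geometric mean computed via logs.
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty for hypotheses shorter than the reference.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_avg)
```

An identical hypothesis and reference score 1.0; a partially overlapping hypothesis scores strictly between 0 and 1 thanks to the smoothing of higher-order zero matches.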

Also included is a simple word tokenizer based on the tokenizers used by other BLEU implementations, such as Moses, SacreBLEU, and NLTK.
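To give a feel for what such a tokenizer does, here is a rough Python sketch. The two rules below are simplified stand-ins for the Moses/NLTK-style rule sets and do not match the PR's actual C# implementation in detail.

```python
import re

# Ordered substitution rules (simplified, hypothetical): first isolate
# punctuation, then split contractions at the apostrophe.
_RULES = [
    (re.compile(r'([.,!?;:()"\[\]])'), r" \1 "),
    (re.compile(r"(\w)'(\w)"), r"\1 '\2"),
]

def simple_word_tokenize(text):
    """Apply each rule in order, then split on whitespace."""
    for pattern, repl in _RULES:
        text = pattern.sub(repl, text)
    return text.split()

print(simple_word_tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
print(simple_word_tokenize("don't"))          # ['don', "'t"]
```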


@peterwald peterwald requested a review from a team as a code owner June 23, 2025 21:55
@github-actions github-actions bot added the area-ai-eval Microsoft.Extensions.AI.Evaluation and related label Jun 23, 2025
/// <summary>
/// Tokenizes a string into segments using the common rules established by the NLTK word tokenizer.
/// </summary>
public static class SimpleWordTokenizer
Contributor

I wish we could use some pre-existing library for this and avoid adding this to our public surface area. Does Microsoft.ML.Tokenizers contain anything we could use for this? Alternatively, can we potentially move this there? Tagging @luisquintanilla for awareness.

Member Author

I could not find anything suitable in Microsoft.ML.Tokenizers. This tokenizer is quite a bit simpler than most of the other word-based tokenizers and seems to only be used in the NLP space.

Member Author

Also, adding another dependency would be unfortunate.

Contributor

The dependency could be in the user's code - we could accept a lambda as part of the evaluator's constructor and the user could provide a lambda that calls into another library.

However, on second look, does this need to be public? Looks like it is used within the BLEUEvaluator but is there a scenario where a user would need to call this?

Contributor

cc: @tarekgh

Member

Couldn’t this be handled using a regular expression? If you gave this code to Copilot, it would likely generate a suitable regex pattern rather than requiring all of this custom logic. This is generally how pre-tokenization works in our tokenizer library as well.

Member Author

The other implementations I looked at used a number of regular expressions, with slightly different replacements for each. I think the order of replacement is important. I unrolled it into this routine to make it more efficient so that we wouldn't have so many string allocations.

@tarekgh, can you point to an example pre-tokenization routine that you are referring to?
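The order-sensitivity mentioned above can be demonstrated with a toy pair of hypothetical rules (not the PR's actual rules): running an ellipsis rule before the single-period rule keeps `...` as one token, while the reverse order shatters it.

```python
import re

# Hypothetical rules to show that replacement order matters.
rules = [
    (re.compile(r"\.\.\."), " … "),  # keep ellipsis as a single token
    (re.compile(r"\."), " . "),      # isolate any remaining periods
]

def apply_rules(text, rule_list):
    """Apply each substitution in order, then split on whitespace."""
    for pattern, repl in rule_list:
        text = pattern.sub(repl, text)
    return text.split()

print(apply_rules("wait...", rules))        # ['wait', '…']
print(apply_rules("wait...", rules[::-1]))  # ['wait', '.', '.', '.']
```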

@tarekgh tarekgh (Member) commented Jun 26, 2025

I wasn’t suggesting that you use the pre-tokenization from the tokenizer library—my point was simply to ask whether a regular expression could work in your case instead of custom parsing logic. If your implementation is more optimized in terms of string allocations, that makes sense. I just wanted to check whether you had considered regex as an alternative.

If you're still curious about how pre-tokenization works in the tokenizer library, here's a reference to the relevant code. That said, it ultimately depends on the specific regex pattern you provide.

https://github.com/dotnet/machinelearning/blob/f8611259d53ef0566c760e62bdf7fee4d32821b1/src/Microsoft.ML.Tokenizers/PreTokenizer/RegexPreTokenizer.cs#L16

The implementation is optimized and also leverages source generators. Additionally, we have a composite pre-tokenizer that can chain multiple pre-tokenizers to run in sequence.
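The chaining idea can be sketched generically — each stage maps a segment to smaller segments, and the composite runs the stages in sequence. This is illustrative Python only, not the Microsoft.ML.Tokenizers API.

```python
import re

def composite(*stages):
    """Build a pre-tokenizer that runs each stage over every segment in turn."""
    def run(segments):
        for stage in stages:
            segments = [piece for seg in segments for piece in stage(seg)]
        return segments
    return run

# Two example stages: whitespace splitting, then punctuation isolation.
split_whitespace = lambda s: s.split()
split_punct = lambda s: [p for p in re.split(r"([.,!?])", s) if p]

pre_tokenize = composite(split_whitespace, split_punct)
print(pre_tokenize(["Hello, world!"]))  # ['Hello', ',', 'world', '!']
```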

Contributor

Also @tarekgh, we were curious whether the need for word tokenization has come up on your side before. It seems to be a common need in (older) NLP algorithms like the ones being added in this PR, although it's probably much simpler than what we need to support in ML / LLM contexts now. We were not sure how extensible the word tokenization might need to be for the NLP scores we are implementing (i.e., would users need more control over how to tokenize strings for different NLP algorithms or different kinds of input strings), and we were also not sure whether existing .NET APIs already handle some of this.

For now, we decided to make the tokenization helpers above internal and wait for feedback to see if users ask for more customizability. But just curious in case you have any insights / pointers to share around this.

@peterwald peterwald requested a review from a team as a code owner June 26, 2025 11:20

internal static class Constants
{
public const string Version = "$(Version)";
@shyamnamboodiripad shyamnamboodiripad (Contributor) commented Jun 26, 2025

Is Version being used in the code in this project or was this file just created as a result of a copy paste? Please delete the file if the constant is unused in the code.

Member Author

All of our libraries include that Version constant. It seems like a good idea to retain consistency here.

Contributor

I don't think all of them do - for example, the Safety package does not and neither does Quality or the core Evaluation package. It is currently only used in Reporting and Console both of which need to reference the Version in the code.

/// <summary>
/// Tokenizes a string into segments using the common rules established by the NLTK word tokenizer.
/// </summary>
[Experimental("AIEVAL001")]
Contributor

As we discussed offline, let's make this internal for now. We can wait for feedback to see if we need extensibility for word tokenization.

Please remove the [Experimental] here and the <InjectExperimentalAttributeOnLegacy> from the project file as part of making this internal.

Member Author

I already pushed the change to make it internal. We are using experimental on other classes, so the injection should remain.

@shyamnamboodiripad shyamnamboodiripad (Contributor) commented Jun 26, 2025

Do we need to use Experimental on the other classes if the package itself is in preview? The tokenizer usage was special since there were questions about what the API should look like / whether it should be part of some other library.

But I don't think there are similar questions for the rest of the stuff - and we would be reviewing the evaluators and contexts before marking the package public regardless... Also, the Safety evaluators which are also in preview are not Experimental currently, so marking just the NLP ones Experimental would be confusing.

Labels: area-ai-eval — Microsoft.Extensions.AI.Evaluation and related
4 participants