
Implement BLEU score evaluation for NLP tests #6537


Open — wants to merge 9 commits into base: main

Conversation

peterwald (Member)

@peterwald peterwald commented Jun 23, 2025

Add a new library for algorithmic NLP scoring evaluators, named Microsoft.Extensions.AI.Evaluation.NLP.

The first such evaluator implemented is BLEU. This implementation defaults to four even weights for the n-gram comparisons (orders 1–4) and uses smoothing method 4 from Chen and Cherry (2014) for sentence-level BLEU scores. These are the same defaults chosen by the Azure Python evaluation SDK.
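For reference, the computation described above — uniform weights over n-gram orders 1–4, with zero-match precisions smoothed per Chen and Cherry's method 4 — can be sketched in Python roughly as follows. This is an illustrative sketch, not the PR's C# code; the `k = 5` smoothing constant is the NLTK default and is assumed here.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, hypothesis, max_n=4, k=5):
    """Sentence-level BLEU with uniform weights and Chen & Cherry (2014)
    smoothing method 4 applied to zero-match precisions (sketch only)."""
    hyp_len, ref_len = len(hypothesis), len(reference)
    if hyp_len == 0:
        return 0.0
    precisions = []
    invcnt = 1
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        # Modified precision: clip each n-gram count by the reference count.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        if matches == 0 and hyp_len > 1:
            # Smoothing method 4: a pseudo-count that shrinks for shorter
            # hypotheses and halves for each successive zero precision.
            matches = 1.0 / (2 ** invcnt * k / math.log(hyp_len))
            invcnt += 1
        precisions.append(matches / total)
    if min(precisions) == 0:
        return 0.0
    # Uniform weights 1/max_n; geometric mean computed via logs.
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty for hypotheses shorter than the reference.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_avg)
```

An identical hypothesis and reference score 1.0; a partially overlapping hypothesis scores strictly between 0 and 1 thanks to the smoothing of higher-order zero matches.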

Also included is a simple word tokenizer based on the tokenizers used by other BLEU implementations, such as Moses, SacreBLEU, and NLTK.
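To give a feel for what such a tokenizer does, here is a rough Python sketch. The two rules below are simplified stand-ins for the Moses/NLTK-style rule sets and do not match the PR's actual C# implementation in detail.

```python
import re

# Ordered substitution rules (simplified, hypothetical): first isolate
# punctuation, then split contractions at the apostrophe.
_RULES = [
    (re.compile(r'([.,!?;:()"\[\]])'), r" \1 "),
    (re.compile(r"(\w)'(\w)"), r"\1 '\2"),
]

def simple_word_tokenize(text):
    """Apply each rule in order, then split on whitespace."""
    for pattern, repl in _RULES:
        text = pattern.sub(repl, text)
    return text.split()

print(simple_word_tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
print(simple_word_tokenize("don't"))          # ['don', "'t"]
```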


@peterwald peterwald requested a review from a team as a code owner June 23, 2025 21:55
@github-actions github-actions bot added the area-ai-eval Microsoft.Extensions.AI.Evaluation and related label Jun 23, 2025
/// <summary>
/// Tokenizes a string into segments using the common rules established by the NLTK word tokenizer.
/// </summary>
public static class SimpleWordTokenizer
Contributor

I wish we could use some pre-existing library for this and avoid adding this to our public surface area. Does Microsoft.ML.Tokenizers contain anything we could use for this? Alternatively, can we potentially move this there? Tagging @luisquintanilla for awareness.

Member Author

I could not find anything suitable in Microsoft.ML.Tokenizers. This tokenizer is quite a bit simpler than most of the other word-based tokenizers and seems to only be used in the NLP space.

Member Author

Also, adding another dependency would be unfortunate.

Contributor

The dependency could be in the user's code - we could accept a lambda as part of the evaluator's constructor and the user could provide a lambda that calls into another library.

However, on second look, does this need to be public? Looks like it is used within the BLEUEvaluator but is there a scenario where a user would need to call this?

Contributor

cc: @tarekgh

Member

Couldn’t this be handled using a regular expression? If you gave this code to Copilot, it would likely generate a suitable regex pattern rather than requiring all of this custom logic. This is generally how pre-tokenization works in our tokenizer library as well.

Member Author

The other implementations I looked at used a number of regular expressions, with slightly different replacements for each. I think the order of replacement is important. I unrolled it into this routine to make it more efficient so that we wouldn't have so many string allocations.

@tarekgh, can you point to an example pre-tokenization routine that you are referring to?
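The order-sensitivity mentioned above can be demonstrated with a toy pair of hypothetical rules (not the PR's actual rules): running an ellipsis rule before the single-period rule keeps `...` as one token, while the reverse order shatters it.

```python
import re

# Hypothetical rules to show that replacement order matters.
rules = [
    (re.compile(r"\.\.\."), " … "),  # keep ellipsis as a single token
    (re.compile(r"\."), " . "),      # isolate any remaining periods
]

def apply_rules(text, rule_list):
    """Apply each substitution in order, then split on whitespace."""
    for pattern, repl in rule_list:
        text = pattern.sub(repl, text)
    return text.split()

print(apply_rules("wait...", rules))        # ['wait', '…']
print(apply_rules("wait...", rules[::-1]))  # ['wait', '.', '.', '.']
```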

@tarekgh tarekgh (Member) commented Jun 26, 2025

I wasn’t suggesting that you use the pre-tokenization from the tokenizer library—my point was simply to ask whether a regular expression could work in your case instead of custom parsing logic. If your implementation is more optimized in terms of string allocations, that makes sense. I just wanted to check whether you had considered regex as an alternative.

If you're still curious about how pre-tokenization works in the tokenizer library, here's a reference to the relevant code. That said, it ultimately depends on the specific regex pattern you provide.

https://github.com/dotnet/machinelearning/blob/f8611259d53ef0566c760e62bdf7fee4d32821b1/src/Microsoft.ML.Tokenizers/PreTokenizer/RegexPreTokenizer.cs#L16

The implementation is optimized and also leverages source generators. Additionally, we have a composite pre-tokenizer that can chain multiple pre-tokenizers to run in sequence.
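The chaining idea can be sketched generically — each stage maps a segment to smaller segments, and the composite runs the stages in sequence. This is illustrative Python only, not the Microsoft.ML.Tokenizers API.

```python
import re

def composite(*stages):
    """Build a pre-tokenizer that runs each stage over every segment in turn."""
    def run(segments):
        for stage in stages:
            segments = [piece for seg in segments for piece in stage(seg)]
        return segments
    return run

# Two example stages: whitespace splitting, then punctuation isolation.
split_whitespace = lambda s: s.split()
split_punct = lambda s: [p for p in re.split(r"([.,!?])", s) if p]

pre_tokenize = composite(split_whitespace, split_punct)
print(pre_tokenize(["Hello, world!"]))  # ['Hello', ',', 'world', '!']
```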

Contributor

Also @tarekgh, we were curious whether the need for word tokenization has come up on your side before. It seems to be a common need in (older) NLP algorithms like the ones being added in this PR, although it's probably much simpler than what we need to support in ML / LLM contexts now. We were not sure how extensible the word tokenization might need to be for the NLP scores we are implementing (i.e., would users need more control over how to tokenize strings for different NLP algorithms or different kinds of input strings), and we were also not sure whether existing .NET APIs already handle some of this.

For now, we decided to make the tokenization helpers above internal and wait for feedback to see if users ask for more customizability. But just curious in case you have any insights / pointers to share around this.

@peterwald peterwald requested a review from a team as a code owner June 26, 2025 11:20

internal static class Constants
{
public const string Version = "$(Version)";
@shyamnamboodiripad shyamnamboodiripad (Contributor) commented Jun 26, 2025

Is Version being used in the code in this project or was this file just created as a result of a copy paste? Please delete the file if the constant is unused in the code.

Member Author

All of our libraries include that Version constant. It seems like a good idea to retain consistency here.

Contributor

I don't think all of them do - for example, the Safety package does not and neither does Quality or the core Evaluation package. It is currently only used in Reporting and Console both of which need to reference the Version in the code.

/// <summary>
/// Tokenizes a string into segments using the common rules established by the NLTK word tokenizer.
/// </summary>
[Experimental("AIEVAL001")]
Contributor

As we discussed offline, let's make this internal for now. We can wait for feedback to see if we need extensibility for word tokenization.

Please remove the [Experimental] here and the <InjectExperimentalAttributeOnLegacy> from the project file as part of making this internal.

Member Author

I already pushed the change to make it internal. We are using experimental on other classes, so the injection should remain.

@shyamnamboodiripad shyamnamboodiripad (Contributor) commented Jun 26, 2025

Do we need to use Experimental on the other classes if the package itself is in preview? The tokenizer usage was special since there were questions about what the API should look like / whether it should be part of some other library.

But I don't think there are similar questions for the rest of the stuff - and we would be reviewing the evaluators and contexts before marking the package public regardless... Also, the Safety evaluators which are also in preview are not Experimental currently, so marking just the NLP ones Experimental would be confusing.

Labels: area-ai-eval — Microsoft.Extensions.AI.Evaluation and related
4 participants