-
Notifications
You must be signed in to change notification settings - Fork 816
Implement BLEU score evaluation for NLP tests #6537
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/Utilities/CollectionBuilderAttribute.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/NGram.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/NGram.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/NGram.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/NGram.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/NGram.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/NGram.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/NGram.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/NGram.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/NGram.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/NGram.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/RationalNumber.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/RationalNumber.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/RationalNumber.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/RationalNumber.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/SmoothingFunction.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/SmoothingFunction.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/SmoothingFunction.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/SmoothingFunction.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/MatchCounter.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/MatchCounter.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/MatchCounter.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/MatchCounter.cs
Outdated
Show resolved
Hide resolved
/// <summary> | ||
/// Tokenizes a string into segments using the common rules established by the NLTK word tokenizer. | ||
/// </summary> | ||
public static class SimpleWordTokenizer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wish we could use some pre-existing library for this and avoid adding this to our public surface area. Does Microsoft.ML.Tokenizers contain anything we could use for this? Alternatively, can we potentially move this there? Tagging @luisquintanilla for awareness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could not find anything suitable in Microsoft.ML.Tokenizers
. This tokenizer is quite a bit simpler than most of the other word-based tokenizers and seems to only be used in the NLP space.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, adding another dependency would be unfortunate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dependency could be in the user's code - we could accept a lambda as part of the evaluator's constructor and the user could provide a lambda that calls into another library.
However, on second look, does this need to be public? Looks like it is used within the BLEUEvaluator
but is there a scenario where a user would need to call this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc: @tarekgh
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn’t this be handled using a regular expression? If you gave this code to Copilot, it would likely generate a suitable regex pattern rather than requiring all of this custom logic. This is generally how pre-tokenization works in our tokenizer library as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other implementations I looked at used a number of regular expressions, with slightly different replacements for each. I think the order of replacement is important. I unrolled it into this routine to make it more efficient so that we wouldn't have so many string allocations.
@tarekgh, can you point to an example pre-tokenization routine that you are referring to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn’t suggesting that you use the pre-tokenization from the tokenizer library—my point was simply to ask whether a regular expression could work in your case instead of custom parsing logic. If your implementation is more optimized in terms of string allocations, that makes sense. I just wanted to check whether you had considered regex as an alternative.
If you're still curious about how pre-tokenization works in the tokenizer library, here's a reference to the relevant code. That said, it ultimately depends on the specific regex pattern you provide.
The implementation is optimized and also leverages source generators. Additionally, we have a composite pre-tokenizer that can chain multiple pre-tokenizers to run in sequence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also @tarekgh we were curious if the need for word tokenization has come up on your side before. This seems to be a popular need in (older) NLP algorithms like the ones being added in this PR although it's probably much simpler than what we need to support in the ML / LLM contexts now. We were not sure how extensible the word tokenization might need to be for the NLP scores that we are implementing (i.e. would users need more control on how to tokenize strings for different NLP algorithms or different kinds of input strings). And we were also not sure if there are already existing .NET APIs that handle some of this.
For now, we decided to make the tokenization helpers above internal and wait for feedback to see if users ask for more customizability. But just curious in case you have any insights / pointers to share around this.
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/SimpleWordTokenizer.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/BLEUAlgorithm.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/BLEUAlgorithm.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/BLEUAlgorithm.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/BLEUAlgorithm.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/BLEUAlgorithm.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/BLEUAlgorithm.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/BLEUAlgorithm.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/BLEUAlgorithm.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/BLEUAlgorithm.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/BLEUAlgorithm.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/BLEUAlgorithm.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/BLEUAlgorithm.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/BLEUAlgorithm.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/BLEU/BLEUAlgorithm.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/Common/BLEUAlgorithm.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/Common/NGramExtensions.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/Common/MatchCounter.cs
Outdated
Show resolved
Hide resolved
src/Libraries/Microsoft.Extensions.AI.Evaluation.NLP/Common/MatchCounter.cs
Show resolved
Hide resolved
|
||
internal static class Constants | ||
{ | ||
public const string Version = "$(Version)"%3B |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is Version
being used in the code in this project or was this file just created as a result of a copy paste? Please delete the file if the constant is unused in the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of our libraries include that version number type. Seems like it would be a good idea to retain consistency here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think all of them do - for example, the Safety
package does not and neither does Quality
or the core Evaluation
package. It is currently only used in Reporting
and Console
both of which need to reference the Version
in the code.
/// <summary> | ||
/// Tokenizes a string into segments using the common rules established by the NLTK word tokenizer. | ||
/// </summary> | ||
[Experimental("AIEVAL001")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed offline, let's make this internal
for now. We can wait for feedback to see if we need extensibility for word tokenization.
Please remove the [Experimental]
here and the <InjectExperimentalAttributeOnLegacy>
from the project file as part of making this internal
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I already pushed the change to make it internal. We are using experimental on other classes, so the injection should remain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to use Experimental
on the other classes if the package itself is in preview? The tokenizer usage was special since there were questions about what the API should look like / whether it should be part of some other library.
But I don't think there are similar questions for the rest of the stuff - and we would be reviewing the evaluators and contexts before marking the package public regardless... Also, the Safety
evaluators which are also in preview are not Experimental
currently, so marking just the NLP
ones Experimental
would be confusing.
Add a new library for algorithmic NPL scoring evaluators, named
Microsoft.Extensions.AI.Evaluation.NLP
.The first such evaluator that is implemented is BLEU. For this implementation we default to 4 even weights for the n-gram comparisons and we use the smoothing method 4 from Chen and Cherry (2014) for sentence-level BLEU scores. These are the same defaults chosen by the Azure python evaluation SDK.
Also included is a simple word tokenizer based on the tokenizers used for other BLEU implementations, such as MOSES, SacreBLEU and NLTK.
Microsoft Reviewers: Open in CodeFlow