Ignore diacritics when searching #20550
We have many indexes in our application where we use full-text search by indexing a property like
We would like diacritics to be ignored when searching. For example, the search term "helene" should also return results such as "hèlene" or "hélène", and the same should apply to all diacritics. Is there by now a more elegant way than using a custom analyzer? If the solution is still a custom analyzer, could you please check whether the following analyzer would correctly extend
Should we reuse the previous token stream like you do in RavenStandardAnalyzer? (https://github.com/ravendb/ravendb/blob/1bb10eb36edadcafae0e220e783036c9505d3e54/src/Raven.Server/Documents/Indexes/Persistence/Lucene/Analyzers/RavenStandardAnalyzer.cs) Thanks for any help.
Simple analyzer:
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

namespace Analyzer;

// Extends Lucene's StandardAnalyzer with ASCII folding so that
// accented characters are indexed and searched in their unaccented form.
public class LanguageAnalyzer : StandardAnalyzer
{
    // LUCENE_29 matches the Lucene version used by RavenDB.
    public LanguageAnalyzer() : base(Lucene.Net.Util.Version.LUCENE_29)
    {
    }

    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // Wrap the standard token stream in a filter that folds diacritics.
        return new ASCIIFoldingFilter(base.TokenStream(fieldName, reader));
    }
}
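To make the effect of the folding step concrete for the example terms from the question, here is a minimal sketch that uses only the .NET base class library. It is not how ASCIIFoldingFilter is implemented (Lucene uses its own character mapping and covers more characters); it only approximates the result by decomposing each character and dropping the combining marks. The names DiacriticsDemo and Fold are hypothetical and exist only for this illustration.

```csharp
using System.Globalization;
using System.Text;

public static class DiacriticsDemo
{
    // Hypothetical helper, for illustration only: approximates diacritic
    // folding via Unicode decomposition. This is NOT Lucene's ASCIIFoldingFilter.
    public static string Fold(string input)
    {
        // Split each accented character into base letter + combining mark(s).
        string decomposed = input.Normalize(NormalizationForm.FormD);

        var result = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Keep everything except the combining (non-spacing) marks.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                result.Append(c);
            }
        }

        // Recompose whatever is left into the usual composed form.
        return result.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

With this sketch, DiacriticsDemo.Fold("hèlene") and DiacriticsDemo.Fold("hélène") both give "helene". When the same folding is applied at index time and at query time, as the analyzer above does, the search term "helene" can match both spellings.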
Thanks for your help, @maciejaszyk. Could you explain what the Lucene version is all about? Do we need to keep it up to date? Does the version have to correspond to that of RavenDB? Is there a constant for this? Thanks in advance for any brief feedback.
Hi,
I would say a custom analyzer is the most elegant way. This way you don't have to handle diacritics removal in your app.
Currently you cannot derive from RavenStandardAnalyzer because it is sealed. However, you can inherit from StandardAnalyzer, as the version we use is Version.LUCENE_29.
About ReusableTokenStream: it gives some performance gains. However, please be careful when implementing an override of it to avoid state-sharing between calls.