-
Notifications
You must be signed in to change notification settings - Fork 1.2k
LUCENE-10493: factor out Viterbi algorithm and share it between kuromoji and nori #805
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I took the same bottom-up approach as https://issues.apache.org/jira/browse/LUCENE-10393 here again (determine the duplicate code and sort out the interfaces). I tried but couldn't break it up into smaller patches... I will keep open this for waiting for feedback. Hope these changes make sense. |
@rmuir would you mind reviewing this, or do you think we shouldn't proceed this way? |
Sure, sorry for the slow response. |
I think @mikemccand actually created most of this code and is most familiar with it. Mike, if you have time can you look too? For the special n-best class, is the issue that |
Whoa, this sounds awesome! I will try to review soon. Thanks @mocobeta. |
Hi Robert and Mike, thank you for your response. I think this can be kept open for a sufficient time period - it is unlikely to happen large conflicts between these changes and the main branch. I reviewed it several times during making this patch and I believe this does not change the tokenizers' behavior (although I can't fully guarantee that). I'd be glad if you give feedback about the overall design, and ideas to make the interfaces safer for further refactoring or development. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @mocobeta for tackling this crazy complex code!!
Big +1 to take progress-not-perfection approach and factor out code that is clearly in common into Viterbi
base-class.
We can later try to go further e.g. bringing n-best to Nori as well. I peeked at the methods like backtrace
and they really are quite divergent across the two, and I added a comment on one possible bug-fix that we failed to cross-port. So it's awesome to take this step today to prevent further unnecessary divergence, and to enable future tokenizers to more easily re-use the Viterbi algorithm for finding best 1 or N tokenization paths for weighted dictionary based tokenizers.
I love how much common code you were able to delete! And how GraphvizFormatter
is also factored out :)
* avoid an assertion error {@link RollingCharBuffer#get(int)} when it tries to generate an | ||
* empty buffer | ||
*/ | ||
if (endPos == lastBackTracePos) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this was a bug fix to Kuromoji that should've been ported over to Nori?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 I think this should be ported to Nori.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks awesome, thanks @mocobeta!
@mikemccand thanks for reviewing such a huge PR! I'll look through this again, and merge it maybe next week. |
I defer to Mike, he understands this Viterbi. I am happy to see this refactoring, thank you! |
I'm merging this and will observe the nightly benchmark (this shouldn't affect the analysis performance). |
* main: (68 commits) LUCENE-10292: Suggest: Fix FreeTextSuggester so that getCount() returned results consistent with lookup() during concurrent build() LUCENE-10525 Improve WindowsFS emulation to catch invalid file names (apache#829) LUCENE-10508: Use MIN_WIDE_EXTENT for all wide rectangles (apache#845) LUCENE-10470: [Tessellator] Fix some failing polygons due to collinear edges (apache#756) LUCENE-10508: Fix error for rectangles with an extent close to 180 degrees (apache#824) LUCENE-10529: Fix TestTaxonomyFacetAssociations NPE when randomly indexing no documents for dim fix path to jar file in demo documentation LUCENE-10499: reduce unnecessary copy data overhead when growing array size (apache#786) LUCENE-10535: upgrade com.palantir.consistent-versions to 2.10.0 LUCENE-10533: SpellChecker.formGrams is missing bounds check (apache#836) Upgrade spotless and use runToFixMessage for 'gradlew tidy' hint. (apache#834) Fix JVM error branch logic. (apache#835) LUCENE-10493: factor out Viterbi algorithm and share it between kuromoji and nori (apache#805) LUCENE-8836: Speed up TermsEnum#lookupOrd on increasing sequences of ords. (apache#827) LUCENE-10528: use Xvfb in test to avoid messing up user's desktop (apache#828) LUCENE-10315: Speed up DocIdsWriter by ForUtil (apache#797) LUCENE-10517: Improve performance of SortedSetDV faceting by iterating on class types (apache#812) Add two facet tests (apache#826) fail clearly on too-new JDK (apache#819) improve spotless error to suggest running 'gradlew tidy' (apache#817) ...
I looked closely at the code again and then found the n-best logic can be separated from JapaneseTokenizer (not perfect though); opened a follow-up #846. It's a safe change I think and I plan to merge this to main if there are no objections. |
See also https://issues.apache.org/jira/browse/LUCENE-10493.
Main changes:
org.apache.lucene.analysis.morph.Viterbi
, which performs the Viterbi algorithm for Hidden Markov Model (or Conditional Random Fields) based tokenizers.forward()
(read one character from the input and grow the graph) andbackward()
(add output tokens on the minimum cost path to the pending list that is consumed byTokenizer#incrementToken()
).Viterbi
to Kuromoji and Nori.backtrace()
inJapaneseTokenizer
andKoreanTokenizer
needs language-specific information. We could (and should) decouple each language-related stuff frombacktrace()
, but it'd be not a trivial refactoring to me. I'd defer it for now.backtraceNBest()
needs information that is specific to JapaneseToeknizer... I'd defer the generalization for now.org.apache.lucene.analysis.morph.GraphvizFormatter
class to analysis-common, which is shared between Kuromoji and Nori.org.apache.lucene.analysis.ja.GraphvizFormatter
andorg.apache.lucene.analysis.ko.GraphvizFormatter
are removed.Viterbi
,JapaneseTokenizer
andKoreanTokenizer
are facade classes that only have constructors andincrementToken()
,close()
,reset()
,end()
.