LUCENE-10493: factor out Viterbi algorithm and share it between kuromoji and nori #805

Merged · 16 commits merged into apache:main on Apr 25, 2022

Conversation

@mocobeta (Contributor) commented Apr 8, 2022

See also https://issues.apache.org/jira/browse/LUCENE-10493.

Main changes:

  • Add a utility class org.apache.lucene.analysis.morph.Viterbi, which implements the Viterbi algorithm for Hidden Markov Model (or Conditional Random Field) based tokenizers.
    • Its most important methods are forward() (read one character from the input and grow the lattice) and backward() (add the tokens on the minimum-cost path to the pending list that is consumed by Tokenizer#incrementToken()); see the illustrative sketch after this list.
  • Add extensions of Viterbi to Kuromoji and Nori.
    • Note: These extensions are needed because the current backtrace() in JapaneseTokenizer and KoreanTokenizer needs language-specific information. We could (and should) decouple the language-specific parts from backtrace(), but that is not a trivial refactoring, so I'd defer it for now.
    • Note: The extension for Kuromoji also performs the n-best cost path calculation. It would be great to decouple this from JapaneseTokenizer too, but I found that the current backtraceNBest() needs information that is specific to JapaneseTokenizer... I'd defer that generalization for now.
  • Add a parameterized org.apache.lucene.analysis.morph.GraphvizFormatter class to analysis-common, which is shared between Kuromoji and Nori.
    • org.apache.lucene.analysis.ja.GraphvizFormatter and org.apache.lucene.analysis.ko.GraphvizFormatter are removed.
  • Since all the segmentation work is now done by Viterbi, JapaneseTokenizer and KoreanTokenizer are facade classes that only have their constructors plus incrementToken(), close(), reset(), and end().
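
To make the forward()/backward() division of labor concrete, here is a minimal sketch of how a tokenizer could drive such a Viterbi helper and drain its pending list from incrementToken(). This is illustrative only; the class and method names below are assumptions and do not reproduce the actual org.apache.lucene.analysis.morph.Viterbi signatures.

// Illustrative sketch only -- the names and signatures here are assumptions,
// not the actual org.apache.lucene.analysis.morph.Viterbi API.
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

abstract class ViterbiSketch<T> {
  // Tokens on the minimum-cost path found so far, drained by the tokenizer.
  protected final Deque<T> pending = new ArrayDeque<>();

  // Read one character from the input and grow the lattice by one position.
  protected abstract void forward() throws IOException;

  // Backtrace the minimum-cost path and append its tokens to the pending list.
  protected abstract void backward() throws IOException;

  // True once the whole input has been consumed.
  protected abstract boolean isEnd();

  // What a facade tokenizer's incrementToken() would call: keep growing the
  // lattice until some tokens are pending or the input is exhausted.
  public final T nextToken() throws IOException {
    while (pending.isEmpty() && !isEnd()) {
      forward();
      backward();
    }
    return pending.pollFirst(); // null when no tokens remain
  }
}

With a shape like this, JapaneseTokenizer and KoreanTokenizer can stay thin facades: incrementToken() asks for the next pending token, while close()/reset()/end() mostly manage shared state.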

@mocobeta marked this pull request as ready for review April 11, 2022 11:34
@mocobeta (Contributor, Author) commented Apr 11, 2022

I took the same bottom-up approach as https://issues.apache.org/jira/browse/LUCENE-10393 here again (identify the duplicated code, then sort out the interfaces).
I'll look through this again - some private/package-private members had to be made public/protected for now. Once we further distill the backtrace() and n-best logic and move it from kuromoji/nori to analysis-common, they can become private again; I hope (and believe) that is possible, but I'd leave it for future work.

I tried but couldn't break this up into smaller patches... I'll keep it open while waiting for feedback. I hope these changes make sense.

@mocobeta (Contributor, Author):

@rmuir would you mind reviewing this, or do you think we shouldn't proceed this way?
I know it's a rather radical refactoring, but I can't come up with a better way to unify the two tokenizers, sorry... Let me know if it'd be better to stop pursuing this approach.

@rmuir (Member) commented Apr 15, 2022

Sure, sorry for the slow response.

@rmuir (Member) commented Apr 15, 2022

I think @mikemccand actually created most of this code and is most familiar with it. Mike, if you have time can you look too?

For the special n-best class, is the issue just that Nori doesn't offer the n-best feature? Looking ahead, I wonder whether there's some reason it doesn't make sense there too, or whether it was simply overlooked before.

@mikemccand (Member):

Whoa, this sounds awesome! I will try to review soon. Thanks @mocobeta.

@mocobeta (Contributor, Author):

Hi Robert and Mike, thank you for your responses. I think this can stay open for a good while - large conflicts between these changes and the main branch are unlikely.

I reviewed it several times while working on this patch, and I believe it does not change the tokenizers' behavior (although I can't fully guarantee that). I'd be glad to get feedback on the overall design, and ideas for making the interfaces safer for further refactoring or development.

@mikemccand (Member) left a comment:

Thank you @mocobeta for tackling this crazy complex code!!

Big +1 for taking a progress-not-perfection approach and factoring out the code that is clearly common into a Viterbi base class.

We can later try to go further, e.g. bringing n-best to Nori as well. I peeked at methods like backtrace and they really are quite divergent across the two, and I added a comment on one possible bug fix that we failed to cross-port. So it's awesome to take this step today, to prevent further unnecessary divergence and to enable future tokenizers to more easily reuse the Viterbi algorithm for finding the best 1 or N tokenization paths in weighted, dictionary-based tokenizers.

I love how much common code you were able to delete! And how GraphvizFormatter is also factored out :)
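
As background for the "best 1 or N paths" remark above: the core of this kind of Viterbi search is a dynamic program over a token lattice, where each candidate token carries a word cost (plus a connection cost between adjacent tokens) and every lattice position remembers the cheapest way to reach it. The sketch below is a generic, self-contained illustration of that best-1 computation; it is not Lucene code, and the lattice representation is an assumption. The n-best variant in the Kuromoji extension follows the same idea but keeps the top-N partial paths per position instead of just one.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LatticeSketch {
  // A candidate token spanning [start, end) with a dictionary word cost.
  record Edge(int start, int end, String surface, int wordCost) {}

  // Best-1 Viterbi search: find the minimum-cost sequence of edges covering
  // text positions 0..len. Connection costs between adjacent tokens are
  // omitted here; real tokenizers add them from the dictionary's connection
  // matrix using left/right ids.
  static List<Edge> bestPath(List<Edge> edges, int len) {
    int[] bestCost = new int[len + 1];
    Edge[] bestEdge = new Edge[len + 1];
    Arrays.fill(bestCost, Integer.MAX_VALUE);
    bestCost[0] = 0;
    for (int pos = 0; pos < len; pos++) {
      if (bestCost[pos] == Integer.MAX_VALUE) continue; // unreachable position
      for (Edge e : edges) {
        if (e.start() != pos) continue;
        int cost = bestCost[pos] + e.wordCost();
        if (cost < bestCost[e.end()]) {
          bestCost[e.end()] = cost;
          bestEdge[e.end()] = e; // back pointer used by the backtrace
        }
      }
    }
    // Backtrace from the end of the input to recover the minimum-cost path.
    List<Edge> path = new ArrayList<>();
    for (int pos = len; pos > 0 && bestEdge[pos] != null; pos = bestEdge[pos].start()) {
      path.add(0, bestEdge[pos]);
    }
    return path;
  }
}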

Review thread on this snippet:

* avoid an assertion error {@link RollingCharBuffer#get(int)} when it tries to generate an
* empty buffer
*/
if (endPos == lastBackTracePos) {
@mikemccand (Member):

I wonder if this was a bug fix to Kuromoji that should've been ported over to Nori?

@mocobeta (Contributor, Author):

+1 I think this should be ported to Nori.

@mikemccand (Member) left a comment:

Looks awesome, thanks @mocobeta!

@mocobeta (Contributor, Author):

@mikemccand thanks for reviewing such a huge PR!

I'll look through this again and merge it, probably next week.
@rmuir if you have any comments, I'll try to incorporate the feedback here, thanks.

@rmuir (Member) commented Apr 21, 2022

I defer to Mike; he understands this Viterbi code. I am happy to see this refactoring, thank you!

@mocobeta (Contributor, Author):

I'm merging this and will keep an eye on the nightly benchmarks (this shouldn't affect analysis performance).

@mocobeta merged commit c89f8a7 into apache:main Apr 25, 2022
@mocobeta deleted the lucene-10493-viterbi-2 branch April 25, 2022 11:09
wjp719 added a commit to wjp719/lucene that referenced this pull request Apr 28, 2022
@mocobeta (Contributor, Author):

I looked closely at the code again and found that the n-best logic can be separated from JapaneseTokenizer (though not perfectly); I opened a follow-up, #846. I think it's a safe change, and I plan to merge it to main if there are no objections.
