LUCENE-10493: factor out Viterbi algorithm and share it between kuromoji and nori #805

Merged · 16 commits merged into apache:main on Apr 25, 2022

Conversation

@mocobeta (Contributor) commented Apr 8, 2022

See also https://issues.apache.org/jira/browse/LUCENE-10493.

Main changes:

  • Add a utility class org.apache.lucene.analysis.morph.Viterbi, which implements the Viterbi algorithm for Hidden Markov Model (or Conditional Random Field) based tokenizers.
    • Its most important methods are forward() (read one character from the input and grow the lattice) and backward() (add the tokens on the minimum-cost path to the pending list that is consumed by Tokenizer#incrementToken()); see the illustrative sketch after this list.
  • Add extensions of Viterbi to Kuromoji and Nori.
    • Note: These extensions are needed because the current backtrace() in JapaneseTokenizer and KoreanTokenizer needs language-specific information. We could (and should) decouple the language-specific parts from backtrace(), but that is not a trivial refactoring, so I'd defer it for now.
    • Note: The extension for Kuromoji also performs the n-best cost path calculation. It would be great to decouple this from JapaneseTokenizer too, but I found that the current backtraceNBest() needs information that is specific to JapaneseTokenizer... I'd defer that generalization for now.
  • Add a parameterized org.apache.lucene.analysis.morph.GraphvizFormatter class to analysis-common, which is shared between Kuromoji and Nori.
    • org.apache.lucene.analysis.ja.GraphvizFormatter and org.apache.lucene.analysis.ko.GraphvizFormatter are removed.
  • Since all the segmentation work is now done by Viterbi, JapaneseTokenizer and KoreanTokenizer are facade classes that only have their constructors plus incrementToken(), close(), reset(), and end().
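
To make the forward()/backward() division of labor concrete, here is a minimal sketch of how a tokenizer could drive such a Viterbi helper and drain its pending list from incrementToken(). This is illustrative only; the class and method names below are assumptions and do not reproduce the actual org.apache.lucene.analysis.morph.Viterbi signatures.

// Illustrative sketch only -- the names and signatures here are assumptions,
// not the actual org.apache.lucene.analysis.morph.Viterbi API.
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

abstract class ViterbiSketch<T> {
  // Tokens on the minimum-cost path found so far, drained by the tokenizer.
  protected final Deque<T> pending = new ArrayDeque<>();

  // Read one character from the input and grow the lattice by one position.
  protected abstract void forward() throws IOException;

  // Backtrace the minimum-cost path and append its tokens to the pending list.
  protected abstract void backward() throws IOException;

  // True once the whole input has been consumed.
  protected abstract boolean isEnd();

  // What a facade tokenizer's incrementToken() would call: keep growing the
  // lattice until some tokens are pending or the input is exhausted.
  public final T nextToken() throws IOException {
    while (pending.isEmpty() && !isEnd()) {
      forward();
      backward();
    }
    return pending.pollFirst(); // null when no tokens remain
  }
}

With a shape like this, JapaneseTokenizer and KoreanTokenizer can stay thin facades: incrementToken() asks for the next pending token, while close()/reset()/end() mostly manage shared state.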

@mocobeta marked this pull request as ready for review April 11, 2022 11:34
@mocobeta (Contributor, Author) commented Apr 11, 2022

I took the same bottom-up approach as https://issues.apache.org/jira/browse/LUCENE-10393 here again (identify the duplicated code, then sort out the interfaces).
I'll look through this again - some private/package-private members had to be made public/protected for now. Once we further distill the backtrace() and n-best logic and move it from kuromoji/nori to analysis-common, they can become private again; I hope (and believe) that is possible, but I'd leave it for future work.

I tried but couldn't break this up into smaller patches... I'll keep it open while waiting for feedback. I hope these changes make sense.

@mocobeta (Contributor, Author):

@rmuir would you mind reviewing this, or do you think we shouldn't proceed this way?
I know it's a rather radical refactoring, but I can't come up with a better way to unify the two tokenizers, sorry... Let me know if it'd be better to stop pursuing this approach.

@rmuir (Member) commented Apr 15, 2022

Sure, sorry for the slow response.

@rmuir (Member) commented Apr 15, 2022

I think @mikemccand actually created most of this code and is most familiar with it. Mike, if you have time can you look too?

For the special n-best class, is the issue just that Nori doesn't offer the n-best feature? Looking ahead, I wonder whether there's some reason it doesn't make sense there too, or whether it was simply overlooked before.

@mikemccand (Member):

Whoa, this sounds awesome! I will try to review soon. Thanks @mocobeta.

@mocobeta (Contributor, Author):

Hi Robert and Mike, thank you for your responses. I think this can stay open for a good while - large conflicts between these changes and the main branch are unlikely.

I reviewed it several times while working on this patch, and I believe it does not change the tokenizers' behavior (although I can't fully guarantee that). I'd be glad to get feedback on the overall design, and ideas for making the interfaces safer for further refactoring or development.

@mikemccand (Member) left a comment:

Thank you @mocobeta for tackling this crazy complex code!!

Big +1 for taking a progress-not-perfection approach and factoring out the code that is clearly common into a Viterbi base class.

We can later try to go further, e.g. bringing n-best to Nori as well. I peeked at methods like backtrace and they really are quite divergent across the two, and I added a comment on one possible bug fix that we failed to cross-port. So it's awesome to take this step today, to prevent further unnecessary divergence and to enable future tokenizers to more easily reuse the Viterbi algorithm for finding the best 1 or N tokenization paths in weighted, dictionary-based tokenizers.

I love how much common code you were able to delete! And how GraphvizFormatter is also factored out :)
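
As background for the "best 1 or N paths" remark above: the core of this kind of Viterbi search is a dynamic program over a token lattice, where each candidate token carries a word cost (plus a connection cost between adjacent tokens) and every lattice position remembers the cheapest way to reach it. The sketch below is a generic, self-contained illustration of that best-1 computation; it is not Lucene code, and the lattice representation is an assumption. The n-best variant in the Kuromoji extension follows the same idea but keeps the top-N partial paths per position instead of just one.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LatticeSketch {
  // A candidate token spanning [start, end) with a dictionary word cost.
  record Edge(int start, int end, String surface, int wordCost) {}

  // Best-1 Viterbi search: find the minimum-cost sequence of edges covering
  // text positions 0..len. Connection costs between adjacent tokens are
  // omitted here; real tokenizers add them from the dictionary's connection
  // matrix using left/right ids.
  static List<Edge> bestPath(List<Edge> edges, int len) {
    int[] bestCost = new int[len + 1];
    Edge[] bestEdge = new Edge[len + 1];
    Arrays.fill(bestCost, Integer.MAX_VALUE);
    bestCost[0] = 0;
    for (int pos = 0; pos < len; pos++) {
      if (bestCost[pos] == Integer.MAX_VALUE) continue; // unreachable position
      for (Edge e : edges) {
        if (e.start() != pos) continue;
        int cost = bestCost[pos] + e.wordCost();
        if (cost < bestCost[e.end()]) {
          bestCost[e.end()] = cost;
          bestEdge[e.end()] = e; // back pointer used by the backtrace
        }
      }
    }
    // Backtrace from the end of the input to recover the minimum-cost path.
    List<Edge> path = new ArrayList<>();
    for (int pos = len; pos > 0 && bestEdge[pos] != null; pos = bestEdge[pos].start()) {
      path.add(0, bestEdge[pos]);
    }
    return path;
  }
}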

Review thread on this snippet:

* avoid an assertion error {@link RollingCharBuffer#get(int)} when it tries to generate an
* empty buffer
*/
if (endPos == lastBackTracePos) {
@mikemccand (Member):

I wonder if this was a bug fix to Kuromoji that should've been ported over to Nori?

@mocobeta (Contributor, Author):

+1 I think this should be ported to Nori.

@mikemccand (Member) left a comment:

Looks awesome, thanks @mocobeta!

@mocobeta (Contributor, Author):

@mikemccand thanks for reviewing such a huge PR!

I'll look through this again and merge it, probably next week.
@rmuir if you have any comments, I'll try to incorporate the feedback here, thanks.

@rmuir (Member) commented Apr 21, 2022

I defer to Mike; he understands this Viterbi code. I am happy to see this refactoring, thank you!

@mocobeta (Contributor, Author):

I'm merging this and will keep an eye on the nightly benchmarks (this shouldn't affect analysis performance).

@mocobeta merged commit c89f8a7 into apache:main Apr 25, 2022
@mocobeta deleted the lucene-10493-viterbi-2 branch April 25, 2022 11:09
wjp719 added a commit to wjp719/lucene that referenced this pull request Apr 28, 2022
@mocobeta (Contributor, Author):

I looked closely at the code again and found that the n-best logic can be separated from JapaneseTokenizer (though not perfectly); I opened a follow-up, #846. I think it's a safe change, and I plan to merge it to main if there are no objections.
