Can we unify the Viterbi search logic in the tokenizers of kuromoji and nori? [LUCENE-10493] #11529

We now have common dictionary interfaces for kuromoji and nori (#11429). A natural question would be: is it possible to unify the Japanese/Korean tokenizers? 

The core methods of the two tokenizers are `parse()` and `backtrace()`, which calculate the minimum-cost path by Viterbi search. I'd set the goal of this issue as factoring them out into a separate class (in analysis-common) that is shared between JapaneseTokenizer and KoreanTokenizer.
The algorithm for finding the minimum-cost path is of course language-agnostic, so I think it should be theoretically possible; the most difficult part might be the N-best path calculation, which is supported only by JapaneseTokenizer and not by KoreanTokenizer.
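
To illustrate why the search itself is language-agnostic, here is a minimal sketch of a minimum-cost path search over a token lattice, with a forward pass (roughly the role of `parse()`) and a backward pass (roughly the role of `backtrace()`). All names here (`ViterbiSketch`, `Node`, `Lexicon`, `toyLexicon`) are hypothetical and are not the actual Lucene classes or the proposed API; the only language-specific part is assumed to sit behind the `Lexicon` interface. N-best paths are not covered.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class ViterbiSketch {
  /** Lattice node: a candidate token text[start, end) plus the best path ending with it. */
  static final class Node {
    final int start, end, wordCost, contextId;
    long pathCost; // minimum total cost from the start of the text through this node
    Node prev;     // predecessor on that minimum-cost path

    Node(int start, int end, int wordCost, int contextId) {
      this.start = start;
      this.end = end;
      this.wordCost = wordCost;
      this.contextId = contextId;
    }
  }

  /** Hypothetical language-specific part: candidate lookup and connection costs. */
  interface Lexicon {
    List<Node> candidatesAt(String text, int start); // non-empty tokens starting at 'start'
    int connectionCost(int leftContextId, int rightContextId);
  }

  static List<String> segment(String text, Lexicon lexicon) {
    int n = text.length();
    List<List<Node>> endingAt = new ArrayList<>();
    for (int i = 0; i <= n; i++) endingAt.add(new ArrayList<>());
    Node bos = new Node(0, 0, 0, 0); // beginning-of-sentence marker, path cost 0
    endingAt.get(0).add(bos);

    // Forward pass: for each candidate, keep only its cheapest predecessor.
    for (int pos = 0; pos < n; pos++) {
      if (endingAt.get(pos).isEmpty()) continue; // position not reachable
      for (Node cand : lexicon.candidatesAt(text, pos)) {
        cand.pathCost = Long.MAX_VALUE;
        for (Node left : endingAt.get(pos)) {
          long cost = left.pathCost
              + lexicon.connectionCost(left.contextId, cand.contextId)
              + cand.wordCost;
          if (cost < cand.pathCost) {
            cand.pathCost = cost;
            cand.prev = left;
          }
        }
        endingAt.get(cand.end).add(cand);
      }
    }

    // Pick the cheapest node that ends exactly at the end of the text.
    Node best = null;
    for (Node node : endingAt.get(n)) {
      if (best == null || node.pathCost < best.pathCost) best = node;
    }
    if (best == null) throw new IllegalArgumentException("no path covers the whole text");

    // Backward pass: follow predecessor links, then restore left-to-right order.
    List<String> tokens = new ArrayList<>();
    for (Node node = best; node != bos; node = node.prev) {
      tokens.add(text.substring(node.start, node.end));
    }
    Collections.reverse(tokens);
    return tokens;
  }

  /** Toy lexicon for illustration only: fixed word costs, one flat connection cost. */
  static Lexicon toyLexicon(Map<String, Integer> wordCosts, int flatConnectionCost) {
    return new Lexicon() {
      @Override
      public List<Node> candidatesAt(String text, int start) {
        List<Node> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : wordCosts.entrySet()) {
          if (text.startsWith(e.getKey(), start)) {
            out.add(new Node(start, start + e.getKey().length(), e.getValue(), 0));
          }
        }
        return out;
      }

      @Override
      public int connectionCost(int leftContextId, int rightContextId) {
        return flatConnectionCost;
      }
    };
  }

  public static void main(String[] args) {
    Lexicon lexicon =
        toyLexicon(Map.of("a", 5, "b", 5, "c", 5, "ab", 3, "bc", 2, "abc", 20), 1);
    System.out.println(segment("abc", lexicon)); // prints [a, bc]
  }
}
```

In this sketch nothing in `segment` depends on Japanese or Korean; a shared class could plausibly take the dictionary and connection costs as parameters in a similar way, while an N-best extension would need to keep more than one predecessor per node.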



---
Migrated from [LUCENE-10493](https://issues.apache.org/jira/browse/LUCENE-10493) by Tomoko Uchida (@mocobeta), resolved Jul 18 2022
Sub-tasks:
 - #11533

Pull requests: https://github.com/apache/lucene/pull/793, https://github.com/apache/lucene/pull/795, https://github.com/apache/lucene/pull/801, https://github.com/apache/lucene/pull/805

