
Can we unify the viterbi search logic in the tokenizers of kuromoji and nori? [LUCENE-10493] #11529

Closed
@asfimport

Description


We now have common dictionary interfaces for kuromoji and nori (#11429). A natural question would be: is it possible to unify the Japanese/Korean tokenizers?

The core methods of the two tokenizers are parse() and backtrace(), which compute the minimum-cost path by Viterbi search. I'd set the goal of this issue to factoring them out into a separate class (in analysis-common) that is shared by JapaneseTokenizer and KoreanTokenizer.
The minimum-cost path algorithm itself is of course language-agnostic, so this should be possible in theory; the most difficult part might be the N-best path calculation, which is supported only by JapaneseTokenizer and not by KoreanTokenizer.
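
For illustration only, below is a minimal, self-contained sketch of what a language-agnostic minimum-cost lattice search looks like. The class and method names (ViterbiSketch, Candidate, backtrace) are hypothetical and are not the actual Lucene classes touched by this issue; connection costs between adjacent tokens and the incremental parse() loop are omitted for brevity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of a language-agnostic Viterbi minimum-cost path search. */
public class ViterbiSketch {

  /** A token candidate covering text positions [from, to) with a word cost. */
  record Candidate(int from, int to, int wordCost) {}

  /**
   * Computes the minimum-cost segmentation of a text of the given length from a
   * set of candidate tokens (a stand-in for dictionary lookups in the real tokenizers).
   */
  static List<Candidate> backtrace(int length, List<Candidate> candidates) {
    final int INF = Integer.MAX_VALUE / 2;
    int[] bestCost = new int[length + 1];
    Candidate[] bestCand = new Candidate[length + 1];
    for (int pos = 1; pos <= length; pos++) bestCost[pos] = INF;

    // Forward pass: relax every candidate ending at each position.
    for (int pos = 1; pos <= length; pos++) {
      for (Candidate c : candidates) {
        if (c.to() == pos && bestCost[c.from()] < INF) {
          int cost = bestCost[c.from()] + c.wordCost();
          if (cost < bestCost[pos]) {
            bestCost[pos] = cost;
            bestCand[pos] = c;
          }
        }
      }
    }

    // Backward pass: follow the best predecessors from the end of the text.
    List<Candidate> path = new ArrayList<>();
    for (int pos = length; pos > 0 && bestCand[pos] != null; pos = bestCand[pos].from()) {
      path.add(bestCand[pos]);
    }
    Collections.reverse(path);
    return path;
  }

  public static void main(String[] args) {
    // Toy lattice over a 4-character text with overlapping candidates.
    List<Candidate> candidates = List.of(
        new Candidate(0, 2, 5),
        new Candidate(0, 1, 3),
        new Candidate(1, 2, 4),
        new Candidate(2, 4, 2),
        new Candidate(2, 3, 1),
        new Candidate(3, 4, 6));
    System.out.println(backtrace(4, candidates)); // prints the minimum-cost path
  }
}
```

Nothing in this sketch depends on the language of the dictionary, which is the point of the proposed refactoring; the N-best extension would additionally keep multiple back-pointers per position.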


Migrated from LUCENE-10493 by Tomoko Uchida (@mocobeta), resolved Jul 18 2022
Sub-tasks:

Pull requests: #793, #795, #801, #805
