Description
We now have common dictionary interfaces for kuromoji and nori (#11429). A natural question would be: is it possible to unify the Japanese/Korean tokenizers?
The core methods of the two tokenizers are parse() and backtrace(), which calculate the minimum cost path by Viterbi search. I'd set the goal of this issue to factoring them out into a separate class (in analysis-common) that is shared between JapaneseTokenizer and KoreanTokenizer.
The algorithm to solve the minimum cost path itself is of course language-agnostic, so I think it should be theoretically possible; the most difficult part here might be the N-best path calculation - which is supported only by JapaneseTokenizer and not by KoreanTokenizer.
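To make the language-agnostic part concrete, here is a minimal sketch of a minimum cost path search over a token lattice, split into a forward pass (roughly the role of parse()) and a backward pass (roughly the role of backtrace()). Everything in it is an assumption made for illustration: the names `ViterbiSketch`, `Token`, and `minCostPath` are hypothetical and are not the Lucene API, the whole lattice is given up front, and the connection cost is passed in as a plain function. It leaves out what the real tokenizers handle beyond the core recurrence, such as incremental input, unknown words, user dictionaries, and N-best.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToIntBiFunction;

/** Hypothetical sketch of a shared Viterbi search; not the actual Lucene classes. */
final class ViterbiSketch {

  /** A candidate token covering text[start, end), with a dictionary word cost. */
  static final class Token {
    final int start;
    final int end;
    final int wordCost;

    Token(int start, int end, int wordCost) {
      if (start < 0 || end <= start) {
        throw new IllegalArgumentException("token must be non-empty");
      }
      this.start = start;
      this.end = end;
      this.wordCost = wordCost;
    }
  }

  /**
   * Returns the cheapest sequence of tokens that exactly covers [0, length),
   * or an empty list if the candidates cannot cover the text.
   */
  static List<Token> minCostPath(
      int length, List<Token> candidates, ToIntBiFunction<Token, Token> connectionCost) {
    // Sort by start offset so every possible predecessor is finalized before its successors.
    List<Token> tokens = new ArrayList<>(candidates);
    tokens.sort(Comparator.comparingInt((Token t) -> t.start));

    int n = tokens.size();
    long[] best = new long[n]; // minimum total cost of any path ending in token i
    int[] prev = new int[n];   // predecessor of token i on that path, or -1
    List<List<Integer>> endingAt = new ArrayList<>();
    for (int i = 0; i <= length; i++) {
      endingAt.add(new ArrayList<>());
    }

    // Forward pass: relax each token against all tokens that end where it starts.
    for (int i = 0; i < n; i++) {
      Token t = tokens.get(i);
      if (t.end > length) {
        throw new IllegalArgumentException("token extends past the text");
      }
      best[i] = (t.start == 0) ? t.wordCost : Long.MAX_VALUE;
      prev[i] = -1;
      for (int j : endingAt.get(t.start)) {
        if (best[j] == Long.MAX_VALUE) {
          continue; // predecessor is unreachable from the start of the text
        }
        long cost = best[j] + connectionCost.applyAsInt(tokens.get(j), t) + t.wordCost;
        if (cost < best[i]) {
          best[i] = cost;
          prev[i] = j;
        }
      }
      endingAt.get(t.end).add(i);
    }

    // Backward pass: pick the cheapest token ending at the end of the text and follow prev links.
    int last = -1;
    for (int j : endingAt.get(length)) {
      if (best[j] != Long.MAX_VALUE && (last < 0 || best[j] < best[last])) {
        last = j;
      }
    }
    List<Token> path = new ArrayList<>();
    for (int i = last; i >= 0; i = prev[i]) {
      path.add(tokens.get(i));
    }
    Collections.reverse(path);
    return path;
  }
}
```

Nothing in this recurrence depends on the language; only the candidate tokens and the connection cost function do, which is what would make a shared class plausible. Keeping a single best predecessor per token is also why N-best is the harder part: it needs more than one back pointer per node, or a second search over the finished lattice.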
Migrated from LUCENE-10493 by Tomoko Uchida (@mocobeta), resolved Jul 18 2022
Sub-tasks: