
RFC: allow flexible or better binarization #3083

Open
@bertsky

Description


Tesseract has always included its own internal binarization – which is not based on Leptonica and is of rather poor quality (a custom global Otsu implementation without normalization). Leptonica does have lots of nice adaptive local normalization and thresholding implementations, but they are not utilized.
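For illustration only – Tesseract's actual thresholder is C++ and differs in detail – a generic global Otsu picks the single threshold that maximizes between-class variance of the grey histogram, with no local adaptation at all (this sketch is mine, not Tesseract's code):

```python
def otsu_threshold(grey):
    """Global Otsu: pick the threshold maximizing between-class
    variance of the grey-level histogram. `grey` is a flat iterable
    of 8-bit values. A sketch of the classic algorithm, not
    Tesseract's actual implementation."""
    hist = [0] * 256
    for v in grey:
        hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    sum_bg = 0.0   # running sum of grey values in the background class
    w_bg = 0       # running pixel count of the background class
    best_t, best_var = 0, -1.0
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

One global threshold for the whole page is exactly why this fails on uneven illumination or bleed-through: a value that separates ink from paper in one region smears both together in another.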

Since Tesseract 4.0, recognition does not (under normal circumstances) use that binarized image, but uses the greyscale-converted raw image. If the input image was already bitonal, then it still works. However, what the LSTM model expects depends on what data it was trained on. (Original tessdata models were all trained on artificial/clean greyscale IIRC.)

But the binarized image is still needed for all segmentation and layout analysis (OSD, separator detection, picture detection). And to make matters worse, segmentation also makes use of the greyscale image and threshold values from the binarizer – in order to get a better approximation/interpolation of blob outlines (ComputeEdgeOffsets). If the input image was already bitonal, then a fallback method is used which is not as accurate (ComputeBinaryOffsets).
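The idea behind the grey-based edge refinement can be illustrated in one dimension: given an inside (ink) pixel and an outside (paper) pixel, the true edge position is estimated by interpolating the grey profile against the threshold. This is a hypothetical simplification of what ComputeEdgeOffsets does, with names and details of my own, not Tesseract's code:

```python
def edge_offset(grey_inside, grey_outside, threshold):
    """Estimate the subpixel position of an edge between an inside
    (darker) and an outside (lighter) pixel by linear interpolation
    of the grey profile against the binarization threshold. Returns
    a fraction in [0, 1] measured from the inside pixel. A 1-D
    illustration of the idea, not Tesseract's actual implementation."""
    if grey_outside == grey_inside:
        return 0.5  # flat profile: no information, assume midpoint
    frac = (threshold - grey_inside) / (grey_outside - grey_inside)
    return min(1.0, max(0.0, frac))  # clamp to the pixel pair
```

With bitonal input there is no grey profile to interpolate, so only the coarser fallback from the binary neighbourhood (ComputeBinaryOffsets) remains – hence the accuracy loss on pre-binarized images.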

Now the user (both CLI and API) is in a dilemma:

  1. Present the original/raw image:
    • Segmentation on some images may be suboptimal, because good binarization is hard.
    • Recognition might expect a (colour/contrast-) normalized or even binarized (e.g. from tesstrain) image and thus be suboptimal.
  2. Present an externally binarized image:
    • Segmentation on other images may be suboptimal, because the blob outlines are inaccurate.
    • Recognition might expect a greyscale image and thus be suboptimal.

So what do we do? Allow delegating to Leptonica's pixContrastNorm, pixSauvolaBinarizeTiled, etc. via a parameter variable? Or extend the API to allow passing the threshold values of an external binarization? How do we inform/document this and encapsulate an LSTM model's expectations?
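To make the adaptive alternative concrete: Sauvola thresholds each pixel against T = m · (1 + k · (s/R − 1)), where m and s are the local mean and standard deviation. The following is a naive plain-Python sketch of that formula – Leptonica's pixSauvolaBinarizeTiled computes it far more efficiently over tiles in C, and the parameter names here are mine:

```python
def sauvola_binarize(img, whsize=15, k=0.35, R=128.0):
    """Naive Sauvola adaptive binarization: for each pixel, compute
    local mean m and stddev s over a (2*whsize+1)^2 window clipped
    to the image, then threshold at T = m * (1 + k * (s/R - 1)).
    `img` is a list of rows of 8-bit grey values; returns rows of
    0 (ink) / 1 (background). A sketch of the formula behind
    Leptonica's pixSauvolaBinarizeTiled, not its implementation."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - whsize), min(h, y + whsize + 1))
                    for i in range(max(0, x - whsize), min(w, x + whsize + 1))]
            m = sum(vals) / len(vals)
            s = (sum((v - m) ** 2 for v in vals) / len(vals)) ** 0.5
            t = m * (1.0 + k * (s / R - 1.0))
            row.append(0 if img[y][x] < t else 1)
        out.append(row)
    return out
```

Because the threshold follows the local statistics, dark text survives on locally dark backgrounds where a single global Otsu value would not. The per-pixel threshold image such a method produces is also exactly the kind of data that an extended API could accept from an external binarizer.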

There are many more aspects, but I just wanted to open the discussion.

If requested, I can provide example images (of bad segmentation due to internal Otsu or bad segmentation due to inaccurate interpolation from bitonal input; of recognition perplexed by background masked to white), as well as pointers to code.
