Environment
- Tesseract Version: Various, 4.1.1, 5.0.0 v20201231
- Platform: Linux, 64 bit
Current Behavior:
In some cases, Tesseract's fully automatic page segmentation does not pick up page numbers that are clearly visible. Here is an example (as shown in an hOCR result viewer):
I have taken the liberty of hosting some images with this problem here (and can attempt to surface more if that is helpful):
https://archive.org/~merlijn/tesseract-pagenumbers/
I don't believe that the problem is related to binarisation. I've tried the Java viewer (https://tesseract-ocr.github.io/tessdoc/ViewerDebugging.html) to look at the results, but that didn't make me much wiser, as it simply shows the page numbers not being picked up.
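For anyone who wants to check affected pages without the viewer, here is a minimal sketch that scans an hOCR file (for example one produced with `tesseract page.png out hocr`) and lists the words that fall in the top or bottom margin of the page. The helper name `words_in_margins` and the 10% margin are my own assumptions for illustration, not anything Tesseract defines; an empty result on a page with a visible page number would point to the problem described above.

```python
import re
from html.parser import HTMLParser

# hOCR stores coordinates in the title attribute as "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect the page bbox and a (text, bbox) pair for each ocrx_word."""

    def __init__(self):
        super().__init__()
        self.page_bbox = None
        self.words = []
        self._bbox = None   # bbox of the word currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if not match:
            return
        bbox = tuple(int(v) for v in match.groups())
        if "ocr_page" in classes:
            self.page_bbox = bbox
        elif "ocrx_word" in classes:
            self._bbox = bbox
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def words_in_margins(hocr, fraction=0.1):
    """Return the text of words lying fully inside the top or bottom margin.

    `fraction` is the margin height relative to the page height
    (hypothetical default of 10%, adjust per collection).
    """
    parser = HocrWords()
    parser.feed(hocr)
    if parser.page_bbox is None:
        raise ValueError("no ocr_page element with a bbox found")
    _, top, _, bottom = parser.page_bbox
    band = (bottom - top) * fraction
    return [
        text
        for text, (_x0, y0, _x1, y1) in parser.words
        if y1 <= top + band or y0 >= bottom - band
    ]
```

Calling `words_in_margins(open("out.hocr", encoding="utf-8").read())` on a page and comparing the result against the scan should show quickly whether the page number made it into the output at all.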
Expected Behavior:
Tesseract picks up the page number as well.