Description
Current Behavior
When the tesseract binary is run with certain combinations of language packs (for example chi_sim+mrz or eng+ara+ocrb_int), it segfaults. The crash location differs between combinations, as the two gdb sessions below show.
root@78ce6cbf2cc5:/opt/tesseract/bin# gdb ./tesseract
GNU gdb (Ubuntu 14.0.50.20230907-0ubuntu1) 14.0.50.20230907-git
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./tesseract...
(No debugging symbols found in ./tesseract)
(gdb) run sample_012631.jpg - --tessdata-dir <snip>/tessdata/ -l chi_sim+mrz
Starting program: /opt/tesseract/bin/tesseract sample_012631.jpg - --tessdata-dir <snip>/tessdata/ -l chi_sim+mrz
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Estimating resolution as 548
Detected 108 diacritics
Program received signal SIGSEGV, Segmentation fault.
0x00005619906db4e8 in tesseract::ELIST_ITERATOR::forward() ()
(gdb) bt
#0 0x00005619906db4e8 in tesseract::ELIST_ITERATOR::forward() ()
#1 0x00005619906bc1ff in tesseract::PAGE_RES_IT::ReplaceCurrentWord(tesseract::PointerVector<tesseract::WERD_RES>*) ()
#2 0x000056199074ea84 in tesseract::Tesseract::classify_word_and_language(int, tesseract::PAGE_RES_IT*, tesseract::WordData*) ()
#3 0x0000561990752adc in tesseract::Tesseract::RecogAllWordsPassN(int, tesseract::ETEXT_DESC*, tesseract::PAGE_RES_IT*, std::vector<tesseract::WordData, std::allocator<tesseract::WordData> >*) ()
#4 0x00005619907537ba in tesseract::Tesseract::recog_all_words(tesseract::PAGE_RES*, tesseract::ETEXT_DESC*, tesseract::TBOX const*, char const*, int) ()
#5 0x0000561990719b46 in tesseract::TessBaseAPI::Recognize(tesseract::ETEXT_DESC*) ()
#6 0x000056199071a3d2 in tesseract::TessBaseAPI::ProcessPage(Pix*, int, char const*, char const*, int, tesseract::TessResultRenderer*) ()
#7 0x000056199071b4ea in tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) ()
#8 0x000056199071bb92 in tesseract::TessBaseAPI::ProcessPages(char const*, char const*, int, tesseract::TessResultRenderer*) ()
#9 0x00005619906a6331 in main ()
root@78ce6cbf2cc5:/opt/tesseract/bin# gdb ./tesseract
Reading symbols from ./tesseract...
(No debugging symbols found in ./tesseract)
(gdb) run sample_000111.jpg - --tessdata-dir <snip>/tessdata/ -l eng+ara+ocrb_int
Starting program: /opt/tesseract/bin/tesseract sample_000111.jpg - --tessdata-dir <snip>/tessdata/ -l eng+ara+ocrb_int
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Estimating resolution as 975
Program received signal SIGSEGV, Segmentation fault.
0x000055e53cf153c6 in tesseract::PAGE_RES_IT::DeleteCurrentWord() ()
(gdb) bt
#0 0x000055e53cf153c6 in tesseract::PAGE_RES_IT::DeleteCurrentWord() ()
#1 0x000055e53cfaf5a8 in tesseract::Tesseract::recog_all_words(tesseract::PAGE_RES*, tesseract::ETEXT_DESC*, tesseract::TBOX const*, char const*, int) ()
#2 0x000055e53cf75b46 in tesseract::TessBaseAPI::Recognize(tesseract::ETEXT_DESC*) ()
#3 0x000055e53cf763d2 in tesseract::TessBaseAPI::ProcessPage(Pix*, int, char const*, char const*, int, tesseract::TessResultRenderer*) ()
#4 0x000055e53cf774ea in tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) ()
#5 0x000055e53cf77b92 in tesseract::TessBaseAPI::ProcessPages(char const*, char const*, int, tesseract::TessResultRenderer*) ()
#6 0x000055e53cf02331 in main ()
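To check both combinations without stepping through gdb each time, a small script along these lines may help. It is only a sketch: the TESSDATA_DIR environment variable and the helper names are made up for illustration (the real tessdata path is elided in this report), and it assumes it is run from the directory containing the tesseract binary and the sample images. It relies on the POSIX convention that Python's subprocess module reports a signal-terminated child with a negative return code.

```python
import os
import signal
import subprocess

# Image/language pairs taken from the two gdb sessions in this report.
CASES = [
    ("sample_012631.jpg", "chi_sim+mrz"),
    ("sample_000111.jpg", "eng+ara+ocrb_int"),
]


def build_command(image, langs, tessdata_dir, binary="./tesseract"):
    """Return the argv used in the report: OCR `image` to stdout with `langs`."""
    return [binary, image, "-", "--tessdata-dir", tessdata_dir, "-l", langs]


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable label."""
    if returncode < 0:
        # On POSIX, a negative return code means the child died from that signal.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    if returncode == 0:
        return "exited normally"
    return f"exited with status {returncode}"


def run_case(image, langs, tessdata_dir):
    """Run one combination and report how the process ended."""
    cmd = build_command(image, langs, tessdata_dir)
    try:
        proc = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except OSError as exc:
        # Covers a missing or non-executable binary.
        return f"could not start {cmd[0]}: {exc}"
    return describe_exit(proc.returncode)


if __name__ == "__main__":
    # TESSDATA_DIR is a hypothetical variable; the real path is elided in the report.
    tessdata_dir = os.environ.get("TESSDATA_DIR", "tessdata")
    for image, langs in CASES:
        print(f"{langs}: {run_case(image, langs, tessdata_dir)}")
```

A combination that hits the crash should be reported as killed by SIGSEGV. Adding further language combinations to CASES would make it easier to narrow down which packs trigger the fault.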
Expected Behavior
No segfaults.
Suggested Fix
No response
tesseract -v
root@78ce6cbf2cc5:/opt/tesseract/bin# ./tesseract -v
tesseract 5.3.3-1-gdc22
leptonica-1.82.0
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.2) : libpng 1.6.40 : libtiff 4.5.1 : zlib 1.2.13 : libwebp 1.2.4 : libopenjp2 2.5.0
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found libarchive 3.6.2 zlib/1.2.13 liblzma/5.4.0 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.2
Found libcurl/8.2.1 OpenSSL/3.0.10 zlib/1.2.13 brotli/1.0.9 zstd/1.5.5 libidn2/2.3.4 libpsl/0.21.2 (+libidn2/2.3.3) libssh/0.10.5/openssl/zlib nghttp2/1.55.1 librtmp/2.3 OpenLDAP/2.6.6
Operating System
No response
Other Operating System
Ubuntu 23.10
uname -a
root@78ce6cbf2cc5:/opt/tesseract/bin# uname -a
Linux 78ce6cbf2cc5 5.4.0-150-generic #167~18.04.1-Ubuntu SMP Wed May 24 00:51:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Compiler
GCC 13.2.0
CPU
Intel Xeon CPU E5-2609 v4 @ 1.70GHz x16
Virtualization / Containers
I'm building and running tesseract inside a Docker (v24.0.2) container; the container runs the OS and compiler versions listed above.
Other Information
No response