Skip to content

Modernize datadir #4372

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: main
Choose a base branch
from
Open

Conversation

zdenop
Copy link
Contributor

@zdenop zdenop commented Dec 18, 2024

  • use std::filesystem::path instead of std::string for datadir
  • add warning if datadir is not directory or does not exists

Copy link
Contributor

@egorpugin egorpugin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably all is_directory() checks are unnecessary. User will just get any other random error if he points to a file instead of to a dir.

@zdenop
Copy link
Contributor Author

zdenop commented Dec 19, 2024

@stweil : Thank you for review. Unittests needs be fixed yet.

@egorpugin
Copy link
Contributor

Old tprintfs should be removed slowly in favor of std::format.

Signed-off-by: Stefan Weil <[email protected]>
@stweil
Copy link
Member

stweil commented Dec 19, 2024

Old tprintfs should be removed slowly in favor of std::format.

Isn't the tesserr stream even better?

@egorpugin
Copy link
Contributor

Streams are better than tprintf.
But when using them a lot soon you will notice that breaking strings for any insertion is quiet disturbing.

"we add two numbers: " << num1 << " + " << num2 << " = " << num1+num2;

compared to

std::format("we add two numbers: {} + {} = {}", num1, num2, num1+num2);

Having even more insertion points the original string can be hardly readable compared to a string with {} placeholders.

@egorpugin
Copy link
Contributor

Isn't the tesserr stream even better?

I mean with std::format() the usage will be something like tesserr << std::format("format string", args ... );.

@egorpugin
Copy link
Contributor

I've slightly updated tprintf() to C++ style.

8cb0418

@stweil
Copy link
Member

stweil commented Dec 19, 2024

I've slightly updated tprintf() to C++ style.

Now I get numerous compiler warnings:

../../../src/ccutil/tprintf.h:37:33: warning: format string is not a string literal (potentially insecure) [-Wformat-security]

And the code size is larger, too.

@egorpugin
Copy link
Contributor

  1. Warning should be silenced. Updated code is completely fine and valid. We should remove -Wformat-security flag if set or set -Wno-format-security.
  2. What is code size change? In general this should be ignored. But I'm curious what the difference is. Probably 5 bytes for each get_debugfp() and some pre- and post call setups.

@egorpugin
Copy link
Contributor

Can you try changing it to

template <typename ... Types>
auto tprintf(const char *format, Types && ... args) {
  return fprintf(get_debugfp(), format, std::forward<Types>(args)...);
}

?

@stweil
Copy link
Member

stweil commented Dec 19, 2024

Can you try changing it to [...]

The suggested change still produces the compiler warnings, but I think it improves the code because like that calling tprintf() without any argument would be caught as an error.

@egorpugin
Copy link
Contributor

It will be an error too if we call tprintf() even without const char *format.

1>tesseract\src\ccutil\tprintf.h(37,10): error C2660: 'fprintf': function does not take 1 arguments
1>(compiling source file '../../../../../src/ccutil/tprintf.cpp')
1>    C:\Program Files (x86)\Windows Kits\10\Include\10.0.22621.0\ucrt\stdio.h(830,37):
1>    see declaration of 'fprintf'
1>    tesseract\src\ccutil\tprintf.h(37,10):
1>    while trying to match the argument list '(FILE *)'
1>    tesseract\src\ccutil\tprintf.h(37,10):
1>    the template instantiation context (the oldest one first) is
1>        tesseract\src\ccutil\tprintf.cpp(74,3):
1>        see reference to function template instantiation 'auto tesseract::tprintf<>(void)' being compiled

@egorpugin
Copy link
Contributor

Added format in d95e9f7

I'll check ci for other warnings.

@egorpugin
Copy link
Contributor

I don't see warnings on gcc-14 (even without format).
Only with clang.

Possible solution is

auto tprintf(const char *format, Types && ... args) {
#pragma clang diagnostic push
#pragma clang diagnostic ignored "-Wformat-security"
  return fprintf(get_debugfp(), format, std::forward<Types>(args)...);
#pragma clang diagnostic pop
}```

@zdenop
Copy link
Contributor Author

zdenop commented Dec 21, 2024

Seems like unittests GA are failing on the main (too), so it is not related to this PR:
image
Did I understand it correctly?

@zdenop
Copy link
Contributor Author

zdenop commented Dec 27, 2024

@egorpugin : I cherry-picked your commits regarding tprintf to my local branch it does not work for 'tessdata_path.string()'.
E.g. it produces output like Error opening data file ��]� for tesseract a b -l x (Win11 VS2022 64bit).
When I use tessdata_path.string().c_str() output is correct
Error opening data file F:\Projects\Community\tessdata\x.traineddata....

@egorpugin
Copy link
Contributor

How do you use it?

@zdenop
Copy link
Contributor Author

zdenop commented Dec 28, 2024

I am sorry, it looks like I messed something up... Today I started from clean setup and it worked for me:

git pull
git fetch origin pull/4372/head:pr4372
git switch pr4372
git cherry-pick 8cb04183c1953f988
git cherry-pick d95e9f7905cc9427d
git cherry-pick 2a944fbe98ed4408a

./autogen.sh && ./configure --prefix=/usr && make -j4 && make -j4 training
sudo make install && sudo make training-install

@zdenop
Copy link
Contributor Author

zdenop commented Dec 28, 2024

I looks like unittest fails if api_.Init uses tesseract::OEM_TESSERACT_ONLY but I am not able to find reason why (well api_. returns empty string, but why?)

If I made simplified test based on unittest/pagesegmode_test.cc it work for me. So where is problem? (gtest is installed from unittest/third_party/googletest)

/* g++ ocr_gtest.cpp -o ocr_gtest -lleptonica -ltesseract -lgtest
*/
##include <gtest/gtest.h>
#include <leptonica/allheaders.h>
#include <tesseract/baseapi.h>

TEST(OCRTest, ReadImage) {
    tesseract::TessBaseAPI api_;
    std::string TessdataPath = "/opt/Projects/tessdata";
    std::string filename = "test/testing/segmodeimg.tif";

    Pix *image = pixRead(filename.c_str());
    ASSERT_NE(image, nullptr) << "Failed to read image";

    ASSERT_EQ(api_.Init(TessdataPath.c_str(), "eng"), 
              tesseract::OEM_TESSERACT_ONLY) << "Failed to initialize Tesseract";

    api_.SetPageSegMode(tesseract::PSM_SINGLE_WORD);
    api_.SetImage(image);
    api_.SetRectangle(237, 393, 256, 36);
    char *ocr_text = api_.GetUTF8Text();
    ASSERT_NE(ocr_text, nullptr) << "OCR returned null";
    
    std::string ocr_output(ocr_text);
    printf("OCR output:\n'%s'\n", ocr_output.c_str());

    EXPECT_EQ(ocr_output, "What should\n");

    delete[] ocr_text; // Free OCR text
    api_.End(); // Cleanup Tesseract API
    pixDestroy(&image); // Cleanup image
}

int main(int argc, char** argv) {
    ::testing::InitGoogleTest(&argc, argv);
    return RUN_ALL_TESTS();
}

@egorpugin
Copy link
Contributor

Ignore unit test errors, they can be fixed after.

@@ -146,6 +146,9 @@ static void ExtractFontName(const char* filename, std::string* fontname) {
*/
static void addAvailableLanguages(const std::string &datadir,
std::vector<std::string> *langs) {
if (!std::filesystem::is_directory(datadir))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still want this check? My impression is that we agreed to prefer reporting the exception which is thrown without this conditional code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If throwing an exception is the agreed-upon approach, then this check is no longer necessary.

Copy link
Member

@stweil stweil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot merge this PR as long as several CI tests fail. Some of them seem to be related to layout recognition (for example layout_test) and could indicate a severe regression.

*/
void CCUtil::main_setup(const std::string &argv0, const std::string &basename) {
imagebasename = basename; /**< name of image */
datadir = find_data_path(argv0);
Copy link
Member

@stweil stweil May 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old datadir was always terminated with a '/' which is missing for the new datadir. This is an API change, but it might also be it is not the reason for the failing CI tests.

tprintf("Error opening data file %s\n", tessdata_path.c_str());
tprintf(
if (!mgr->is_loaded() && !mgr->Init(tessdata_path.string().c_str())) {
tesserr << "Error opening data file " << tessdata_path.string() << '\n' <<
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
tesserr << "Error opening data file " << tessdata_path.string() << '\n' <<
tesserr << "Error opening data file " << tessdata_path << '\n' <<

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the difference in path formatting between using tesseract.exe a b -l c with tessdata_path.string() versus tessdata_path:

  • tessdata_path.string() produces a native OS path (e.g., on Windows):
Error opening data file X:\Projects\tesseract\c.traineddata
  • tessdata_path produces an escaped path:
Error opening data file "X:\\Projects\\tesseract\\c.traineddata"

I believe it’s more user-friendly and intuitive to display the native OS path. Is there a specific reason we should use the escaped version instead? If not, I recommend sticking with the native path for error messages.

"not present in %s!!\n",
tessdata_path.c_str());
tesserr << "Error: Tesseract (legacy) engine requested, but components are "
"not present in " << tessdata_path.string() << "!!\n";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"not present in " << tessdata_path.string() << "!!\n";
"not present in " << tessdata_path << "!!\n";

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as above: I recommend sticking with the native path for error messages

*/
void CCUtil::main_setup(const std::string &argv0, const std::string &basename) {
imagebasename = basename; /**< name of image */
datadir = find_data_path(argv0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
datadir = find_data_path(argv0);
datadir = find_data_path(argv0) / "";

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, datadir was a string, and appending a / was used as a workaround to handle cases where users omitted the directory separator.
However, with datadir now being a std::filesystem::path, this manual string manipulation is no longer necessary. The path class automatically normalizes path separators and handles path operations (like concatenation) correctly for the underlying OS.
As a result, we can safely remove the workaround and rely on the standard library to manage path separators and concatenation.
Or do I miss something?

if (!mgr->is_loaded() && !mgr->Init(tessdata_path.c_str())) {
tprintf("Error opening data file %s\n", tessdata_path.c_str());
tprintf(
if (!mgr->is_loaded() && !mgr->Init(tessdata_path.string().c_str())) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (!mgr->is_loaded() && !mgr->Init(tessdata_path.string().c_str())) {
if (!mgr->is_loaded() && !mgr->Init(tessdata_path.c_str())) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change would broke msvc build:
error C2664: 'bool tesseract::TessdataManager::Init(const char *)': cannot convert argument 1 from 'const std: :filesystem::path::value_type *' to 'const char *'

@@ -81,17 +83,13 @@ bool Tesseract::init_tesseract_lang_data(const std::string &arg0,
bool set_only_non_debug_params, TessdataManager *mgr) {
// Set the language data path prefix
lang = !language.empty() ? language : "eng";
language_data_path_prefix = datadir;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

language_data_path_prefix is a class variable which is no longer set in the new code. This causes the CI failures.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Fixed 2a296fa

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants