-
Notifications
You must be signed in to change notification settings - Fork 9.9k
Modernize datadir #4372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Modernize datadir #4372
Conversation
zdenop
commented
Dec 18, 2024
- use std::filesystem::path instead of std::string for datadir
- add warning if datadir is not directory or does not exists
…g because it points at a temporary instance that was destroyed.`)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably all is_directory() checks are unnecessary. User will just get any other random error if he points to a file instead of to a dir.
@stweil : Thank you for review. Unittests needs be fixed yet. |
Old tprintfs should be removed slowly in favor of std::format. |
Signed-off-by: Stefan Weil <[email protected]>
Isn't the tesserr stream even better? |
Streams are better than tprintf.
compared to
Having even more insertion points the original string can be hardly readable compared to a string with |
I mean with |
I've slightly updated |
Now I get numerous compiler warnings:
And the code size is larger, too. |
|
Can you try changing it to
? |
The suggested change still produces the compiler warnings, but I think it improves the code because like that calling |
It will be an error too if we call
|
Added format in d95e9f7 I'll check ci for other warnings. |
I don't see warnings on gcc-14 (even without format). Possible solution is
|
@egorpugin : I cherry-picked your commits regarding tprintf to my local branch it does not work for 'tessdata_path.string()'. |
How do you use it? |
I am sorry, it looks like I messed something up... Today I started from clean setup and it worked for me:
|
I looks like unittest fails if If I made simplified test based on
|
Ignore unit test errors, they can be fixed after. |
src/api/baseapi.cpp
Outdated
@@ -146,6 +146,9 @@ static void ExtractFontName(const char* filename, std::string* fontname) { | |||
*/ | |||
static void addAvailableLanguages(const std::string &datadir, | |||
std::vector<std::string> *langs) { | |||
if (!std::filesystem::is_directory(datadir)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still want this check? My impression is that we agreed to prefer reporting the exception which is thrown without this conditional code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If throwing an exception is the agreed-upon approach, then this check is no longer necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot merge this PR as long as several CI tests fail. Some of them seem to be related to layout recognition (for example layout_test) and could indicate a severe regression.
*/ | ||
void CCUtil::main_setup(const std::string &argv0, const std::string &basename) { | ||
imagebasename = basename; /**< name of image */ | ||
datadir = find_data_path(argv0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The old datadir
was always terminated with a '/' which is missing for the new datadir
. This is an API change, but it might also be it is not the reason for the failing CI tests.
tprintf("Error opening data file %s\n", tessdata_path.c_str()); | ||
tprintf( | ||
if (!mgr->is_loaded() && !mgr->Init(tessdata_path.string().c_str())) { | ||
tesserr << "Error opening data file " << tessdata_path.string() << '\n' << |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tesserr << "Error opening data file " << tessdata_path.string() << '\n' << | |
tesserr << "Error opening data file " << tessdata_path << '\n' << |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the difference in path formatting between using tesseract.exe a b -l c
with tessdata_path.string()
versus tessdata_path
:
tessdata_path.string()
produces a native OS path (e.g., on Windows):
Error opening data file X:\Projects\tesseract\c.traineddata
tessdata_path
produces an escaped path:
Error opening data file "X:\\Projects\\tesseract\\c.traineddata"
I believe it’s more user-friendly and intuitive to display the native OS path. Is there a specific reason we should use the escaped version instead? If not, I recommend sticking with the native path for error messages.
"not present in %s!!\n", | ||
tessdata_path.c_str()); | ||
tesserr << "Error: Tesseract (legacy) engine requested, but components are " | ||
"not present in " << tessdata_path.string() << "!!\n"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"not present in " << tessdata_path.string() << "!!\n"; | |
"not present in " << tessdata_path << "!!\n"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same as above: I recommend sticking with the native path for error messages
*/ | ||
void CCUtil::main_setup(const std::string &argv0, const std::string &basename) { | ||
imagebasename = basename; /**< name of image */ | ||
datadir = find_data_path(argv0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
datadir = find_data_path(argv0); | |
datadir = find_data_path(argv0) / ""; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, datadir
was a string
, and appending a /
was used as a workaround to handle cases where users omitted the directory separator.
However, with datadir
now being a std::filesystem::path
, this manual string manipulation is no longer necessary. The path class automatically normalizes path separators and handles path operations (like concatenation) correctly for the underlying OS.
As a result, we can safely remove the workaround and rely on the standard library to manage path separators and concatenation.
Or do I miss something?
if (!mgr->is_loaded() && !mgr->Init(tessdata_path.c_str())) { | ||
tprintf("Error opening data file %s\n", tessdata_path.c_str()); | ||
tprintf( | ||
if (!mgr->is_loaded() && !mgr->Init(tessdata_path.string().c_str())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (!mgr->is_loaded() && !mgr->Init(tessdata_path.string().c_str())) { | |
if (!mgr->is_loaded() && !mgr->Init(tessdata_path.c_str())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change would broke msvc build:
error C2664: 'bool tesseract::TessdataManager::Init(const char *)': cannot convert argument 1 from 'const std: :filesystem::path::value_type *' to 'const char *'
@@ -81,17 +83,13 @@ bool Tesseract::init_tesseract_lang_data(const std::string &arg0, | |||
bool set_only_non_debug_params, TessdataManager *mgr) { | |||
// Set the language data path prefix | |||
lang = !language.empty() ? language : "eng"; | |||
language_data_path_prefix = datadir; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
language_data_path_prefix
is a class variable which is no longer set in the new code. This causes the CI failures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Fixed 2a296fa