Use Simdjson ondemand parser instead of DOM parser #3878
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #3878 +/- ##
=======================================
Coverage ? 62.72%
=======================================
Files ? 299
Lines ? 36721
Branches ? 2756
=======================================
Hits ? 23033
Misses ? 13636
Partials ? 52
first non-working "naive" conversion, compiles but never ends and allocates more than 11Gb of memory
…ingthe info field in json properly
8d32c21 to 3479ad2
// BEWARE:
// We use below `simdjson`'s "on-demand" parser, which does not tolerate reading the same
// value more than once. This means we need to make sure that the objects and their fields
// are read and/or concretized only once and if we need to use them more than once we need
// to persist them in local memory. This is why the code below tries hard to pre-read the
// data needed in several parts of the computing in a way that prevents jumping up and down
// the hierarchy of json objects. When this rule is not followed, the parsing might end
// earlier than expected or might skip data that are read when they shouldn't be, leading to
// *runtime issues* that might not be visible at first. Because of these reasons, be careful
// when modifying the following parsing code.
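To make the constraint in that comment concrete, here is a minimal sketch of the "read once, persist locally" pattern. It deliberately does not use `simdjson`: `ForwardCursor`, `PackageInfo` and `read_once` are hypothetical names invented for this illustration, and the cursor is only a simplified stand-in for a forward-only parser, not a model of `simdjson`'s actual behaviour.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical stand-in for a forward-only parser: NOT simdjson's API.
// Fields can only be visited in document order, and each one only once.
class ForwardCursor
{
public:

    using Field = std::pair<std::string, std::string>;

    explicit ForwardCursor(std::vector<Field> fields)
        : m_fields(std::move(fields))
    {
    }

    // Returns the next unread field, or nothing once the input is exhausted.
    std::optional<Field> next()
    {
        if (m_pos >= m_fields.size())
        {
            return std::nullopt;
        }
        return m_fields[m_pos++];
    }

    // Scans forward for `key`; every field skipped on the way is lost for good.
    std::optional<std::string> find(std::string_view key)
    {
        while (auto field = next())
        {
            if (field->first == key)
            {
                return field->second;
            }
        }
        return std::nullopt;
    }

private:

    std::vector<Field> m_fields;
    std::size_t m_pos = 0;
};

// Values copied into local memory so later code can reuse them freely.
struct PackageInfo
{
    std::string name;
    std::string version;
};

// Single pass: every field is read exactly once and persisted.
PackageInfo read_once(ForwardCursor cursor)
{
    PackageInfo info;
    while (auto field = cursor.next())
    {
        if (field->first == "name")
        {
            info.name = field->second;
        }
        else if (field->first == "version")
        {
            info.version = field->second;
        }
    }
    return info;
}
```

With this toy cursor, looking up `"version"` first and `"name"` afterwards silently loses `"name"`, which is the kind of skipped data the comment warns about. `read_once` avoids the problem by walking the fields once and keeping copies.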
Thank you for mentioning this.
Running `main`'s `test_libmamba` executable currently allocates (but doesn't use) more than 5 GB when parsing 320 MB JSON files. This PR reduces this to around 2.5 GB. (Observed using Visual Studio, in both Release and RelWithDebugInfo; also got some confirmation that this is observable on Linux.)
`flat_set<T>` to allow `.contains(value)` where `value` has a different type than `T` but is still comparable.
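As an illustration of what a heterogeneous `.contains(value)` can look like, here is a minimal sketch. `flat_set_sketch` is a hypothetical sorted-vector set written for this example, not libmamba's `flat_set`; it assumes a transparent comparator (`std::less<>`) so that `value` can be compared against `T` without first being converted to `T`.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <initializer_list>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical minimal sorted-vector set; NOT libmamba's actual flat_set.
template <typename T, typename Compare = std::less<>>
class flat_set_sketch
{
public:

    flat_set_sketch(std::initializer_list<T> values)
        : m_data(values)
    {
        Compare comp{};
        std::sort(m_data.begin(), m_data.end(), comp);
        // After sorting, two neighbours a <= b are equivalent when !(a < b).
        auto equivalent = [&comp](const T& a, const T& b) { return !comp(a, b); };
        m_data.erase(std::unique(m_data.begin(), m_data.end(), equivalent), m_data.end());
    }

    // Accepts any U that Compare can order against T, so no temporary T is built.
    template <typename U>
    bool contains(const U& value) const
    {
        Compare comp{};
        auto it = std::lower_bound(m_data.begin(), m_data.end(), value, comp);
        return it != m_data.end() && !comp(value, *it);
    }

    std::size_t size() const
    {
        return m_data.size();
    }

private:

    std::vector<T> m_data;
};
```

For example, a `flat_set_sketch<std::string>` can be queried with a `std::string_view` without allocating a temporary `std::string`, because `std::less<>` forwards both operands to `operator<` unchanged.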