Support writing directly to parquet files #61
Pull Request Overview
This PR adds support for writing directly to Parquet files with parallel generation and configurable compression, improving performance and resource usage.
- Adds a new Parquet output format implementation in tpchgen-cli.
- Introduces a new statistics reporter for performance logging.
- Updates the generation logic to support configurable thread counts.
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tpchgen/Cargo.toml | Minor dependency update. |
| tpchgen-cli/src/statistics.rs | Adds a WriteStatistics struct with performance logging. |
| tpchgen-cli/src/parquet.rs | Implements Parquet file generation with multithreading. |
| tpchgen-cli/src/main.rs | Integrates Parquet support and updates CLI options. |
| tpchgen-cli/src/generate.rs | Updates parallel generation to use configurable threads. |
| tpchgen-cli/Cargo.toml | Adds required dependencies for Arrow and Parquet. |
LGTM thanks @alamb
Thanks @clflushopt
This PR adds the ability to generate parquet files directly, using all available cores with minimal buffering.
Among other things, it can generate the entire TPC-H SF 100 dataset in less than a minute with less than 400 MB peak memory usage (as measured by top).
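As a rough illustration of the "all cores, minimal buffering" idea, here is a std-only sketch. It is not the code in this PR: `generate_chunk` and `write_parallel` are hypothetical names, and plain byte buffers stand in for encoded Parquet row groups. Workers generate chunks in parallel and hand them to a single writer over a bounded channel, so generation slows down when the writer falls behind instead of piling up data in memory.

```rust
use std::collections::BTreeMap;
use std::io::{self, Write};
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical stand-in for generating and encoding one chunk of a table.
fn generate_chunk(part: usize) -> Vec<u8> {
    format!("row-group-{part}\n").into_bytes()
}

/// Generates `num_parts` chunks on `num_threads` workers and writes them,
/// in part order, to `out`.
fn write_parallel<W: Write>(
    out: &mut W,
    num_parts: usize,
    num_threads: usize,
) -> io::Result<()> {
    let num_threads = num_threads.max(1);
    let next_part = AtomicUsize::new(0);
    // Bounded channel: a worker blocks in `send` when the writer is behind.
    let (tx, rx) = sync_channel::<(usize, Vec<u8>)>(num_threads);

    thread::scope(|s| {
        for _ in 0..num_threads {
            let tx = tx.clone();
            let next_part = &next_part;
            s.spawn(move || loop {
                // Claim the next unprocessed part index.
                let part = next_part.fetch_add(1, Ordering::Relaxed);
                if part >= num_parts {
                    break;
                }
                if tx.send((part, generate_chunk(part))).is_err() {
                    break; // the writer stopped (for example on an I/O error)
                }
            });
        }
        // Drop the original sender so `rx` ends once all workers finish.
        drop(tx);

        // Chunks can arrive out of order; hold early ones until their turn.
        // A real implementation would also cap the size of this map.
        let mut pending = BTreeMap::new();
        let mut next_to_write = 0;
        for (part, bytes) in rx {
            pending.insert(part, bytes);
            while let Some(bytes) = pending.remove(&next_to_write) {
                out.write_all(&bytes)?;
                next_to_write += 1;
            }
        }
        Ok(())
    })
}
```

Calling `write_parallel(&mut file, parts, threads)` with any `Write` target (a `File`, a `Vec<u8>`) produces the same bytes regardless of thread count, since the writer reorders chunks before writing them.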
My measurements show it can do about 0.5 G/sec for ZSTD parquet.
Example
Then check out the output
TODO: