Support writing directly to parquet files #61

Merged
merged 6 commits into from
Mar 29, 2025

Conversation

alamb
Collaborator

@alamb alamb commented Mar 24, 2025

This PR adds the ability to generate Parquet files directly, using all available cores with minimal buffering.

Among other things, it can generate the entire TPC-H SF 100 dataset in less than a minute with less than 400MB of peak memory usage (as measured by top).

My measurements show it writes ZSTD-compressed Parquet at about 0.5 GB/sec.
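To illustrate what "all available cores with minimal buffering" can look like, here is a minimal sketch using only the Rust standard library. It is not the code in this PR: `generate_chunk` and `generate_parallel` are hypothetical names, rows are plain strings rather than Arrow batches, and the `sink` closure stands in for the Parquet writer. Each worker owns a bounded channel of capacity 1, so only a small, fixed number of finished chunks per worker is ever held in memory, and the consumer reads the channels round-robin so chunks arrive in order.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical stand-in for generating one chunk of table rows.
fn generate_chunk(chunk: usize, rows_per_chunk: usize) -> Vec<String> {
    let start = chunk * rows_per_chunk;
    (start..start + rows_per_chunk)
        .map(|i| format!("row-{i}"))
        .collect()
}

/// Generate `num_chunks` chunks on `num_threads` workers and pass them to
/// `sink` in chunk order. Memory stays bounded because each worker blocks
/// on `send` once its capacity-1 channel is full.
pub fn generate_parallel(
    num_chunks: usize,
    rows_per_chunk: usize,
    num_threads: usize,
    mut sink: impl FnMut(Vec<String>),
) {
    let num_threads = num_threads.max(1);
    thread::scope(|s| {
        // One bounded channel per worker; worker t produces chunks
        // t, t + num_threads, t + 2 * num_threads, ...
        let receivers: Vec<_> = (0..num_threads)
            .map(|t| {
                let (tx, rx) = sync_channel::<Vec<String>>(1);
                s.spawn(move || {
                    for chunk in (t..num_chunks).step_by(num_threads) {
                        // Stop early if the consumer has gone away.
                        if tx.send(generate_chunk(chunk, rows_per_chunk)).is_err() {
                            break;
                        }
                    }
                });
                rx
            })
            .collect();

        // Reading the receivers round-robin restores the original chunk order.
        for chunk in 0..num_chunks {
            let batch = receivers[chunk % num_threads]
                .recv()
                .expect("worker exited before producing its chunk");
            sink(batch);
        }
    });
}
```

A caller could pick the worker count with `std::thread::available_parallelism()` and pass a closure that appends each batch to the output file.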

Example

 andrewlamb@Andrews-MacBook-Pro-2:~/Software/tpchgen-rs$ time target/release/tpchgen-cli -v --scale-factor=100 --format=parquet --output-dir=/tmp/repo
[2025-03-28T00:30:52Z INFO  tpchgen_cli] Verbose output enabled (ignoring RUST_LOG environment variable)
[2025-03-28T00:30:53Z INFO  tpchgen_cli] Created static distributions and text pools in 806.286541ms
[2025-03-28T00:30:53Z INFO  tpchgen_cli] Writing table nation (SF=100) to nation.parquet
[2025-03-28T00:30:53Z INFO  tpchgen_cli::statistics] Completed in 461.291µs (0.01 GB/sec)
[2025-03-28T00:30:53Z INFO  tpchgen_cli] Writing table region (SF=100) to region.parquet
[2025-03-28T00:30:53Z INFO  tpchgen_cli::statistics] Completed in 175.75µs (0.01 GB/sec)
[2025-03-28T00:30:53Z INFO  tpchgen_cli] Writing table part (SF=100) to part.parquet
[2025-03-28T00:30:54Z INFO  tpchgen_cli::statistics] Completed in 1.829104083s (0.26 GB/sec)
[2025-03-28T00:30:54Z INFO  tpchgen_cli] Writing table supplier (SF=100) to supplier.parquet
[2025-03-28T00:30:55Z INFO  tpchgen_cli::statistics] Completed in 129.034209ms (0.43 GB/sec)
[2025-03-28T00:30:55Z INFO  tpchgen_cli] Writing table partsupp (SF=100) to partsupp.parquet
[2025-03-28T00:30:59Z INFO  tpchgen_cli::statistics] Completed in 3.936936958s (0.73 GB/sec)
[2025-03-28T00:30:59Z INFO  tpchgen_cli] Writing table customer (SF=100) to customer.parquet
[2025-03-28T00:31:00Z INFO  tpchgen_cli::statistics] Completed in 1.129056875s (0.76 GB/sec)
[2025-03-28T00:31:00Z INFO  tpchgen_cli] Writing table orders (SF=100) to orders.parquet
[2025-03-28T00:31:09Z INFO  tpchgen_cli::statistics] Completed in 8.914675s (0.53 GB/sec)
[2025-03-28T00:31:09Z INFO  tpchgen_cli] Writing table lineitem (SF=100) to lineitem.parquet
[2025-03-28T00:31:46Z INFO  tpchgen_cli::statistics] Completed in 36.228818167s (0.54 GB/sec)
[2025-03-28T00:31:46Z INFO  tpchgen_cli] Generation complete!

real	0m53.795s
user	10m23.254s
sys	1m47.903s
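
As a rough reading of the `time` output above, dividing the CPU time (user + sys) by the wall-clock time gives the average number of cores kept busy. The helper names below are hypothetical, and the result is plain arithmetic on the printed numbers, not a separate measurement.

```rust
/// Convert minutes and seconds as printed by `time` (e.g. 10m23.254s) to seconds.
pub fn to_secs(minutes: u32, seconds: f64) -> f64 {
    f64::from(minutes) * 60.0 + seconds
}

/// Average number of busy cores: CPU seconds (user + sys) per wall-clock second.
pub fn effective_cores(real_secs: f64, user_secs: f64, sys_secs: f64) -> f64 {
    (user_secs + sys_secs) / real_secs
}
```

With the values above, `effective_cores(to_secs(0, 53.795), to_secs(10, 23.254), to_secs(1, 47.903))` comes to about 13.6, consistent with the run keeping many cores busy for most of its duration.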

Then check out the output

andrewlamb@Andrews-MacBook-Pro-2:~/Software/tpchgen-rs$ datafusion-cli -c "select * from '/tmp/repo/lineitem.parquet' limit 10"
DataFusion CLI v46.0.1
+------------+-----------+-----------+--------------+------------+-----------------+------------+-------+--------------+--------------+------------+--------------+---------------+-------------------+------------+----------------------------------------+
| l_orderkey | l_partkey | l_suppkey | l_linenumber | l_quantity | l_extendedprice | l_discount | l_tax | l_returnflag | l_linestatus | l_shipdate | l_commitdate | l_receiptdate | l_shipinstruct    | l_shipmode | l_comment                              |
+------------+-----------+-----------+--------------+------------+-----------------+------------+-------+--------------+--------------+------------+--------------+---------------+-------------------+------------+----------------------------------------+
| 112673607  | 14299286  | 549301    | 1            | 29.00      | 37252.53        | 0.08       | 0.06  | N            | O            | 1996-01-12 | 1995-12-10   | 1996-02-10    | DELIVER IN PERSON | MAIL       | lly bold ideas ar                      |
| 112673607  | 8086586   | 586603    | 2            | 1.00       | 1572.18         | 0.10       | 0.05  | N            | O            | 1995-10-15 | 1995-12-01   | 1995-11-13    | DELIVER IN PERSON | MAIL       | ts sleep carefully a                   |
| 112673607  | 11234349  | 234350    | 3            | 33.00      | 42331.74        | 0.06       | 0.08  | N            | O            | 1995-11-23 | 1995-12-28   | 1995-12-20    | COLLECT COD       | AIR        | ngly pending theodolites lose          |
| 112673607  | 15699713  | 449759    | 4            | 44.00      | 75324.92        | 0.01       | 0.01  | N            | O            | 1995-12-25 | 1995-12-29   | 1996-01-05    | DELIVER IN PERSON | FOB        | inal plate                             |
| 112673632  | 6085764   | 85765     | 1            | 50.00      | 87473.00        | 0.05       | 0.06  | N            | O            | 1996-04-22 | 1996-07-02   | 1996-05-01    | DELIVER IN PERSON | MAIL       |  ironic de                             |
| 112673632  | 3422639   | 922646    | 2            | 40.00      | 62458.40        | 0.10       | 0.05  | N            | O            | 1996-04-30 | 1996-05-29   | 1996-05-17    | COLLECT COD       | MAIL       |  ironic ide                            |
| 112673632  | 4283483   | 33496     | 3            | 29.00      | 42521.83        | 0.09       | 0.08  | N            | O            | 1996-06-11 | 1996-05-14   | 1996-06-17    | DELIVER IN PERSON | SHIP       | odolites above the s                   |
| 112673633  | 12683461  | 433498    | 1            | 9.00       | 12994.47        | 0.06       | 0.05  | N            | O            | 1997-09-03 | 1997-10-08   | 1997-09-30    | NONE              | REG AIR    | s. carefully even accounts haggle care |
| 112673633  | 10912554  | 162565    | 2            | 41.00      | 64206.41        | 0.05       | 0.00  | N            | O            | 1997-11-30 | 1997-11-27   | 1997-12-14    | DELIVER IN PERSON | AIR        | lithely blithely pending platelets. fu |
| 112673633  | 7287863   | 537871    | 3            | 14.00      | 25907.00        | 0.02       | 0.00  | N            | O            | 1997-09-22 | 1997-10-27   | 1997-10-16    | NONE              | MAIL       | xes use slyly special deposits. q      |
+------------+-----------+-----------+--------------+------------+-----------------+------------+-------+--------------+--------------+------------+--------------+---------------+-------------------+------------+----------------------------------------+
10 row(s) fetched.
Elapsed 0.140 seconds.

andrewlamb@Andrews-MacBook-Pro-2:~/Software/tpchgen-rs$ du -s -h /tmp/repo/*.parquet
896M	/tmp/repo/customer.parquet
 20G	/tmp/repo/lineitem.parquet
4.0K	/tmp/repo/nation.parquet
4.8G	/tmp/repo/orders.parquet
480M	/tmp/repo/part.parquet
2.9G	/tmp/repo/partsupp.parquet
4.0K	/tmp/repo/region.parquet
 57M	/tmp/repo/supplier.parquet
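
The file sizes above can be cross-checked against the logged throughput. The sketch below assumes that "GB" means 2^30 bytes and that the `du` sizes approximate the bytes written; neither is confirmed by the log, and `gb_per_sec` is a hypothetical helper, not the reporter in `statistics.rs`.

```rust
use std::time::Duration;

/// Throughput in GB/sec, assuming 1 GB = 2^30 bytes.
pub fn gb_per_sec(bytes: u64, elapsed: Duration) -> f64 {
    const GB: f64 = 1024.0 * 1024.0 * 1024.0;
    bytes as f64 / GB / elapsed.as_secs_f64()
}
```

Under those assumptions, 4.8G for orders.parquet over 8.914675s works out to about 0.54 GB/sec, and 20G for lineitem.parquet over 36.228818167s to about 0.55 GB/sec. Both are close to the logged 0.53 and 0.54 GB/sec, given that `du` rounds the sizes.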

TODO:

  • Avoid so many string copies for low-cardinality string values (precompute StringViews) to make generation faster
  • Tests somehow

@alamb alamb changed the title Alamb/parquet (POC) Support writing directly to parquet files Mar 24, 2025
@alamb alamb force-pushed the alamb/parquet branch 3 times, most recently from 737fbfd to 7b9d6dc Compare March 25, 2025 13:53
@alamb alamb force-pushed the alamb/parquet branch 2 times, most recently from 3272c4a to 32b800f Compare March 27, 2025 15:22
@alamb alamb changed the title (POC) Support writing directly to parquet files Support writing directly to parquet files Mar 28, 2025
@alamb alamb marked this pull request as ready for review March 28, 2025 10:34
@alamb alamb requested review from Copilot and clflushopt March 28, 2025 10:34
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds support for writing directly to Parquet files with parallel generation and configurable compression, improving performance and resource usage.

  • Adds a new Parquet output format implementation in tpchgen-cli.
  • Introduces a new statistics reporter for performance logging.
  • Updates the generation logic to support configurable thread counts.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

File                            Description
tpchgen/Cargo.toml              Minor dependency update.
tpchgen-cli/src/statistics.rs   Adds a WriteStatistics struct with performance logging.
tpchgen-cli/src/parquet.rs      Implements Parquet file generation with multithreading.
tpchgen-cli/src/main.rs         Integrates Parquet support and updates CLI options.
tpchgen-cli/src/generate.rs     Updates parallel generation to use configurable threads.
tpchgen-cli/Cargo.toml          Adds required dependencies for Arrow and Parquet.

Owner

@clflushopt clflushopt left a comment

LGTM thanks @alamb

@alamb alamb merged commit 39b125f into clflushopt:main Mar 29, 2025
7 checks passed
@alamb alamb deleted the alamb/parquet branch March 29, 2025 10:17
@alamb
Collaborator Author

alamb commented Mar 29, 2025

Thanks @clflushopt

Development

Successfully merging this pull request may close these issues.

[FEATURE] Directly write parquet
2 participants