
Bulk import should be able to import at higher levels of the partition tree #4513

Open
gaffer01 opened this issue Mar 24, 2025 · 0 comments
Labels: bulk-import-module, enhancement (New feature or request)

Comments

@gaffer01
Member

Background

Bulk import jobs use Spark to partition the data and then sort it within each partition. The partitioning matches the current leaf partitions of the Sleeper table. This makes sense in most situations, but when there is a large number of leaf partitions it can take a long time to write one file per leaf. In that case it would make more sense to choose a set of partitions that are not leaf partitions and write one file for each of those.

Description

Add an option to bulk import jobs that specifies roughly how many partitions should be used for writing the data. Walk down the partition tree until a level is found with roughly that many partitions, then run the bulk import job partitioning the data according to those partitions. At the end of the job, for each file F that was written for a partition P, add a file reference to F for each leaf partition L underneath P (see the sketch below).
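
A minimal sketch of the two steps, assuming a simple in-memory partition tree. The Partition and FileReference types here are hypothetical stand-ins for illustration only, not Sleeper's actual partition or state store API:

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionLevelChooser {

    /** Hypothetical minimal partition node: an ID plus child partitions. */
    public static class Partition {
        final String id;
        final List<Partition> children = new ArrayList<>();

        Partition(String id) {
            this.id = id;
        }

        boolean isLeaf() {
            return children.isEmpty();
        }
    }

    /**
     * Walks down from the root, level by level, until a level has at least
     * the requested number of partitions (or only leaves remain). Leaves
     * reached before that level stay in the result, since data under them
     * cannot be split any further.
     */
    public static List<Partition> choosePartitions(Partition root, int targetCount) {
        List<Partition> level = List.of(root);
        while (level.size() < targetCount) {
            List<Partition> next = new ArrayList<>();
            boolean descended = false;
            for (Partition p : level) {
                if (p.isLeaf()) {
                    next.add(p); // cannot descend past a leaf
                } else {
                    next.addAll(p.children);
                    descended = true;
                }
            }
            if (!descended) {
                break; // every partition at this level is a leaf
            }
            level = next;
        }
        return level;
    }

    /** Hypothetical reference linking a file to one leaf partition. */
    public record FileReference(String filename, String leafPartitionId) {}

    /**
     * After the job writes one file per chosen partition P, the file is
     * referenced from every leaf partition underneath P.
     */
    public static List<FileReference> referencesForFile(String filename, Partition writtenFor) {
        List<FileReference> refs = new ArrayList<>();
        collectLeaves(writtenFor, filename, refs);
        return refs;
    }

    private static void collectLeaves(Partition p, String filename, List<FileReference> refs) {
        if (p.isLeaf()) {
            refs.add(new FileReference(filename, p.id));
        } else {
            p.children.forEach(child -> collectLeaves(child, filename, refs));
        }
    }
}
```

With this approach a job targeting, say, 100 partitions writes 100 files instead of one per leaf, and the state store still records a reference per leaf so that queries and compactions see the data under every leaf partition.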

@gaffer01 added the bulk-import-module and enhancement (New feature or request) labels on Mar 24, 2025