
Bulk import should be able to import at higher levels of the partition tree #4513

Open
gaffer01 opened this issue Mar 24, 2025 · 0 comments
Labels: bulk-import-module, enhancement (New feature or request)

Comments

@gaffer01
Member

Background

Bulk import jobs use Spark to partition the data and then sort it within each partition. The partitioning matches the current leaf partitions of the Sleeper table. This makes sense in most situations, but when there is a large number of leaf partitions it can take a long time to write one file per leaf. In that case it would make more sense to choose a set of partitions that are not leaf partitions and write one file for each of those.

Description

Add an option to bulk import jobs that specifies roughly how many partitions should be used for writing the data. Walk down the partition tree until a level is found with roughly that many partitions, then run the bulk import job partitioning the data according to those partitions. At the end of the job, for each file F that was written for a partition P, add a file reference to F for each leaf partition L underneath P (see the sketch below).
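
A minimal sketch of the two steps, assuming a simple in-memory partition tree. The Partition and FileReference types here are hypothetical stand-ins for illustration only, not Sleeper's actual partition or state store API:

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionLevelChooser {

    /** Hypothetical minimal partition node: an ID plus child partitions. */
    public static class Partition {
        final String id;
        final List<Partition> children = new ArrayList<>();

        Partition(String id) {
            this.id = id;
        }

        boolean isLeaf() {
            return children.isEmpty();
        }
    }

    /**
     * Walks down from the root, level by level, until a level has at least
     * the requested number of partitions (or only leaves remain). Leaves
     * reached before that level stay in the result, since data under them
     * cannot be split any further.
     */
    public static List<Partition> choosePartitions(Partition root, int targetCount) {
        List<Partition> level = List.of(root);
        while (level.size() < targetCount) {
            List<Partition> next = new ArrayList<>();
            boolean descended = false;
            for (Partition p : level) {
                if (p.isLeaf()) {
                    next.add(p); // cannot descend past a leaf
                } else {
                    next.addAll(p.children);
                    descended = true;
                }
            }
            if (!descended) {
                break; // every partition at this level is a leaf
            }
            level = next;
        }
        return level;
    }

    /** Hypothetical reference linking a file to one leaf partition. */
    public record FileReference(String filename, String leafPartitionId) {}

    /**
     * After the job writes one file per chosen partition P, the file is
     * referenced from every leaf partition underneath P.
     */
    public static List<FileReference> referencesForFile(String filename, Partition writtenFor) {
        List<FileReference> refs = new ArrayList<>();
        collectLeaves(writtenFor, filename, refs);
        return refs;
    }

    private static void collectLeaves(Partition p, String filename, List<FileReference> refs) {
        if (p.isLeaf()) {
            refs.add(new FileReference(filename, p.id));
        } else {
            p.children.forEach(child -> collectLeaves(child, filename, refs));
        }
    }
}
```

With this approach a job targeting, say, 100 partitions writes 100 files instead of one per leaf, and the state store still records a reference per leaf so that queries and compactions see the data under every leaf partition.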

@gaffer01 added the bulk-import-module and enhancement (New feature or request) labels on Mar 24, 2025