Background
Bulk import jobs use Spark to partition the data and then sort it within each partition. The partitioning matches the current leaf partitions of the Sleeper table. This makes sense in most situations, but when the table has a large number of leaf partitions, writing one file per leaf partition can take a long time. In that case it would make more sense to choose a set of non-leaf partitions and write one file for each of those.
Description
Add an option to bulk import jobs to specify roughly how many partitions should be used for writing the data. Walk down the partition tree until a level is found that contains roughly that many partitions, then run the bulk import job partitioning the data according to those partitions. At the end of the job, for each file F written for a partition P, add a file reference to F for each leaf partition L underneath P.
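A minimal sketch of the two tree operations this implies, assuming a simple partition model (the `Partition` class and method names below are hypothetical, not Sleeper's actual API): choosing a frontier of roughly the requested size, and collecting the leaves under a chosen partition so the file references can be added.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;

// Hypothetical partition model for illustration only; Sleeper's real
// partition classes and state store API will differ.
class Partition {
    final String id;
    final List<Partition> children = new ArrayList<>();

    Partition(String id) {
        this.id = id;
    }

    boolean isLeaf() {
        return children.isEmpty();
    }
}

class PartitionLevelChooser {

    // Walks down from the root, replacing each partition with its children,
    // until the frontier holds at least the target number of partitions or
    // cannot be expanded any further.
    static List<Partition> choosePartitions(Partition root, int target) {
        List<Partition> frontier = new ArrayList<>(List.of(root));
        while (frontier.size() < target) {
            List<Partition> next = new ArrayList<>();
            boolean expanded = false;
            for (Partition p : frontier) {
                if (p.isLeaf()) {
                    next.add(p); // leaves cannot be split further
                } else {
                    next.addAll(p.children);
                    expanded = true;
                }
            }
            if (!expanded) {
                break; // every partition in the frontier is already a leaf
            }
            frontier = next;
        }
        return frontier;
    }

    // Collects the leaf partitions underneath a partition P, so that a file
    // F written for P can be given one file reference per leaf L.
    static List<Partition> leavesUnder(Partition p) {
        List<Partition> leaves = new ArrayList<>();
        Queue<Partition> queue = new LinkedList<>(List.of(p));
        while (!queue.isEmpty()) {
            Partition current = queue.remove();
            if (current.isLeaf()) {
                leaves.add(current);
            } else {
                queue.addAll(current.children);
            }
        }
        return leaves;
    }
}
```

Note that expanding a whole level at a time can overshoot the target; the sketch stops at the first frontier that reaches it, but stopping at whichever level is closest to the target would also be a reasonable choice.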