How to project and filter a Dataset simultaneously (out-of-core)? #46781

Open
@JosephWagner

Hi all,

I'm unsure how to filter a Dataset.scanner on projected (derived) columns that are not present in the underlying data. I know I could materialize a table as an intermediate step, but my data is larger than memory, so I'd like to keep everything out-of-core if possible. What is the idiomatic way to do this?

My best attempt so far splits the projection and the filtering into two separate steps. Am I correct in thinking that creating a projected RecordBatchReader and then building a scanner from it will do both the projection and the filtering out-of-core?

from pyarrow import compute as pc
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

table = pa.table({'year': [2019, 2020, 2021]})
pq.write_table(table, "dataset_scanner.parquet")
dataset = ds.dataset("dataset_scanner.parquet")

print("filtered dataset year+1 > 2020:")
print("attempt chained scanning to compute year_plus_one first")
# Step 1: project year_plus_one and expose the result as a streaming reader.
projected = dataset.scanner(
    columns={
        "year": ds.field("year"),
        "year_plus_one": pc.add(ds.field("year"), 1),
    }
).to_reader()
# Step 2: build a second scanner over that reader and filter on the derived column.
print(ds.Scanner.from_batches(projected, filter=ds.field("year_plus_one") > 2020).to_table())

# Single-call attempt, commented out because it raises the error shown below.
##print("Filtered dataset year+1> 2020:")
##print("fails because year_plus_one does not exist in schema to filter with")
##print(dataset.scanner(columns={"year": ds.field("year"), "year_plus_one": pc.add(ds.field("year"), 1)}, filter=ds.field("year_plus_one") > 2020).to_table())

My first attempt (commented out above) did the projection and the filtering in a single scanner call, and it fails with this error:

  File "pyarrow/_dataset.pyx", line 415, in pyarrow._dataset.Dataset.scanner
  File "pyarrow/_dataset.pyx", line 3676, in pyarrow._dataset.Scanner.from_dataset
  File "pyarrow/_dataset.pyx", line 3589, in pyarrow._dataset.Scanner._make_scan_options
  File "pyarrow/_dataset.pyx", line 3519, in pyarrow._dataset._populate_builder
  File "pyarrow/_compute.pyx", line 2864, in pyarrow._compute._bind
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(year_plus_one) in year: int64

Component(s)

Python
