Add PDS-DS Query 1 #19131

Draft
wants to merge 10 commits into
base: branch-25.08
from

Conversation

Matt711
Contributor

@Matt711 Matt711 commented Jun 10, 2025

Description

Contributes to #19125.

  • The new query (DuckDB and Polars versions) is in pdsds.py
  • The changes to pdsh.py only move out the common logic that it shares with pdsds.py

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@Matt711 Matt711 added the feature request and non-breaking labels Jun 10, 2025

copy-pr-bot bot commented Jun 10, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added the Python and cudf-polars labels Jun 10, 2025
@GPUtester GPUtester moved this to In Progress in cuDF Python Jun 10, 2025
@Matt711 Matt711 marked this pull request as ready for review June 10, 2025 23:35
@Matt711 Matt711 requested a review from a team as a code owner June 10, 2025 23:35
@Matt711 Matt711 requested review from vyasr and bdice June 10, 2025 23:35
@Matt711 Matt711 requested a review from a team as a code owner June 11, 2025 13:31
except AssertionError as e:
validation_failures.append(q_id)
print(f"❌ Query {q_id} failed validation!\n{e}")
if args.validate and run_config.executor != "cpu":
Contributor Author

Sign post: There are no duckdb queries in this module to validate against, which is why we validate against CPU for pdsh.

@GregoryKimball GregoryKimball requested a review from wence- June 19, 2025 15:13
    .select(["c_customer_id"])
    .sort("c_customer_id")
    .limit(100)
)
Contributor

There are more queries here than in pdsh, and each individual query is longer than its pdsh counterpart. I think we should make a package with one file per query and import them here.

That is, a directory structure like benchmarks/pdsds/{q1, ..., q99}.py. That would also make it easier to navigate and review the individual queries.
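
A minimal sketch of how such a layout could be loaded, assuming one module per query named `qN.py` inside a package. The package name `benchmarks.pdsds` follows the directory suggested above; the helpers `query_module_name` and `load_query` are hypothetical and not part of this PR.

```python
import importlib
from types import ModuleType

# Hypothetical package name taken from the suggested directory layout.
QUERY_PACKAGE = "benchmarks.pdsds"


def query_module_name(q_id: int, package: str = QUERY_PACKAGE) -> str:
    """Return the dotted module path for query ``q_id`` (1 through 99)."""
    if not 1 <= q_id <= 99:
        raise ValueError(f"PDS-DS query id must be in 1..99, got {q_id}")
    return f"{package}.q{q_id}"


def load_query(q_id: int, package: str = QUERY_PACKAGE) -> ModuleType:
    """Import the module for a single query only when it is requested."""
    return importlib.import_module(query_module_name(q_id, package))
```

Importing lazily like this keeps the runner's start-up cost independent of the number of queries, at the price of surfacing import errors only when a given query is run.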

Comment on lines +141 to +146
for filename in os.listdir(dataset_path):
    if filename.endswith(".parquet"):
        table_name = filename.replace(".parquet", "")
        parquet_path = Path(dataset_path) / filename
        create_view_sql = f"CREATE VIEW {table_name} AS SELECT * FROM read_parquet('{parquet_path}');"
        create_statements.append(create_view_sql)
Contributor

statements = [
    f"CREATE VIEW {table.stem} AS SELECT * FROM read_parquet('{table.absolute()}')"
    for table in Path(dataset_path).glob("*.parquet")
]
statements.append(query)
# Separate the statements with semicolons so they parse as a multi-statement script.
return conn.execute(";\n".join(statements)).pl()

?
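
A self-contained sketch of the same idea using only the standard library, so the statement building can be checked without a DuckDB connection. The helper names `create_view_statements` and `build_script` are hypothetical, and sorting the files is an added choice to make the output order deterministic. The resulting string is meant to be passed to something like `conn.execute(...)`, assuming the connection accepts a semicolon-separated script.

```python
from pathlib import Path


def create_view_statements(dataset_path: str) -> list[str]:
    """Build one CREATE VIEW statement per Parquet file in ``dataset_path``."""
    return [
        # Path.stem drops the ".parquet" suffix, which gives the table name.
        f"CREATE VIEW {table.stem} AS "
        f"SELECT * FROM read_parquet('{table.absolute()}')"
        # sorted() only makes the statement order deterministic.
        for table in sorted(Path(dataset_path).glob("*.parquet"))
    ]


def build_script(dataset_path: str, query: str) -> str:
    """Join the view definitions and the query into one SQL script."""
    statements = [
        *create_view_statements(dataset_path),
        query.rstrip().rstrip(";"),
    ]
    # Every statement gets its own terminator in the combined script.
    return ";\n".join(statements) + ";"
```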

print(f"Completed {q_id} in {t1 - t0:.2f} seconds")

if args.print_results:
    print(result)
Contributor

Please track and store, in a structured format, the timing and other relevant run config data. This is already done for the Polars run below; I suppose we should do it for DuckDB as well.
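
One possible shape for such a record, as a rough sketch. The class names `QueryRecord` and `RunRecord` and all field names are hypothetical; the existing Polars path may store different fields.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Callable


@dataclass
class QueryRecord:
    """Wall-clock timing for one query (illustrative fields)."""

    query: int
    duration: float


@dataclass
class RunRecord:
    """Run-level metadata plus per-query timings (hypothetical schema)."""

    engine: str  # for example "duckdb" or "polars"
    dataset_path: str
    records: list[QueryRecord] = field(default_factory=list)

    def time_query(self, q_id: int, fn: Callable[[], Any]) -> Any:
        """Run ``fn`` once, record how long it took, and return its result."""
        t0 = time.perf_counter()
        result = fn()
        elapsed = time.perf_counter() - t0
        self.records.append(QueryRecord(query=q_id, duration=elapsed))
        return result

    def to_json(self) -> str:
        """Serialize the whole run, including nested records, to JSON."""
        return json.dumps(asdict(self), indent=2)
```

With something like this, the DuckDB loop could call `run.time_query(q_id, lambda: ...)` around each query and write `run.to_json()` to disk at the end.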


for q_id in run_config.queries:
    try:
        q = getattr(PDSDSPolarsQueries, f"q{q_id}")(run_config)
Contributor

This is the only line that is different from the equivalent function in pdsh.py. Let's refactor so we have a function run_polars(q: pl.LazyFrame, options: ...) and can reuse that in both files.

That way we don't have two places where we have to remember to update config options for the executors.
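
Roughly, the shared helper could look like the sketch below. The signature is an assumption: `executor` mirrors `run_config.executor` from the diff, and `gpu_engine` stands in for whatever engine object the caller builds (for example a `pl.GPUEngine`). `q` is typed as `Any` so that the sketch imports without polars installed.

```python
from typing import Any


def run_polars(q: Any, executor: str, gpu_engine: Any = None) -> Any:
    """Collect a lazy query with the engine selected by ``executor``.

    ``q`` is expected to behave like a ``pl.LazyFrame``.
    """
    if executor == "cpu":
        # CPU baseline: the Polars streaming engine.
        return q.collect(engine="streaming")
    if gpu_engine is None:
        # Report the unsupported request instead of a misleading message.
        raise ValueError(
            f"Executor {executor!r} requires a GPU engine, but none was provided."
        )
    return q.collect(engine=gpu_engine)
```

Both pdsh.py and pdsds.py would then build the lazy query themselves and hand it to this one function, so executor options live in a single place.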

    color="green",
):
    if run_config.executor == "cpu":
        return q.collect(new_streaming=True)
Contributor

This should be:

Suggested change
- return q.collect(new_streaming=True)
+ return q.collect(engine="streaming")


else:
    raise RuntimeError(
        "Cannot provide debug information because cudf_polars is not installed."
Contributor

I know this is code movement, but this is a weird error message. I think the message should just report that the requested engine is not supported?

if run_config.executor == "cpu":
    if args.explain_logical:
        print(f"\nQuery {q_id} - Logical plan\n")
        print(q.explain())
Contributor

One can explain the streaming physical plan with q.show_graph(engine="streaming", plan_stage="physical") (needs graphviz...)

@Matt711 Matt711 requested review from a team as code owners June 25, 2025 14:59
@Matt711 Matt711 marked this pull request as draft June 25, 2025 15:00
@github-actions github-actions bot added the libcudf, CMake, Java, cudf.pandas, and pylibcudf labels Jun 25, 2025
@github-actions github-actions bot removed the libcudf, CMake, Java, and pylibcudf labels Jun 25, 2025
Labels
  • cudf.pandas: Issues specific to cudf.pandas
  • cudf-polars: Issues specific to cudf-polars
  • feature request: New feature or request
  • non-breaking: Non-breaking change
  • Python: Affects Python cuDF API.
Projects
Status: In Progress
2 participants