Add PDS-DS Query 1 #19131

Matt711 · 2025-06-10T20:52:03Z

Description

Contributes to #19125.

The new query (DuckDB and Polars versions) is in pdsds.py.
The changes to pdsh.py only move out the common logic that it shares with pdsds.py.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2025-06-10T20:52:07Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Matt711 · 2025-06-11T13:40:30Z

python/cudf_polars/cudf_polars/experimental/benchmarks/pdsh.py

-                    except AssertionError as e:
-                        validation_failures.append(q_id)
-                        print(f"❌ Query {q_id} failed validation!\n{e}")
+            if args.validate and run_config.executor != "cpu":


Sign post: There are no duckdb queries in this module to validate against, which is why we validate against CPU for pdsh.

wence- · 2025-06-19T16:27:11Z

python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds.py

+            .select(["c_customer_id"])
+            .sort("c_customer_id")
+            .limit(100)
+        )


There are more queries here than in pdsh, and each one is longer than its pdsh counterpart. I think we should make a module with one file per query and import them here.

That is, a directory structure like benchmarks/pdsds/{q1, ..., q99}.py. That would also make it easier to navigate and review the individual queries.
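A minimal sketch of how a one-file-per-query layout could be consumed, assuming each query lives in its own module named q1 through q99. The package name `pdsds_queries` and the helper `load_query_module` are hypothetical and are not taken from the PR.

```python
import importlib
from types import ModuleType


def load_query_module(q_id: int, package: str = "pdsds_queries") -> ModuleType:
    """Import the module holding query `q_id` from a one-file-per-query package.

    `package` is a hypothetical name; the real layout would sit under benchmarks/pdsds/.
    """
    if not 1 <= q_id <= 99:
        raise ValueError(f"PDS-DS query id must be in 1..99, got {q_id}")
    # Each query is imported lazily, so only the requested modules are loaded.
    return importlib.import_module(f"{package}.q{q_id}")
```

With this shape, the benchmark driver only needs the query id; adding a query means adding a file rather than growing one large module.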

wence- · 2025-06-19T16:32:41Z

python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds.py

+    for filename in os.listdir(dataset_path):
+        if filename.endswith(".parquet"):
+            table_name = filename.replace(".parquet", "")
+            parquet_path = Path(dataset_path) / filename
+            create_view_sql = f"CREATE VIEW {table_name} AS SELECT * FROM read_parquet('{parquet_path}');"
+            create_statements.append(create_view_sql)


statements = [
    f"CREATE VIEW {table.stem} as SELECT * FROM read_parquet('{table.absolute()}')"
    for table in Path(dataset_path).glob("*.parquet")
]
statements.append(query)
return conn.execute("\n".join(statements)).pl()

?
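The statement-building half of that suggestion can be sketched without a DuckDB connection. The helper name `build_view_statements` is hypothetical. Each statement here ends with a semicolon, as in the original diff, on the assumption that the statements are later joined into one string and executed together.

```python
from pathlib import Path
from typing import Union


def build_view_statements(dataset_path: Union[str, Path]) -> list[str]:
    """Return one CREATE VIEW statement per Parquet file in `dataset_path`."""
    return [
        # table.stem drops the ".parquet" suffix, giving the view name.
        f"CREATE VIEW {table.stem} AS SELECT * FROM read_parquet('{table.absolute()}');"
        # sorted() makes the statement order deterministic across runs.
        for table in sorted(Path(dataset_path).glob("*.parquet"))
    ]
```

Using `Path.glob` and `.stem` replaces the manual `os.listdir`, `endswith`, and `replace` calls from the diff, and files without the `.parquet` suffix are skipped automatically.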

wence- · 2025-06-19T16:34:04Z

python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds.py

+        print(f"Completed {q_id} in {t1 - t0:.2f} seconds")
+
+        if args.print_results:
+            print(result)


Please track and store, in a structured format, the timing and other relevant run config data. (This is done for the polars run below; I suppose we should do it for duckdb as well.)
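One possible shape for such a structured record, as a sketch only. The `RunRecord` class and its fields are hypothetical and do not mirror the run config class already used for the polars run.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Callable


@dataclass
class RunRecord:
    """Hypothetical container for per-query timings of one benchmark run."""

    engine: str  # e.g. "duckdb" or "polars"
    durations: dict[int, list[float]] = field(default_factory=dict)

    def time_query(self, q_id: int, run: Callable[[], Any]) -> Any:
        """Call `run`, store its wall-clock duration under `q_id`, return its result."""
        t0 = time.perf_counter()
        result = run()
        self.durations.setdefault(q_id, []).append(time.perf_counter() - t0)
        return result

    def to_json(self) -> str:
        """Serialize the record; integer query ids become string keys in JSON."""
        return json.dumps(asdict(self), indent=2)
```

A DuckDB loop could then call `record.time_query(q_id, lambda: execute_duckdb_query(...))`, where `execute_duckdb_query` stands in for whatever function runs the query, and write `record.to_json()` to a file instead of only printing the elapsed time.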

wence- · 2025-06-19T16:35:21Z

python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds.py

+
+    for q_id in run_config.queries:
+        try:
+            q = getattr(PDSDSPolarsQueries, f"q{q_id}")(run_config)


This is the only line that is different from the equivalent function in pdsh.py. Let's refactor so we have a function run_polars(q: pl.LazyFrame, options: ...) and can reuse that in both files.

That way we don't have two places where we have to remember to update config options for the executors.
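The refactor described above could take roughly the following shape. This is a sketch that avoids importing Polars: `run_queries`, `get_query`, and `execute` are hypothetical names, and `execute` stands in for the proposed `run_polars` function.

```python
from typing import Any, Callable, Iterable


def run_queries(
    query_ids: Iterable[int],
    get_query: Callable[[int], Any],
    execute: Callable[[Any], Any],
) -> dict[int, Any]:
    """Run each query through one shared execution path.

    `get_query` is the only benchmark-specific piece (pdsh vs. pdsds);
    `execute` holds the executor configuration in a single place.
    """
    results: dict[int, Any] = {}
    for q_id in query_ids:
        q = get_query(q_id)  # e.g. look up q{q_id} on the benchmark's query class
        results[q_id] = execute(q)
    return results
```

Under this layout, pdsds.py would pass something like `lambda q_id: getattr(PDSDSPolarsQueries, f"q{q_id}")(run_config)` as `get_query`, and pdsh.py would pass the equivalent lookup on its own query class.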

wence- · 2025-06-19T16:36:02Z

python/cudf_polars/cudf_polars/experimental/benchmarks/utils.py

+        color="green",
+    ):
+        if run_config.executor == "cpu":
+            return q.collect(new_streaming=True)


This should be:

Suggested change

-            return q.collect(new_streaming=True)
+            return q.collect(engine="streaming")

wence- · 2025-06-19T16:37:01Z

python/cudf_polars/cudf_polars/experimental/benchmarks/utils.py

+
+        else:
+            raise RuntimeError(
+                "Cannot provide debug information because cudf_polars is not installed."


I know this is code movement, but this is a weird error message. I think the message should just report that the requested engine is not supported?

wence- · 2025-06-19T16:39:39Z

python/cudf_polars/cudf_polars/experimental/benchmarks/utils.py

+    if run_config.executor == "cpu":
+        if args.explain_logical:
+            print(f"\nQuery {q_id} - Logical plan\n")
+            print(q.explain())


One can explain the streaming physical plan with q.show_graph(engine="streaming", plan_stage="physical") (needs graphviz...)

Matt711 added 3 commits June 10, 2025 10:44

Move pdsh utility functions/classes to a seperate module

9a411fc

merge conflict

2c44f13

Make more utility functions for pdsh benchmarks

43f067e

Matt711 added the feature request and non-breaking labels Jun 10, 2025

github-actions bot assigned Matt711 Jun 10, 2025

github-actions bot added the Python and cudf-polars labels Jun 10, 2025

github-project-automation bot added this to cuDF Python Jun 10, 2025

GPUtester moved this to In Progress in cuDF Python Jun 10, 2025

add query 1

c39dc99

Matt711 marked this pull request as ready for review June 10, 2025 23:35

Matt711 requested a review from a team as a code owner June 10, 2025 23:35

Matt711 requested review from vyasr and bdice June 10, 2025 23:35

Matt711 and others added 2 commits June 11, 2025 08:49

Merge branch 'branch-25.08' into imp/polars/pdsh/refactor

646b1c4

code cov

d2ef8ee

Matt711 requested a review from a team as a code owner June 11, 2025 13:31

mypy check

b984b76

Matt711 commented Jun 11, 2025

Merge branch 'branch-25.08' into imp/polars/pdsh/refactor

5f8e8f5

GregoryKimball requested a review from wence- June 19, 2025 15:13

wence- reviewed Jun 19, 2025

wence- requested changes Jun 19, 2025

merge conflict

1a5e307

Matt711 requested review from a team as code owners June 25, 2025 14:59

Matt711 marked this pull request as draft June 25, 2025 15:00

github-actions bot added the libcudf, CMake, Java, cudf.pandas, and pylibcudf labels Jun 25, 2025

merge conflict

7b33a43

github-actions bot removed the libcudf, CMake, Java, and pylibcudf labels Jun 25, 2025
