[SPARK-52450][CONNECT] Improve performance of schema deepcopy #51157
Conversation
Hi @hvanhovell @vicennial, could you take a look at this PR? Thanks.
try:
    self._cached_schema_serialized = CPickleSerializer().dumps(self._schema)
except Exception as e:
    logger.warn(f"DataFrame schema pickle dumps failed with exception: {e}.")
In what cases do we think this will happen?
I think it never happens because the schema is made up of nested Spark type classes, so it shouldn't hit any of the special types that pickle doesn't support (link). Still, maybe we should handle the exception just in case. What do you think?
It would be nice to add some comments to the code to make it easier for future readers to understand. Without looking at the corresponding pull request, it's not clear why a CPickleSerializer is used or why the error is being handled.
Also, it may be clearer to create a function called _fast_cached_schema_deepcopy that caches the serialized schema and then deserializes it.
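A rough illustration of the helper suggested above, as one might prototype it outside the PR. The class name, structure, and fallback behavior are illustrative assumptions rather than the PR's final code; only the use of CPickleSerializer and the fallback-to-deepcopy idea come from the discussion itself.

```python
import copy
from typing import Optional

from pyspark.serializers import CPickleSerializer
from pyspark.sql.types import StructType


class _SchemaCache:
    """Illustrative wrapper: pickle the schema once, deserialize per copy."""

    def __init__(self, schema: StructType) -> None:
        self._schema = schema
        self._serialized: Optional[bytes] = None
        try:
            # Pickling nested Spark type classes is normally safe and cheap;
            # the bytes are cached so every later copy is just a loads() call.
            self._serialized = CPickleSerializer().dumps(schema)
        except Exception:
            # Should be unreachable for plain StructType trees, but keep the
            # deepcopy path as a safety net.
            self._serialized = None

    def fast_deepcopy(self) -> StructType:
        if self._serialized is not None:
            return CPickleSerializer().loads(self._serialized)
        return copy.deepcopy(self._schema)
```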
@@ -1836,11 +1839,26 @@ def _schema(self) -> StructType:
    if self._cached_schema is None:
        query = self._plan.to_proto(self._session.client)
        self._cached_schema = self._session.client.schema(query)
How about we store the proto response instead of doing the pickle? In that case we are guaranteed it will always work. Does that have the same performance problems?
@hvanhovell Just tested: types.proto_schema_to_pyspark_data_type took 6.1s to transform the proto schema 100 times. Pickle is faster by comparison, at 2.0s.
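For context, a standalone sketch of how such a comparison might be reproduced locally. The wide schema is made up, the snippet assumes pyspark_types_to_proto_types and proto_schema_to_pyspark_data_type from pyspark.sql.connect.types for converting between StructType and the Connect proto representation, and it is not the benchmark actually run for this PR.

```python
import time

from pyspark.serializers import CPickleSerializer
from pyspark.sql.connect.types import (
    proto_schema_to_pyspark_data_type,
    pyspark_types_to_proto_types,
)
from pyspark.sql.types import StringType, StructField, StructType

# A made-up wide schema standing in for a "complex" DataFrame schema.
schema = StructType([StructField(f"c{i}", StringType()) for i in range(2000)])
proto = pyspark_types_to_proto_types(schema)

start = time.perf_counter()
for _ in range(100):
    proto_schema_to_pyspark_data_type(proto)  # rebuild from the proto response
print(f"proto -> pyspark: {time.perf_counter() - start:.1f}s")

ser = CPickleSerializer()
blob = ser.dumps(schema)  # serialize once, reuse the bytes
start = time.perf_counter()
for _ in range(100):
    ser.loads(blob)  # pickle-based copy
print(f"pickle loads:     {time.perf_counter() - start:.1f}s")
```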
works for me. thanks for looking into it!
Before merging, it would be great to test @hvanhovell's proposal of reconstructing the schema from the proto response every time and see what the impact is.
Co-authored-by: Martin Grund <[email protected]>
What changes were proposed in this pull request?
In Spark Connect, DataFrame.schema returns a deep copy of the schema to prevent unexpected behavior caused by user modifications to the returned schema object. However, if a user accesses df.schema repeatedly on a DataFrame with a complex schema, this leads to noticeable performance degradation.
The performance issue can be reproduced with the snippet below. Since copy.deepcopy is known to be slow on complex objects, this PR replaces it with pickle-based ser/de to improve the performance of df.schema access. Given the limitations of pickle, the implementation falls back to deepcopy in cases where pickling fails.
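The original reproduction snippet is not included here; the following is a minimal sketch of what such a reproduction might look like, assuming a Spark Connect server is reachable at sc://localhost and that a wide set of nested struct columns is enough to make the per-access copy cost visible.

```python
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

# Build a DataFrame with many nested struct columns so the schema is complex.
inner = ", ".join(f"'f{i}', {i}" for i in range(50))
cols = ", ".join(f"named_struct({inner}) AS c{j}" for j in range(50))
df = spark.sql(f"SELECT {cols}")

df.schema  # first access fetches and caches the schema

start = time.perf_counter()
for _ in range(100):
    df.schema  # each access returns a copy of the cached schema
print(f"100 schema accesses took {time.perf_counter() - start:.1f}s")
```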
The existing approach needs 21.9s to copy the schema 100 times, while the updated approach needs only 2.0s for the same 100 copies.
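Independent of a running server, the underlying difference can be illustrated with a small sketch (hypothetical schema shape, not the PR's actual measurement) comparing copy.deepcopy against a cached pickle round trip.

```python
import copy
import time

from pyspark.serializers import CPickleSerializer
from pyspark.sql.types import StringType, StructField, StructType

# Hypothetical nested schema: 200 struct columns with 20 fields each.
inner = StructType([StructField(f"f{i}", StringType()) for i in range(20)])
schema = StructType([StructField(f"c{i}", inner) for i in range(200)])

start = time.perf_counter()
for _ in range(100):
    copy.deepcopy(schema)
print(f"deepcopy:     {time.perf_counter() - start:.2f}s")

ser = CPickleSerializer()
blob = ser.dumps(schema)  # pay the serialization cost once
start = time.perf_counter()
for _ in range(100):
    ser.loads(blob)  # each copy is a single loads() call
print(f"pickle loads: {time.perf_counter() - start:.2f}s")
```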
Why are the changes needed?
It improves the performance when calling df.schema many times.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing tests and new tests.
Was this patch authored or co-authored using generative AI tooling?
No.