
Spark: Avoid closing deserialized copies of shared resources like FileIO #12868


Open · wants to merge 1 commit into main from fix-close-on-executor

Conversation

@xiaoxuandev (Contributor) commented Apr 22, 2025

This change prevents close() from being called on FileIO during cleanup of Spark's broadcast variable, i.e., when memoryStore.remove(blockId) is invoked. Closing a deserialized FileIO can unintentionally shut down shared resources, such as a shared connection pool, when S3FileIO is backed by the Apache HttpClient: calling close() triggers shutdown of the HttpClientConnectionManager, leading to request failures in other instances that are still in use.
Fixes: #12858, #12046
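The ownership problem the description raises can be illustrated with a small sketch (class and method names here are hypothetical, not Iceberg's actual implementation): a transient flag is true only on the original instance that owns the shared client, so a copy that round-trips through Java serialization, as Spark's broadcast mechanism does, comes back with the flag reset and can skip the real shutdown in close().

```java
import java.io.*;

// Hypothetical sketch: SharedClientHolder stands in for a FileIO-like
// resource whose deserialized copies must not tear down the shared pool.
class SharedClientHolder implements Serializable, Closeable {
    private static final long serialVersionUID = 1L;

    // transient: not written to the stream, so it is false (the default)
    // on every deserialized copy.
    private transient boolean ownsClient;

    SharedClientHolder() {
        this.ownsClient = true; // the original, driver-side instance owns the client
    }

    boolean ownsClient() {
        return ownsClient;
    }

    @Override
    public void close() {
        if (ownsClient) {
            // Only the owning instance would shut down the connection pool here;
            // executor-side copies fall through and leave the pool alone.
        }
    }
}

public class Demo {
    public static void main(String[] args) throws Exception {
        SharedClientHolder original = new SharedClientHolder();

        // Round-trip through Java serialization, as broadcast does.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(original);
        }
        SharedClientHolder copy;
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            copy = (SharedClientHolder) in.readObject();
        }

        System.out.println(original.ownsClient()); // true
        System.out.println(copy.ownsClient());     // false: transient flag reset
    }
}
```

The same effect can also be achieved with a readObject/readResolve hook; the transient default is simply the least code.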

@xiaoxuandev force-pushed the fix-close-on-executor branch from 035f3b8 to 4daa941 on April 22, 2025 at 19:15
@bk-mz left a comment

👍

Comment on lines -68 to +69:

```diff
-    LOG.info("Releasing resources");
-    io().close();
+    LOG.info("Executor-side cleanup: closing deserialized table resources");
```
Contributor

Is there a way to ensure that the io, and hence the pool, is eventually closed?

Contributor (Author)

Thanks for the review! Based on code inspection and some local debugging, Iceberg doesn't explicitly call close() on most FileIO instances (i.e., the regular, non-deserialized ones). The exception is S3FileIO, which overrides finalize() to invoke close() during garbage collection, and this applies to deserialized copies as well.
Given that the underlying connection pool may be shared, we might want to consider removing finalize() from S3FileIO to avoid unintended side effects during garbage collection. cc: @rdblue @aokolnychyi

Side note: finalize() was deprecated in Java 9 due to potential performance issues, deadlocks, and unpredictable behavior during GC.
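If finalize() were removed, the JDK's recommended replacement is java.lang.ref.Cleaner. The sketch below (class names are illustrative, not code from this PR) shows the pattern: close() releases the resource deterministically, and the registered cleanup action only runs as a GC-time fallback.

```java
import java.lang.ref.Cleaner;

// Sketch only: Cleaner-based cleanup instead of overriding finalize().
class PooledClient implements AutoCloseable {
    private static final Cleaner CLEANER = Cleaner.create();

    // The cleanup action lives in a static nested class so it cannot hold a
    // reference back to PooledClient, which would prevent collection.
    static final class PoolState implements Runnable {
        volatile boolean open = true;

        @Override
        public void run() {
            // Stand-in for shutting down the shared connection pool.
            open = false;
        }
    }

    private final PoolState state = new PoolState();
    private final Cleaner.Cleanable cleanable = CLEANER.register(this, state);

    boolean isOpen() {
        return state.open;
    }

    @Override
    public void close() {
        // Deterministic release; Cleanable.clean() runs the action at most once,
        // so a later GC-triggered invocation is a no-op.
        cleanable.clean();
    }
}

public class CleanerDemo {
    public static void main(String[] args) {
        PooledClient client = new PooledClient();
        System.out.println(client.isOpen()); // true
        client.close();
        System.out.println(client.isOpen()); // false
    }
}
```

Note that a Cleaner alone would not fix this issue: a deserialized copy registered with a Cleaner would still release the shared pool when collected, so the ownership check from this change remains necessary.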

@mgmarino
Thanks, @xiaoxuandev. This looks to work similarly to the PR that I opened to fix this issue (#12129), but I was unable to come up with tests there. Would love to get this in.

Successfully merging this pull request may close these issues.

[SparkMicroBatchStream] Executors prematurely close I/O client during Spark broadcast cleanup