Inject SDK-side flattens while handling input/output coder mismatch in flattens. #34641

shunping · 2025-04-16T03:26:14Z

Following the idea of identity transform (#32930 (comment)) and valuable discussions with @lostluck, I have developed a "simpler and more general fix" for Prism runner issues involving Flatten. This approach works not only for Flatten->Flatten scenarios, but also for others like GroupBy->Flatten.

The key observations are:

Prism's Runner Flatten only passes data through from its input PCollections.
No subsequent step within the runner enforces the correct output coder for this Runner Flatten output.

Previous approaches (including #34602) attempted to fix this by overwriting the upstream coders with the Flatten output coder. This works in many scenarios, especially when the upstream transform is an SDK-side transform or another Flatten (where #34602 ensured coder propagation).

However, modifying upstream PCollection coders can cause side effects.

In the GroupBy->Flatten example (org.apache.beam.sdk.transforms.FlattenTest.testFlattenWithDifferentInputAndOutputCoders2), changing GroupBy's coder has no effect on the actual encoded data, because GroupBy always generates K,Iterable<V>. Downstream transforms expecting data encoded with the Flatten's coder then fail during decoding.
Another example (see issue [Bug]: Prism crashed if Flatten and GroupByKey share the same input #34643) involves a pipeline like C = [A, B] | Flatten(), D = A | GroupByKey(). If Flatten overwrites the coder associated with PCollection A, the GroupByKey on A may fail due to coder incompatibility.
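
The shared-input side effect described above can be illustrated with a toy model. This is only a sketch and not Prism code: the `consumer` type, the `brokenConsumers` helper, and the coder names are hypothetical, and coders are reduced to plain string labels.

```go
package main

import "fmt"

// consumer is a hypothetical stand-in for a transform that reads one
// PCollection and expects a particular coder on it.
type consumer struct {
	name      string
	input     string
	wantCoder string
}

// brokenConsumers returns the names of consumers whose expected coder no
// longer matches the coder recorded for their input PCollection.
func brokenConsumers(coders map[string]string, cs []consumer) []string {
	var broken []string
	for _, c := range cs {
		if coders[c.input] != c.wantCoder {
			broken = append(broken, c.name)
		}
	}
	return broken
}

func main() {
	// PCollection A feeds both a Flatten and a GroupByKey.
	coders := map[string]string{"A": "kv_coder", "B": "kv_coder"}
	consumers := []consumer{{name: "GroupByKey", input: "A", wantCoder: "kv_coder"}}

	fmt.Println(brokenConsumers(coders, consumers)) // nothing is broken yet

	// Overwriting A's coder with the Flatten output coder (the earlier
	// approach) also changes what the GroupByKey sees on A.
	coders["A"] = "flatten_out_coder"
	fmt.Println(brokenConsumers(coders, consumers)) // GroupByKey is now reported
}
```

The point of the sketch is only that a coder attached to a PCollection is shared state: any rewrite done for one consumer is visible to every other consumer of that PCollection.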

The proposed solution avoids the pitfalls of modifying upstream coders. Instead, it inserts an identity transform (specifically, an SDK-side Flatten) before the Runner Flatten. This identity transform implicitly converts the input PCollection's elements to use the same output coder as the target Runner Flatten, so that the data it emits is correctly encoded.
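
A minimal sketch of the injection idea, under stated assumptions: the `graph`, `transform`, and `pcollection` types and the `injectIdentityFlattens` helper are hypothetical stand-ins rather than Prism's real data structures, the `fc_` PCollection naming is invented for illustration (only the `ft_` transform prefix appears in this PR's diff), and whether the real preprocessor injects for every input or only for mismatched ones is not stated here, so the sketch assumes the latter.

```go
package main

import "fmt"

// pcollection is a toy stand-in for a PCollection; only its coder matters here.
type pcollection struct {
	coderID string
}

// transform is a toy stand-in for a PTransform with a single output.
type transform struct {
	kind   string   // e.g. "runner_flatten" or "sdk_flatten"
	inputs []string // input PCollection IDs
	output string   // output PCollection ID
}

// graph holds the toy pipeline components keyed by ID.
type graph struct {
	pcols      map[string]*pcollection
	transforms map[string]*transform
}

// injectIdentityFlattens puts a single-input SDK-side flatten in front of each
// input of the given flatten whose coder differs from the flatten's output
// coder. Upstream PCollections and their coders are left untouched.
// It returns the IDs of the injected transforms.
func injectIdentityFlattens(g *graph, flattenID string) []string {
	ft := g.transforms[flattenID]
	outCoder := g.pcols[ft.output].coderID
	var injected []string
	for i, in := range ft.inputs {
		if g.pcols[in].coderID == outCoder {
			continue // already encoded the way the flatten output expects
		}
		tid := fmt.Sprintf("ft_%s_%d", flattenID, i) // injected transform ID
		pid := fmt.Sprintf("fc_%s_%d", flattenID, i) // hypothetical new PCollection ID
		g.pcols[pid] = &pcollection{coderID: outCoder}
		g.transforms[tid] = &transform{kind: "sdk_flatten", inputs: []string{in}, output: pid}
		ft.inputs[i] = pid // the runner flatten now reads the re-encoded data
		injected = append(injected, tid)
	}
	return injected
}

func main() {
	g := &graph{
		pcols: map[string]*pcollection{
			"A": {coderID: "out_coder"},
			"B": {coderID: "other_coder"},
			"C": {coderID: "out_coder"},
		},
		transforms: map[string]*transform{
			"flatten": {kind: "runner_flatten", inputs: []string{"A", "B"}, output: "C"},
		},
	}
	fmt.Println(injectIdentityFlattens(g, "flatten"))
	fmt.Println(g.transforms["flatten"].inputs)
	fmt.Println(g.pcols["B"].coderID) // B keeps its original coder
}
```

In this toy graph only B gets an identity transform in front of it, and any other consumer of B still sees B's original coder.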

fixes #32930
fixes #34643

The PR also fixes the flaky test Test_preprocessor_preProcessGraph/ignoreEmptyAndIdentityTransform in https://github.com/apache/beam/actions/workflows/beam_PreCommit_Go.yml.

github-actions · 2025-04-16T04:08:09Z

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @jrmccluskey for label go.

Available commands:

stop reviewer notifications - opt out of the automated review tooling
remind me after tests pass - tag the comment author after tests pass
waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

lostluck

I like how clean this ended up. Great work.

lostluck · 2025-04-16T16:07:56Z

sdks/go/pkg/beam/runners/prism/internal/handlerunner.go

@@ -88,8 +88,7 @@ func (h *runner) PrepareTransform(tid string, t *pipepb.PTransform, comps *pipep
 }

 func (h *runner) handleFlatten(tid string, t *pipepb.PTransform, comps *pipepb.Components) prepareResult {
-	if !h.config.SDKFlatten {
-		t.EnvironmentId = ""         // force the flatten to be a runner transform due to configuration.
+	if !h.config.SDKFlatten && !strings.HasPrefix(tid, "ft_") {


I'll note that there's no user-serviceable way to do these configurations at the moment, and it really was a hard binary. It would be acceptable to remove the SDKFlatten option in favour of just a single approach that biases to runner flattens, but does the SDK flatten to get around these issues.
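
For readers skimming the hunk, the new condition can be read as a small predicate. This is a sketch only: `forceRunnerFlatten` and `injectedFlattenPrefix` are hypothetical names, and the only things taken from the diff are the `SDKFlatten` setting and the `"ft_"` prefix check on the transform ID.

```go
package main

import (
	"fmt"
	"strings"
)

// injectedFlattenPrefix is the transform-ID prefix checked in the diff above.
const injectedFlattenPrefix = "ft_"

// forceRunnerFlatten reports whether a flatten should be turned into a runner
// transform: only when SDK flattens are not configured AND the transform is
// not one of the injected flattens (recognised here by its ID prefix).
func forceRunnerFlatten(sdkFlatten bool, tid string) bool {
	return !sdkFlatten && !strings.HasPrefix(tid, injectedFlattenPrefix)
}

func main() {
	fmt.Println(forceRunnerFlatten(false, "e4"))    // true: ordinary flatten, runner side
	fmt.Println(forceRunnerFlatten(false, "ft_e4")) // false: injected flatten stays SDK side
	fmt.Println(forceRunnerFlatten(true, "e4"))     // false: SDK flattens configured
}
```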

lostluck · 2025-04-16T21:21:02Z

sdks/go/pkg/beam/runners/prism/internal/preprocess.go

@@ -492,6 +492,9 @@ func finalizeStage(stg *stage, comps *pipepb.Components, pipelineFacts *fusionFa
 	}

 	stg.internalCols = internal
+	// Sort the keys of internal producers (from stageFacts.PcolProducers)
+	// to ensure deterministic order for stable tests.
+	sort.Strings(stg.internalCols)


Good find! I thought I had everything deterministic already.
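
For context on why the sort is needed: Go does not define an iteration order for maps, so any slice built by ranging over a map can come out in a different order on each run. A minimal sketch of the usual remedy follows; `sortedKeys` and the sample map are hypothetical and not taken from preprocess.go.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys collects the keys of m and sorts them, so callers see the same
// order on every run regardless of map iteration order.
func sortedKeys(m map[string]string) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	// Hypothetical PCollection-to-producer mapping.
	producers := map[string]string{"pcol_c": "t3", "pcol_a": "t1", "pcol_b": "t2"}
	fmt.Println(sortedKeys(producers)) // [pcol_a pcol_b pcol_c]
}
```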

shunping added 2 commits April 15, 2025 23:22

A more general fix on handling flatten by injecting sdk-side flatten.

e7297b6

Re-enable a previously failed flatten test in java.

af47d01

github-actions bot added go runners prism labels Apr 16, 2025

shunping marked this pull request as ready for review April 16, 2025 04:02

shunping requested a review from lostluck April 16, 2025 04:03

github-actions bot added the Next Action: Reviewers label Apr 16, 2025

shunping self-assigned this Apr 16, 2025

shunping mentioned this pull request Apr 16, 2025

[Bug]: Prism crashed if Flatten and GroupByKey share the same input #34643

Closed

17 tasks

shunping changed the title ~~Inject SDK-side flattens while handling coder mismatch in flattens.~~ Inject SDK-side flattens while handling input/output coder mismatch in flattens. Apr 16, 2025

Add a new test to cover another case that would crash prism prior to …

df362c7

…the fix. The test is also included in the test suite of flink, samza and spark, but without transcoding until their corresponding FRs are resolved.

github-actions bot added python and removed python labels Apr 16, 2025

Fix a flaky test by sorting the keys of internal producers.

369859b

shunping force-pushed the prism-sdk-side-flatten-injection branch from 9d6ec0a to 369859b Compare April 16, 2025 17:11

github-actions bot added python and removed python labels Apr 16, 2025

Skip the new flatten-gbk test in flink runner.

52827b7

github-actions bot added python and removed python labels Apr 16, 2025

lostluck approved these changes Apr 16, 2025

lostluck merged commit 4ead940 into apache:master Apr 16, 2025
108 checks passed

This was referenced Apr 18, 2025

The PostCommit Python ValidatesRunner Spark job is flaky #30645

Closed

The PostCommit Python ValidatesRunner Samza job is flaky #30657

Closed
