Implement randomizationFactor in ExponentialBackOffWithMaxRetries #3849

JoosJuliet · 2025-04-17T09:14:14Z

hello

Expected Behavior

The ExponentialBackOffWithMaxRetries class should allow for randomized backoff intervals to prevent simultaneous retries from multiple instances, which can overload servers. This randomness should be adjustable via a randomizationFactor to provide flexibility in how the backoff intervals are calculated.

Current Behavior

Currently, the ExponentialBackOffWithMaxRetries class calculates backoff intervals based solely on fixed exponential factors without any randomness. This can lead to predictable and synchronized retries in distributed systems, potentially causing spikes in load and collision risks.

Context

To enhance the stability and fairness of our Kafka message processing system, we need to address a key challenge: consumer pod starvation caused by synchronized retries when using RecoveringBatchErrorHandler. The predictable nature of the current retry intervals is the root cause of this synchronization, leading to uneven load distribution.

The core solution is introducing a randomizationFactor within the ExponentialBackOffWithMaxRetries class. This element of randomness adds necessary jitter, fine-tuning each pod's retry interval to prevent simultaneous attempts and distribute them over time.

Consequently, this enhancement will effectively prevent pod starvation and ensure message consumption opportunities are distributed more evenly across all nodes. This stabilizes system load and significantly improves the resilience of our Kafka processing architecture.

Proposed Code Changes

Here is the proposed enhancement to the ExponentialBackOffWithMaxRetries class:

import org.springframework.util.backoff.ExponentialBackOff;

public class ExponentialBackOffWithMaxRetries extends ExponentialBackOff {

    private final int maxRetries;

    // Defaults to 0.0 (no randomization) for backward compatibility.
    private double randomizationFactor = 0.0;

    public ExponentialBackOffWithMaxRetries(int maxRetries) {
        this.maxRetries = maxRetries;
        calculateMaxElapsed();
    }

    public void setRandomizationFactor(double randomizationFactor) {
        if (randomizationFactor < 0 || randomizationFactor > 1) {
            throw new IllegalArgumentException("Randomization factor must be between 0 and 1");
        }
        this.randomizationFactor = randomizationFactor;
        calculateMaxElapsed();
    }

    // Sums the intervals for maxRetries attempts (jittered after the first one)
    // and uses the total as the max elapsed time.
    private void calculateMaxElapsed() {
        long maxInterval = getMaxInterval();
        // Deterministic base interval: it grows by the multiplier and is never
        // derived from a jittered value, so the jitter does not compound.
        long current = Math.min(getInitialInterval(), maxInterval);
        long maxElapsed = current;
        for (int i = 1; i < this.maxRetries; i++) {
            current = Math.min((long) (current * getMultiplier()), maxInterval);
            maxElapsed += applyRandomization(current);
        }
        super.setMaxElapsedTime(maxElapsed);
    }

    // Scales the interval by a factor drawn uniformly from
    // [1 - randomizationFactor, 1 + randomizationFactor).
    private long applyRandomization(long interval) {
        double randomMultiplier = (1 - this.randomizationFactor) + Math.random() * 2 * this.randomizationFactor;
        return (long) (interval * randomMultiplier);
    }
}
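
As a minimal sketch of how the jitter formula in applyRandomization could be applied to each retry interval while the base schedule stays deterministic, here is a self-contained example that uses only the JDK. The class JitteredBackOffSketch and its methods baseIntervals and applyJitter are hypothetical names for illustration; they are not part of Spring Kafka or Spring Framework.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; not an existing Spring class.
public class JitteredBackOffSketch {

    /** Deterministic schedule: initial, initial * multiplier, ... capped at maxInterval. */
    public static List<Long> baseIntervals(long initialInterval, double multiplier,
            long maxInterval, int maxRetries) {
        List<Long> intervals = new ArrayList<>();
        long current = Math.min(initialInterval, maxInterval);
        for (int i = 0; i < maxRetries; i++) {
            intervals.add(current);
            current = Math.min((long) (current * multiplier), maxInterval);
        }
        return intervals;
    }

    /** Scales one interval by a factor drawn uniformly from [1 - f, 1 + f). */
    public static long applyJitter(long interval, double randomizationFactor, Random random) {
        if (randomizationFactor < 0 || randomizationFactor > 1) {
            throw new IllegalArgumentException("Randomization factor must be between 0 and 1");
        }
        double factor = (1 - randomizationFactor) + random.nextDouble() * 2 * randomizationFactor;
        return (long) (interval * factor);
    }

    public static void main(String[] args) {
        Random random = new Random();
        // Each pod would draw its own jittered value from the same base schedule.
        for (long base : baseIntervals(1000, 2.0, 10_000, 5)) {
            System.out.println(base + " ms -> " + applyJitter(base, 0.5, random) + " ms");
        }
    }
}
```

Because the base interval never depends on a previously jittered value, each delay stays within the same relative band around the deterministic schedule, and a randomizationFactor of 0 reproduces that schedule exactly.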

sobychacko · 2025-04-17T14:19:43Z

@JoosJuliet This sounds like a generic matter and nothing specific to Spring Kafka. I wonder if we can discuss these types of things in the context of the Spring Retry project. cc @artembilan for his insights.

artembilan · 2025-04-17T16:52:14Z

The ExponentialBackOffWithMaxRetries class is based on the ExponentialBackOff from Spring Framework - nothing to do with Spring Retry.
You can inject your own implementation whenever we ask for a BackOff contract.

There is one in Spring Retry, though:

/**
 * Implementation of {@link org.springframework.retry.backoff.ExponentialBackOffPolicy}
 * that chooses a random multiple of the interval that would come from a simple
 * deterministic exponential. The random multiple is uniformly distributed between 1 and
 * the deterministic multiplier (so in practice the interval is somewhere between the next
 * and next but one intervals in the deterministic case). This is often referred to as
 * jitter.
 *
 * This has shown to at least be useful in testing scenarios where excessive contention is
 * generated by the test needing many retries. In test, usually threads are started at the
 * same time, and thus stomp together onto the next interval. Using this
 * {@link BackOffPolicy} can help avoid that scenario.
 *
 * Example: initialInterval = 50 multiplier = 2.0 maxInterval = 3000 numRetries = 5
 *
 * {@link ExponentialBackOffPolicy} yields: [50, 100, 200, 400, 800]
 *
 * {@link ExponentialRandomBackOffPolicy} may yield [76, 151, 304, 580, 901] or [53, 190,
 * 267, 451, 815] (random distributed values within the ranges of [50-100, 100-200,
 * 200-400, 400-800, 800-1600])
 *
 * @author Jon Travis
 * @author Dave Syer
 * @author Chase Diem
 */
@SuppressWarnings("serial")
public class ExponentialRandomBackOffPolicy extends ExponentialBackOffPolicy {

I think the idea of jitter is OK, and we can indeed implement it here as you suggest, but let's first see whether moving it up to Spring Framework would serve a broader set of users.
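
To make the Javadoc's description concrete, here is a minimal, JDK-only sketch of "a random multiple uniformly distributed between 1 and the deterministic multiplier". It is written from the Javadoc text alone and is not the Spring Retry implementation; the class RandomMultipleSketch and the method jitteredIntervals are hypothetical names, and capping the result at maxInterval is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration based on the Javadoc description; not Spring Retry code.
public class RandomMultipleSketch {

    /**
     * For each retry, multiplies the deterministic interval by a value drawn
     * uniformly from [1, multiplier), so the result lies between the current
     * and the next deterministic interval.
     */
    public static long[] jitteredIntervals(long initialInterval, double multiplier,
            long maxInterval, int numRetries, Random random) {
        long[] result = new long[numRetries];
        long base = Math.min(initialInterval, maxInterval);
        for (int i = 0; i < numRetries; i++) {
            double randomMultiple = 1 + random.nextDouble() * (multiplier - 1);
            // Assumption of this sketch: never exceed maxInterval.
            result[i] = Math.min((long) (base * randomMultiple), maxInterval);
            base = Math.min((long) (base * multiplier), maxInterval);
        }
        return result;
    }

    public static void main(String[] args) {
        // Same parameters as the Javadoc example: values fall within
        // [50-100, 100-200, 200-400, 400-800, 800-1600].
        long[] intervals = jitteredIntervals(50, 2.0, 3000, 5, new Random());
        System.out.println(Arrays.toString(intervals));
    }
}
```

Unlike a symmetric randomization factor, this scheme only ever lengthens the delay relative to the deterministic schedule, never shortens it.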

JoosJuliet · 2025-04-18T02:13:30Z

I went ahead and created the corresponding issue in the Spring Framework repo here: spring-projects/spring-framework#34773.

Let me know if there's anything I can do to help drive it forward or make it easier to review.

artembilan · 2025-04-18T02:23:42Z

Thanks.
Subscribed.
We can decide what to do here once that issue reaches a conclusion.

JoosJuliet · 2025-04-18T02:36:25Z

Based on the direction in spring-framework#22009, it seems that Spring Framework is not planning to include randomization logic directly, and instead recommends using spring-retry for such cases.

That said, since spring-kafka already provides its own support for maxAttempts, perhaps it might be worth considering a similar built-in support for randomization—mainly for consistency within the project.

artembilan · 2025-04-21T13:49:59Z

I left a comment on that issue.
Until a decision is made there, there is no reason to rush here and risk duplicating API and work.

JoosJuliet added the "status: waiting-for-triage" and "type: enhancement" labels · 2025-04-17
