Redis Cluster Client Deadlock with custom SocketAddressResolver #3240

Open
henry701 opened this issue Apr 2, 2025 · 6 comments · May be fixed by #3243
Labels
status: mre-available Minimal Reproducible Example is available status: waiting-for-triage

Comments

@henry701

henry701 commented Apr 2, 2025

Bug Report

Reproduction Conditions

The deadlock occurs when connecting to a Redis cluster and the SocketAddressResolver throws an exception.

I'm using Spring Data, but this error is reproducible without using it as well.

Environment

  • Lettuce version(s): 6.5.5.RELEASE
  • Redis version: n/a (does not connect, nor needs to)

Logger Warning Stack Trace

The error that is silently caught but not properly handled:

java.lang.IllegalArgumentException: Cannot parse port number: $(INVALID_DATA):CONFIG
  at io.lettuce.core.internal.HostAndPort.parse(HostAndPort.java:98)
  at io.lettuce.core.internal.HostAndPort.of(HostAndPort.java:56)
  at io.lettuce.core.resource.MappingSocketAddressResolver.resolve(MappingSocketAddressResolver.java:97)
  at io.lettuce.core.cluster.topology.DefaultClusterTopologyRefresh.openConnections(DefaultClusterTopologyRefresh.java:312)
  at io.lettuce.core.cluster.topology.DefaultClusterTopologyRefresh.loadViews(DefaultClusterTopologyRefresh.java:99)
  at io.lettuce.core.cluster.RedisClusterClient.fetchPartitions(RedisClusterClient.java:1033)
  at io.lettuce.core.cluster.RedisClusterClient.loadPartitionsAsync(RedisClusterClient.java:985)
  at io.lettuce.core.cluster.RedisClusterClient.initializePartitions(RedisClusterClient.java:940)
  at io.lettuce.core.cluster.RedisClusterClient.getPartitions(RedisClusterClient.java:332)
  at org.springframework.data.redis.connection.lettuce.ClusterConnectionProvider.getConnectionAsync(ClusterConnectionProvider.java:100)
  at org.springframework.data.redis.connection.lettuce.ClusterConnectionProvider.getConnectionAsync(ClusterConnectionProvider.java:44)
  at org.springframework.data.redis.connection.lettuce.LettuceConnectionProvider.getConnection(LettuceConnectionProvider.java:53)
  at org.springframework.data.redis.connection.lettuce.LettucePoolingConnectionProvider.lambda$getConnection$0(LettucePoolingConnectionProvider.java:93)
  at io.lettuce.core.support.ConnectionPoolSupport$RedisPooledObjectFactory.create(ConnectionPoolSupport.java:211)
  at io.lettuce.core.support.ConnectionPoolSupport$RedisPooledObjectFactory.create(ConnectionPoolSupport.java:201)
  at org.apache.commons.pool2.BasePooledObjectFactory.makeObject(BasePooledObjectFactory.java:71)
  at org.apache.commons.pool2.impl.GenericObjectPool.create(GenericObjectPool.java:566)
  at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:306)
  at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:233)
  at io.lettuce.core.support.ConnectionPoolSupport$1.borrowObject(ConnectionPoolSupport.java:122)
  at io.lettuce.core.support.ConnectionPoolSupport$1.borrowObject(ConnectionPoolSupport.java:117)
  at org.springframework.data.redis.connection.lettuce.LettucePoolingConnectionProvider.getConnection(LettucePoolingConnectionProvider.java:99)
  at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$ExceptionTranslatingConnectionProvider.getConnection(LettuceConnectionFactory.java:1724)
  at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$SharedConnection.getNativeConnection(LettuceConnectionFactory.java:1528)
  at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$SharedConnection.lambda$getConnection$0(LettuceConnectionFactory.java:1508)
  at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.doInLock(LettuceConnectionFactory.java:1469)
  at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$SharedConnection.getConnection(LettuceConnectionFactory.java:1505)
  at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getSharedClusterConnection(LettuceConnectionFactory.java:1205)
  at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getClusterConnection(LettuceConnectionFactory.java:1016)
  at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getConnection(LettuceConnectionFactory.java:994)
  at org.springframework.data.redis.core.RedisConnectionUtils.fetchConnection(RedisConnectionUtils.java:195)
  at org.springframework.data.redis.core.RedisConnectionUtils.doGetConnection(RedisConnectionUtils.java:144)
  at org.springframework.data.redis.core.RedisConnectionUtils.getConnection(RedisConnectionUtils.java:105)
  at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:383)
  at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:363)
  at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:350)
  at com.my.corporate.service.package.data.MyCorporateClass.init(MyCorporateClass.java:75)

Thread Dump Stack Trace

"main" #1 prio=5 os_prio=0 cpu=11312,57ms elapsed=59,27s tid=0x0000716cd00368f0 nid=0xc3d5f waiting on condition  [0x0000716cd750e000]
   java.lang.Thread.State: WAITING (parking)
  at jdk.internal.misc.Unsafe.park(java.base@17.0.14/Native Method)
  - parking to wait for  <0x000000061caf4d58> (a java.util.concurrent.CompletableFuture$Signaller)
  at java.util.concurrent.locks.LockSupport.park(java.base@17.0.14/LockSupport.java:211)
  at java.util.concurrent.CompletableFuture$Signaller.block(java.base@17.0.14/CompletableFuture.java:1864)
  at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@17.0.14/ForkJoinPool.java:3476)
  at java.util.concurrent.ForkJoinPool.managedBlock(java.base@17.0.14/ForkJoinPool.java:3447)
  at java.util.concurrent.CompletableFuture.waitingGet(java.base@17.0.14/CompletableFuture.java:1898)
  at java.util.concurrent.CompletableFuture.get(java.base@17.0.14/CompletableFuture.java:2072)
  at io.lettuce.core.cluster.RedisClusterClient.get(RedisClusterClient.java:961)
  at io.lettuce.core.cluster.RedisClusterClient.getPartitions(RedisClusterClient.java:332)
  at org.springframework.data.redis.connection.lettuce.ClusterConnectionProvider.getConnectionAsync(ClusterConnectionProvider.java:100)
  at org.springframework.data.redis.connection.lettuce.ClusterConnectionProvider.getConnectionAsync(ClusterConnectionProvider.java:44)
  at org.springframework.data.redis.connection.lettuce.LettuceConnectionProvider.getConnection(LettuceConnectionProvider.java:53)
  at org.springframework.data.redis.connection.lettuce.LettucePoolingConnectionProvider.lambda$getConnection$0(LettucePoolingConnectionProvider.java:93)
  at org.springframework.data.redis.connection.lettuce.LettucePoolingConnectionProvider$$Lambda$1849/0x0000716c4cd71300.get(Unknown Source)
  at io.lettuce.core.support.ConnectionPoolSupport$RedisPooledObjectFactory.create(ConnectionPoolSupport.java:211)
  at io.lettuce.core.support.ConnectionPoolSupport$RedisPooledObjectFactory.create(ConnectionPoolSupport.java:201)
  at org.apache.commons.pool2.BasePooledObjectFactory.makeObject(BasePooledObjectFactory.java:71)
  at org.apache.commons.pool2.impl.GenericObjectPool.create(GenericObjectPool.java:566)
  at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:306)
  at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:233)
  at io.lettuce.core.support.ConnectionPoolSupport$1.borrowObject(ConnectionPoolSupport.java:122)
  at io.lettuce.core.support.ConnectionPoolSupport$1.borrowObject(ConnectionPoolSupport.java:117)
  at org.springframework.data.redis.connection.lettuce.LettucePoolingConnectionProvider.getConnection(LettucePoolingConnectionProvider.java:99)
  at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$ExceptionTranslatingConnectionProvider.getConnection(LettuceConnectionFactory.java:1724)
  at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$SharedConnection.getNativeConnection(LettuceConnectionFactory.java:1528)
  at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$SharedConnection.lambda$getConnection$0(LettuceConnectionFactory.java:1508)
  at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$SharedConnection$$Lambda$1847/0x0000716c4cd70e80.get(Unknown Source)
  at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.doInLock(LettuceConnectionFactory.java:1469)
  at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$SharedConnection.getConnection(LettuceConnectionFactory.java:1505)
  at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getSharedClusterConnection(LettuceConnectionFactory.java:1205)
  at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getClusterConnection(LettuceConnectionFactory.java:1016)
  at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getConnection(LettuceConnectionFactory.java:994)
  at org.springframework.data.redis.core.RedisConnectionUtils.fetchConnection(RedisConnectionUtils.java:195)
  at org.springframework.data.redis.core.RedisConnectionUtils.doGetConnection(RedisConnectionUtils.java:144)
  at org.springframework.data.redis.core.RedisConnectionUtils.getConnection(RedisConnectionUtils.java:105)
  at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:383)
  at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:363)
  at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:350)
  at com.my.corporate.service.package.data.MyCorporateClass.init(MyCorporateClass.java:75)

Reproduction

I've created a repository here with the repro of this bug: https://github.com/henry701/lettuce-bug-report-infiwait

Pretty straightforward: run mvn compile exec:java, and the application will compile and then hang.

Investigation

After analyzing the code execution flow together with the stack trace and a thread dump, the following was determined:

  1. The application gets stuck indefinitely in getPartitions() which is waiting for a promise returned by initializePartitions()
  2. The Future returned by loadPartitionsAsync() is never completed
  3. This future depends on another future returned in loadViews(), which is also never completed
  4. The loadViews() future depends on a future from the ConnectionTracker class passed to the openConnections() method
  5. Root cause: In the error handling path, the catch block that wraps everything only logs a warning; it doesn't call tracker.addConnection(redisURI, sync) when errors occur in certain flows
  6. Specifically, the error happens at SocketAddress socketAddress = clientResources.socketAddressResolver().resolve(redisURI) when calling a custom implementation of SocketAddressResolver, which fails due to an invalid port format configuration in my case.

The issue lies in the openConnections method, where exceptions during address resolution are caught and logged, but the sync CompletableFuture is not properly completed or added to the tracker in this error path.
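
The failure mode can be reproduced in miniature without Lettuce. The following is a simplified model, not the library's code: parsePort, connectSwallowing, and timesOut are hypothetical names, and a timed get stands in for the untimed wait from the thread dump so that the example terminates.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SwallowedFailureDemo {

    // Hypothetical stand-in for address resolution: throws on a malformed port.
    static int parsePort(String port) {
        try {
            return Integer.parseInt(port);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Cannot parse port number: " + port, e);
        }
    }

    // Models the problematic error path: the exception is logged,
    // but the returned future is never completed.
    static CompletableFuture<Integer> connectSwallowing(String port) {
        CompletableFuture<Integer> sync = new CompletableFuture<>();
        try {
            sync.complete(parsePort(port));
        } catch (RuntimeException e) {
            System.err.println("Unable to connect: " + e.getMessage());
            // Nothing completes 'sync' here, so callers wait forever.
        }
        return sync;
    }

    // Returns true if the future is still incomplete after the given wait.
    static boolean timesOut(CompletableFuture<?> future, long millis) {
        try {
            future.get(millis, TimeUnit.MILLISECONDS);
            return false;
        } catch (TimeoutException e) {
            return true;
        } catch (ExecutionException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(timesOut(connectSwallowing("6379"), 200));                   // false
        System.out.println(timesOut(connectSwallowing("$(INVALID_DATA):CONFIG"), 200)); // true
    }
}
```

With an untimed get(), the second call would park the calling thread indefinitely, which matches the WAITING (parking) state in the thread dump.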

Possible Solution

Move the declaration of CompletableFuture<StatefulRedisConnection<String, String>> sync = new CompletableFuture<>() to the top of the loop in the openConnections method, and modify the catch block to contain:

catch (RuntimeException e) {
    String message = String.format("Unable to connect to [%s]", redisURI);
    logger.warn(message, e);
    sync.completeExceptionally(new RedisConnectionException(message, e));
    tracker.addConnection(redisURI, sync);
}

This ensures that even when there's a connection error, the future is properly completed exceptionally and added to the ConnectionTracker, completing its future exceptionally and preventing the deadlock.
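
The same idea can be sketched standalone, outside of Lettuce. This is an illustration under assumptions rather than the actual patch: parsePort, openAll, and outcome are hypothetical names, a plain list plus CompletableFuture.allOf stands in for the ConnectionTracker, and IllegalStateException stands in for RedisConnectionException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class FailFastDemo {

    // Hypothetical stand-in for address resolution: throws on a malformed port.
    static int parsePort(String port) {
        try {
            return Integer.parseInt(port);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Cannot parse port number: " + port, e);
        }
    }

    // The per-node future is declared before the try block, so the catch
    // block can complete it exceptionally and it is always registered.
    static CompletableFuture<Void> openAll(List<String> ports) {
        List<CompletableFuture<Integer>> tracked = new ArrayList<>();
        for (String port : ports) {
            CompletableFuture<Integer> sync = new CompletableFuture<>();
            try {
                sync.complete(parsePort(port));
            } catch (RuntimeException e) {
                String message = String.format("Unable to connect to [%s]", port);
                sync.completeExceptionally(new IllegalStateException(message, e));
            }
            tracked.add(sync);
        }
        return CompletableFuture.allOf(tracked.toArray(new CompletableFuture[0]));
    }

    // Summarizes how the aggregate future ended, with a bounded wait.
    static String outcome(CompletableFuture<?> future) {
        try {
            future.get(1, TimeUnit.SECONDS);
            return "completed";
        } catch (ExecutionException e) {
            return "failed: " + e.getCause().getMessage();
        } catch (TimeoutException e) {
            return "timed out";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        System.out.println(outcome(openAll(Arrays.asList("6379", "6380"))));
        System.out.println(outcome(openAll(Arrays.asList("6379", "$(INVALID_DATA):CONFIG"))));
    }
}
```

In this model the caller sees a failure immediately instead of waiting on a future that nobody will complete, which is the behavior the proposed catch block aims for.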

Additional Context

This bug causes applications to hang indefinitely when attempting to connect to a Redis cluster with invalid node configuration, which can occur in various scenarios such as:

  • Misconfigured environment variables
  • Incorrect template substitution in configuration files
  • Network or DNS resolution issues

The thread stays parked forever unless it is interrupted. As a result, when this runs on the main thread or on a thread that others wait on, such as a web request thread, the application becomes unresponsive and has to be restarted to recover.

@tishun tishun added status: waiting-for-triage status: mre-available Minimal Reproducible Example is available labels Apr 3, 2025
@tishun
Collaborator

tishun commented Apr 3, 2025

😲 This is - by far - the best issue report that I have ever read.

Huge appreciation for all the work done to analyse and document the issue!

@henry701
Author

henry701 commented Apr 3, 2025

Update: while concocting workarounds, I observed the same issue outside of the custom logic, because MappingSocketAddressResolver does this: HostAndPort hostAndPort = HostAndPort.of(redisURI.getHost(), redisURI.getPort());
(which throws an exception on invalid port number syntax and ends up in the same scenario)

@henry701
Author

henry701 commented Apr 3, 2025

Managed to work around the issue by copying the MappingSocketAddressResolver into a CustomMappingSocketAddressResolver class and ensuring it never throws. Instead of throwing, it logs an error and returns a dummy invalid hostname and port combination inside the network address. Works well enough for me until an official fix.
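
A rough sketch of the never-throwing idea, using only the JDK. It is not the CustomMappingSocketAddressResolver itself: resolveSafely and PLACEHOLDER are hypothetical names, the placeholder host and port are arbitrary choices, and in a real setup the same try/catch would sit inside a resolver's resolve(redisURI) method.

```java
import java.net.InetSocketAddress;
import java.net.SocketAddress;

public class NonThrowingResolverSketch {

    // Dummy target returned on failure. The ".invalid" TLD is reserved,
    // so this name is never expected to resolve to a real host.
    static final SocketAddress PLACEHOLDER =
            InetSocketAddress.createUnresolved("unresolvable.invalid", 1);

    // Never throws: a malformed or out-of-range port is logged and mapped
    // to the placeholder, so the failure surfaces later as a connect error.
    static SocketAddress resolveSafely(String host, String port) {
        try {
            int parsed = Integer.parseInt(port);
            return InetSocketAddress.createUnresolved(host, parsed);
        } catch (RuntimeException e) {
            System.err.println("Cannot resolve " + host + ":" + port + ": " + e.getMessage());
            return PLACEHOLDER;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolveSafely("redis-1", "6379"));
        System.out.println(resolveSafely("redis-1", "$(INVALID_DATA):CONFIG"));
    }
}
```

The trade-off is that a configuration error is reported as a failed connection to a bogus address rather than as a parse error, so the log line in the catch block is what preserves the real cause.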

I can open a PR during the weekend to fix and test this scenario if you want :)

@tishun
Collaborator

tishun commented Apr 3, 2025

"I can open a PR during the weekend to fix and test this scenario if you want :)"

Sure, if you have the code we'd love to see it contributed!

@gabsgomesabi

you rock @henry701

@henry701
Author

henry701 commented Apr 5, 2025

@tishun PR opened, cheers
