Add metrics for tracking total disconnected time and reconnection attempts #3220

ggivo · 2025-03-18T08:58:36Z

Description:
Introduces two new metrics to track the total time a connection remains disconnected until it is successfully reconnected and the number of reconnection attempts. The changes include:

New Metrics:
- lettuce.reconnection.inactive.duration
  - Description: Measures the time taken for a successful reconnection after a disconnection.
  - Type: Timer
- lettuce.reconnection.attempts
  - Description: Tracks the number of reconnection attempts made during a disconnection.
  - Type: Counter
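
A minimal sketch of how the two metrics relate, using only the JDK rather than the actual recorder added by this PR. The class name `ReconnectMetricsSketch`, its callbacks, and the injectable clock are all hypothetical: the counter increments on every attempt, and the timer accumulates the span between a disconnect and the next successful reconnect.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical recorder for illustration only; not part of the Lettuce API.
final class ReconnectMetricsSketch {

    private static final long NOT_DISCONNECTED = Long.MIN_VALUE;

    private final LongSupplier nanoClock; // e.g. System::nanoTime; injectable for tests
    private final AtomicLong attempts = new AtomicLong();      // lettuce.reconnection.attempts
    private final AtomicLong inactiveNanos = new AtomicLong(); // lettuce.reconnection.inactive.duration (total)
    private final AtomicLong reconnects = new AtomicLong();    // number of timer samples
    private final AtomicLong disconnectedAt = new AtomicLong(NOT_DISCONNECTED);

    ReconnectMetricsSketch(LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
    }

    // Remember when the connection went down; repeated calls keep the first timestamp.
    void onDisconnected() {
        disconnectedAt.compareAndSet(NOT_DISCONNECTED, nanoClock.getAsLong());
    }

    // Count every attempt, whether or not it succeeds.
    void onReconnectAttempt() {
        attempts.incrementAndGet();
    }

    // Record one timer sample covering the whole disconnected period.
    void onReconnected() {
        long start = disconnectedAt.getAndSet(NOT_DISCONNECTED);
        if (start != NOT_DISCONNECTED) {
            inactiveNanos.addAndGet(nanoClock.getAsLong() - start);
            reconnects.incrementAndGet();
        }
    }

    long attempts() {
        return attempts.get();
    }

    long reconnects() {
        return reconnects.get();
    }

    Duration totalInactive() {
        return Duration.ofNanos(inactiveNanos.get());
    }
}
```

In production code the clock would be `System::nanoTime`; the supplier is only there so the behavior can be checked deterministically.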

Impact:

Provides better insights into connection stability and reconnection performance.

ggivo · 2025-03-20T05:38:19Z

I'm considering adding a few more metrics, e.g.:

endpoint.command.queue - Gauge - tracks size of inflight commands (commands written to netty queue but not yet completed)
endpoint.command.buffer - Gauge - tracks size of buffered commands (when autoflush=false)
endpoint.disconnected.buffer - Gauge - tracks size of buffered commands during disconnect

Those are implementation-specific and relevant to DefaultEndpoint.
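
To make the gauge idea concrete, here is a rough sketch in plain Java. The registry class `EndpointGaugesSketch` is hypothetical and does not reflect how DefaultEndpoint or any metrics library is actually wired; it only shows that a gauge samples the current size of a queue on demand instead of accumulating a value.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntSupplier;

// Hypothetical gauge registry for illustration only; not part of the Lettuce API.
final class EndpointGaugesSketch {

    private final Map<String, IntSupplier> gauges = new LinkedHashMap<>();

    // A gauge is a name plus a function that reads the current value.
    void register(String name, IntSupplier currentValue) {
        gauges.put(name, currentValue);
    }

    // Read every gauge at the moment of the call.
    Map<String, Integer> sample() {
        Map<String, Integer> snapshot = new LinkedHashMap<>();
        gauges.forEach((name, supplier) -> snapshot.put(name, supplier.getAsInt()));
        return snapshot;
    }
}
```

Registering something like `gauges.register("endpoint.command.queue", queue::size)` would then report the queue size whenever the metrics are sampled.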

This raises some open questions:
As of now, CommandLatencyRecorder is responsible for gathering command latency metrics.

Do we continue with separate metrics recorders, for example, one for ConnectionMonitoring (used by ConnectionWatchdog to track inactive connection time) and another for DefaultEndpoint (tracking the size of internal queues), or do we have a single MetricsRecorder for both (ConnectionMonitoring, DefaultEndpoint)?

Do we want to be able to enable/disable connection-related and endpoint-related metrics separately?
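
One possible shape for discussion, sketched in plain Java: callers depend on narrow interfaces, a single object backs both, and each group has its own switch. All type and method names here are hypothetical and are not taken from the PR.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical narrow interface a connection watchdog could depend on.
interface ConnectionMetricsSketch {
    void reconnectAttempt();
}

// Hypothetical narrow interface an endpoint could depend on.
interface EndpointMetricsSketch {
    void commandQueueSize(int size);
}

// One recorder backs both interfaces; each metric group can be toggled independently.
final class CombinedRecorderSketch implements ConnectionMetricsSketch, EndpointMetricsSketch {

    private final boolean connectionEnabled;
    private final boolean endpointEnabled;
    private final AtomicLong attempts = new AtomicLong();
    private final AtomicInteger queueSize = new AtomicInteger(-1); // -1 means never recorded

    CombinedRecorderSketch(boolean connectionEnabled, boolean endpointEnabled) {
        this.connectionEnabled = connectionEnabled;
        this.endpointEnabled = endpointEnabled;
    }

    @Override
    public void reconnectAttempt() {
        if (connectionEnabled) {
            attempts.incrementAndGet();
        }
    }

    @Override
    public void commandQueueSize(int size) {
        if (endpointEnabled) {
            queueSize.set(size);
        }
    }

    long attempts() {
        return attempts.get();
    }

    int lastQueueSize() {
        return queueSize.get();
    }
}
```

With this layout the choice between one recorder and two becomes an implementation detail: the same interfaces could equally be backed by two separate objects.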

@tishun any opinion?

This commit adds the ability to listen for MIGRATING and MIGRATED messages and trigger extended command expiry timeouts during Redis shard migration.

Key changes:
- Enhanced RebindAwareConnectionWatchdog to detect MIGRATING/MIGRATED messages
- RebindAwareExpiryWriter to trigger timeout relaxation whenever a MIGRATING message is received

This feature allows commands to have relaxed timeouts during shard migration operations, preventing unnecessary timeouts when Redis is temporarily busy with migration tasks.
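
The timeout relaxation can be pictured roughly as follows. This is a sketch under stated assumptions, not the code in RebindAwareExpiryWriter: the class `RelaxedTimeoutSketch` is hypothetical, and it assumes that the relaxation is added on top of the base timeout and that a MIGRATED message ends the relaxed window immediately.

```java
import java.time.Duration;

// Hypothetical timeout policy for illustration only; not part of the Lettuce API.
final class RelaxedTimeoutSketch {

    private final Duration baseTimeout;
    private final Duration relaxation; // assumption: extra time granted while migrating
    private volatile boolean migrating;

    RelaxedTimeoutSketch(Duration baseTimeout, Duration relaxation) {
        this.baseTimeout = baseTimeout;
        this.relaxation = relaxation;
    }

    // React to the push message type; other message types are ignored.
    void onPushMessage(String type) {
        if ("MIGRATING".equals(type)) {
            migrating = true;
        } else if ("MIGRATED".equals(type)) {
            migrating = false;
        }
    }

    // Timeout to apply to a command issued right now.
    Duration effectiveTimeout() {
        return migrating ? baseTimeout.plus(relaxation) : baseTimeout;
    }
}
```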

ggivo force-pushed the lettuce-observability branch from d41a0f7 to e3fc47d Compare March 18, 2025 09:00

ggivo marked this pull request as draft March 20, 2025 05:22

ggivo force-pushed the lettuce-observability branch 3 times, most recently from 8914047 to 37492c4 Compare March 25, 2025 18:10

ggivo force-pushed the lettuce-observability branch from b672564 to bdb66b4 Compare April 14, 2025 14:44

ggivo force-pushed the lettuce-observability branch 2 times, most recently from f4b049f to baa5183 Compare May 12, 2025 16:32

tishun and others added 21 commits May 19, 2025 14:34

v0.1

1938b57

Simple reconnect now working

f7723c8

Bind address from message is now considered

97761fe

Self-register the handler

73f3834

Format code

cad3370

Filter push messages in a more stable way

e53dbf6

(very hacky) Relax command expire timers globally

04112c2

Configure if timeout relaxing should be applied

fc47155

Proper way to close channel

bd85245

Configure the timeout relaxing

c397419

Sequential handover implemented

465e249

Did not address formatting

b74a168

Prolong the rebind window for relaxed timeouts

bb1dd67

PubSub no longer required; CommandExpiryWriter is now channel aware; …

b5b2118

…Polishing

Use the new MOVING push message from the RE server

81f783f

Unit test was not chaining delegates in the same way that the RedisCli…

b86dfec

…ent/RedisClusterClient was

Fix REBIND message validation

b09a297

Fixed the expiry mechanism

f94cff0

Polishing

2bb04e8

Fix NPE.

27094c4

Seems like AttributeMap.attr is not accurate and actually returns null, causing some unit test failures.

Formatting

2610739

ggivo force-pushed the lettuce-observability branch from 8ea2520 to 8adf0ad Compare June 12, 2025 16:55

ggivo added 3 commits June 13, 2025 18:13

Fix Disabling relaxTimeouts after upgrade can interfere with an ongoi…

e0e0a60

…ng one from re-bind

Additional fix for timeout relaxing disabled

e8efa02

Fix push message listener registered multiple times after rebind.

392c406

ggivo force-pushed the lettuce-observability branch from 5153ac3 to 392c406 Compare June 16, 2025 15:21

Fix: Report correct command timeout when relaxTimeout is configured

d1869c5
