Skip to content

[BUG] Segment Replication stats throwing NPE when shards are unassigned or are in delayed allocation phase #11945

Closed
@shourya035

Description

@shourya035

Describe the bug

We are seeing NPEs coming up from the NodesStats API when there are nodes data nodes dropping out of the cluster because of resource constraints. NodesStats API fired at that point of time when there are shards getting unassigned from the data nodes (because of data nodes leaving the cluster), fails with this error:

java.lang.NullPointerException: Cannot invoke "org.opensearch.cluster.routing.AllocationId.getId()" because the return value of "org.opensearch.cluster.routing.ShardRouting.allocationId()" is null
    at org.opensearch.index.seqno.ReplicationTracker.lambda$isPrimaryRelocation$18(ReplicationTracker.java:1246)
    at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:178)
    at java.base/java.util.ArrayList$ArrayListSpliterator.tryAdvance(ArrayList.java:1602)
    at java.base/java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:129)
    at java.base/java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:527)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:513)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
    at java.base/java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:150)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.findAny(ReferencePipeline.java:652)
    at org.opensearch.index.seqno.ReplicationTracker.isPrimaryRelocation(ReplicationTracker.java:1247)
    at org.opensearch.index.seqno.ReplicationTracker.lambda$getSegmentReplicationStats$23(ReplicationTracker.java:1318)
    at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:178)
    at java.base/java.util.HashMap$EntrySpliterator.forEachRemaining(HashMap.java:1850)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
    at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
    at org.opensearch.index.seqno.ReplicationTracker.getSegmentReplicationStats(ReplicationTracker.java:1321)
    at org.opensearch.index.shard.IndexShard.getReplicationStatsForTrackedReplicas(IndexShard.java:3120)
    at org.opensearch.index.shard.IndexShard.getReplicationStats(IndexShard.java:3125)
    at org.opensearch.index.shard.IndexShard.segmentStats(IndexShard.java:1500)
    at org.opensearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:228)
    at org.opensearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:146)
    at org.opensearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:66)
    at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:495)
    at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:469)
    at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:456)
    at org.opensearch.security.ssl.transport.SecuritySSLRequestHandler.messageReceivedDecorate(SecuritySSLRequestHandler.java:224)
    at org.opensearch.security.transport.SecurityRequestHandler.messageReceivedDecorate(SecurityRequestHandler.java:323)
    at org.opensearch.security.ssl.transport.SecuritySSLRequestHandler.messageReceived(SecuritySSLRequestHandler.java:172)
    at org.opensearch.security.OpenSearchSecurityPlugin$6$1.messageReceived(OpenSearchSecurityPlugin.java:797)
    at org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptor$interceptHandler$1.messageReceived(RollupInterceptor.kt:113)
    at org.opensearch.performanceanalyzer.transport.PerformanceAnalyzerTransportRequestHandler.messageReceived(PerformanceAnalyzerTransportRequestHandler.java:43)
    at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106)
    at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:471)
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:917)
    at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)

This seems to be coming up from the SegmentReplicationStats code, specifically from this code block which tries to detect if a primary shard is being relocated by cross checking the current allocationId with all the allocationIds from the shard routing table.

private boolean isPrimaryRelocation(String allocationId) {
Optional<ShardRouting> shardRouting = routingTable.shards()
.stream()
.filter(routing -> routing.allocationId().getId().equals(allocationId))
.findAny();
return shardRouting.isPresent() && shardRouting.get().primary();
}

Related component

Storage

To Reproduce

N/A

Expected behavior

NodesStats API should not fail even during transient data node drops

Additional Details

Plugins
Please list all plugins currently enabled.

Screenshots
If applicable, add screenshots to help explain your problem.

Host/Environment (please complete the following information):

  • OS: [e.g. iOS]
  • Version [e.g. 22]

Additional context
Add any other context about the problem here.

Metadata

Metadata

Assignees

Labels

StorageIssues and PRs relating to data and metadata storagebugSomething isn't workinggood first issueGood for newcomerslow hanging fruit

Type

No type

Projects

Status

✅ Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions