Description
Describe the bug
We are seeing NPEs from the NodesStats API when data nodes drop out of the cluster because of resource constraints. A NodesStats API call fired while shards are being unassigned from those data nodes (because the data nodes are leaving the cluster) fails with this error:
java.lang.NullPointerException: Cannot invoke "org.opensearch.cluster.routing.AllocationId.getId()" because the return value of "org.opensearch.cluster.routing.ShardRouting.allocationId()" is null
at org.opensearch.index.seqno.ReplicationTracker.lambda$isPrimaryRelocation$18(ReplicationTracker.java:1246)
at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:178)
at java.base/java.util.ArrayList$ArrayListSpliterator.tryAdvance(ArrayList.java:1602)
at java.base/java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:129)
at java.base/java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:527)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:513)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
at java.base/java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:150)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.base/java.util.stream.ReferencePipeline.findAny(ReferencePipeline.java:652)
at org.opensearch.index.seqno.ReplicationTracker.isPrimaryRelocation(ReplicationTracker.java:1247)
at org.opensearch.index.seqno.ReplicationTracker.lambda$getSegmentReplicationStats$23(ReplicationTracker.java:1318)
at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:178)
at java.base/java.util.HashMap$EntrySpliterator.forEachRemaining(HashMap.java:1850)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
at org.opensearch.index.seqno.ReplicationTracker.getSegmentReplicationStats(ReplicationTracker.java:1321)
at org.opensearch.index.shard.IndexShard.getReplicationStatsForTrackedReplicas(IndexShard.java:3120)
at org.opensearch.index.shard.IndexShard.getReplicationStats(IndexShard.java:3125)
at org.opensearch.index.shard.IndexShard.segmentStats(IndexShard.java:1500)
at org.opensearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:228)
at org.opensearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:146)
at org.opensearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:66)
at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:495)
at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:469)
at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:456)
at org.opensearch.security.ssl.transport.SecuritySSLRequestHandler.messageReceivedDecorate(SecuritySSLRequestHandler.java:224)
at org.opensearch.security.transport.SecurityRequestHandler.messageReceivedDecorate(SecurityRequestHandler.java:323)
at org.opensearch.security.ssl.transport.SecuritySSLRequestHandler.messageReceived(SecuritySSLRequestHandler.java:172)
at org.opensearch.security.OpenSearchSecurityPlugin$6$1.messageReceived(OpenSearchSecurityPlugin.java:797)
at org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptor$interceptHandler$1.messageReceived(RollupInterceptor.kt:113)
at org.opensearch.performanceanalyzer.transport.PerformanceAnalyzerTransportRequestHandler.messageReceived(PerformanceAnalyzerTransportRequestHandler.java:43)
at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106)
at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:471)
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:917)
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
This seems to come from the SegmentReplicationStats code, specifically from this code block, which tries to detect whether a primary shard is being relocated by cross-checking the current allocationId against all the allocationIds from the shard routing table.
OpenSearch/server/src/main/java/org/opensearch/index/seqno/ReplicationTracker.java
Lines 1233 to 1239 in 774e7d4
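For illustration, here is a minimal sketch of how that lookup could tolerate routing entries whose allocationId is null, which the exception message shows can happen and which presumably corresponds to shards that became unassigned. `StubAllocationId`, `StubShardRouting`, and `NullSafeRelocationCheck` are hypothetical stand-ins written for this example, not the real OpenSearch classes, and this is not necessarily how the actual fix should look.

```java
import java.util.List;

// Hypothetical stand-in for org.opensearch.cluster.routing.AllocationId.
final class StubAllocationId {
    private final String id;

    StubAllocationId(String id) {
        this.id = id;
    }

    String getId() {
        return id;
    }
}

// Hypothetical stand-in for ShardRouting; allocationId may be null,
// as the NPE message in this report indicates.
final class StubShardRouting {
    private final StubAllocationId allocationId;
    private final boolean primary;

    StubShardRouting(StubAllocationId allocationId, boolean primary) {
        this.allocationId = allocationId;
        this.primary = primary;
    }

    StubAllocationId allocationId() {
        return allocationId;
    }

    boolean primary() {
        return primary;
    }
}

final class NullSafeRelocationCheck {
    private NullSafeRelocationCheck() {}

    // Returns true if a routing entry with the given allocation ID exists and is a primary.
    // Entries without an allocation ID are skipped instead of being dereferenced.
    static boolean isPrimaryRelocation(List<StubShardRouting> shards, String allocationId) {
        return shards.stream()
            .filter(routing -> routing.allocationId() != null)
            .filter(routing -> routing.allocationId().getId().equals(allocationId))
            .findAny()
            .map(StubShardRouting::primary)
            .orElse(false);
    }
}
```

With the null filter in place, a routing table that contains an entry without an allocation ID yields `false` for that entry rather than throwing, so a stats request issued while shards are being unassigned would not fail at this step.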
Related component
Storage
To Reproduce
N/A
Expected behavior
The NodesStats API should not fail, even during transient data node drops.