Description
What is Sampling?
Sampling is a statistical method for choosing a subgroup that represents the entire population well enough that findings from the subgroup can be extrapolated to the whole.
Why do we need Sampling?
As we consider the instrumentation of generic constructs such as rest controllers, transport actions, and task managers, a notable issue arises. Our system has more than 200 transport actions and a variety of rest actions, which can generate a very large number of spans, and each span incurs a non-negligible cost. Hence, it is crucial to develop a Sampling strategy that selects a subset of spans from the total pool while still representing the entire population or system.
Different Sampling Options
There are two distinct sampling options available in OpenTelemetry.
Head Sampling: Head sampling takes place just before the creation of a span. Since the decision needs to be made as early as possible, it relies on arbitrary factors like randomly selecting a percentage of spans. While this is a straightforward technique to implement, it lacks request/trace-specific data, which may limit its ability to make intelligent decisions.
Tail Sampling: In contrast to head sampling, tail sampling makes its decision at the end of the entire trace, when data from all spans is available. This type of sampling occurs at the collector level. Here, sampling can be based on criteria such as latency, attribute values, and status (e.g., error, success). One major advantage is static stability, which matters especially when we are measuring resource consumption. One challenge with this strategy is that all spans of a trace must be sent to a single collector, which can be difficult in a distributed system like OpenSearch, where routing all spans to a specific data node is not straightforward. In the following sections, we will explore different strategies related to these sampling methods.
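To make the head sampling idea concrete, here is a minimal sketch of a probabilistic decision taken before any span data exists. It assumes the only input is the trace ID and a configured ratio; the function name `head_sample` and the 64-bit comparison are illustrative choices, similar in spirit to trace-ID ratio samplers, and not the OpenSearch or OpenTelemetry implementation.

```python
TRACE_ID_BITS = 64  # assumption: compare only the low 64 bits of the trace ID


def head_sample(trace_id: int, ratio: float) -> bool:
    """Decide at span creation time, using nothing but the trace ID.

    Deriving the decision from the trace ID (instead of a fresh random
    number) means every node reaches the same verdict for the same trace.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be within [0.0, 1.0]")
    bound = int(ratio * (1 << TRACE_ID_BITS))
    low_bits = trace_id & ((1 << TRACE_ID_BITS) - 1)
    return low_bits < bound
```

Because the decision depends only on the trace ID, it cannot take latency or errors into account, which is exactly the limitation that tail sampling addresses.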
OpenSearch Sampling Requirements
To determine the most suitable sampling strategy, it is essential to consider the specific requirements and goals for implementing tracing in the system. Here are some common scenarios:
- Debugging a Specific Request: When tracing is primarily intended for debugging a particular request, it is advisable to capture 100% of the data related to that specific request. This level of detailed tracing allows for a comprehensive examination of the request's behavior.
- Debugging Failures: To debug failures (4xx, 5xx, etc.), we need to sample 100% of the traffic at the head and then, during tail sampling, keep spans based on attributes such as status and resource consumption.
- Debugging System-Wide Issues: For debugging system-wide issues, capturing 10-50% of the traces is typically sufficient. This level of sampling enables the identification and understanding of failures, errors, and latencies across the entire system without overwhelming the tracing infrastructure.
- Application Baseline: If the goal is to observe the application's normal behavior and ensure it is functioning as expected, a minimal sampling rate of 5-10% of the traces should be enough. This provides a representative sample of the application's regular performance.
- Tracing Code Paths and Architecture: When the focus is on tracing overall code paths and understanding the system's architecture, a very minimal tracing approach should be adequate. In this case, a low sampling rate can still provide valuable insights without excessive data collection.
By carefully considering the specific use cases and objectives of tracing, a suitable sampling strategy can be chosen to strike the right balance between capturing enough data for analysis while minimizing the impact on performance and storage resources.
OpenSearch sampling strategy
For a distributed system like OpenSearch, an effective sampling strategy often combines both head and tail sampling techniques to achieve the desired tracing goals. Let's delve into the details of the proposed sampling strategy for OpenSearch:
Head Based OpenSearch sampling
- OpenSearch, being a distributed system, contains numerous Transport and Rest actions, all of which need to be instrumented by default with a minimal sampling rate. However, specific critical codepaths, like Search and Indexing, could be sampled at a higher rate for more in-depth analysis.
- The sampling rates should be configurable on-the-fly through settings, allowing flexibility in adjusting the sampling percentages based on requirements.
- The strategy should allow tracing to be disabled for certain transport actions, such as health checks, to reduce unnecessary overhead.
- Sampling rates could be configured using the following schema, which accommodates various use cases and performance considerations.
{
  "action_strategies": [
    {
      "action": "search_action",
      "type": "probabilistic",
      "param": 1.0
    },
    {
      "action": "bulk_action",
      "type": "probabilistic",
      "param": 1.0
    },
    {
      "action": "internal:coordination/fault_detection/follower_check",
      "type": "probabilistic",
      "param": 0.0
    },
    {
      "action": "internal:coordination/fault_detection/leader_check",
      "type": "probabilistic",
      "param": 0.0
    }
  ],
  "default_strategy": {
    "type": "probabilistic",
    "param": 0.001
  }
}
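As a rough illustration of how such a schema could be consumed, the sketch below loads a trimmed copy of the configuration and resolves the sampling rate for an action, falling back to the default strategy. The helper names (`build_rate_table`, `should_sample`) and the injectable `rng` parameter are hypothetical and only serve the example; the actual settings mechanism in OpenSearch may look quite different.

```python
import json
import random

# Trimmed copy of the proposed schema, used only for this example.
CONFIG = json.loads("""
{
  "action_strategies": [
    {"action": "search_action", "type": "probabilistic", "param": 1.0},
    {"action": "internal:coordination/fault_detection/leader_check",
     "type": "probabilistic", "param": 0.0}
  ],
  "default_strategy": {"type": "probabilistic", "param": 0.001}
}
""")


def build_rate_table(config):
    """Map each configured action name to its probabilistic sampling rate."""
    rates = {}
    for strategy in config["action_strategies"]:
        if strategy["type"] != "probabilistic":
            raise ValueError("unsupported strategy type: " + strategy["type"])
        rates[strategy["action"]] = float(strategy["param"])
    return rates


def should_sample(action, config, rng=random.random):
    """Head sampling decision: per-action rate, else the default rate."""
    rates = build_rate_table(config)
    rate = rates.get(action, float(config["default_strategy"]["param"]))
    # rng() returns a value in [0.0, 1.0), so 1.0 always samples and 0.0 never does.
    return rng() < rate
```

With this shape, a rate of 0.0 effectively disables tracing for an action such as a health check, and changing the settings at runtime only requires rebuilding the rate table.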
Tail Sampling
As most requests in OpenSearch return 200 OK responses and stay within Service Level Agreements (SLAs), not all of these requests need to be traced. Tail sampling becomes valuable in this context.
- Tail sampling would involve sampling traces based on specific parameters to filter out traces that are less relevant for analysis, thereby conserving resources.
- Implementing tail sampling can be challenging, as it requires sending all spans belonging to a single trace to a particular collector.
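A tail sampling decision of this kind can be sketched as follows. The example assumes that all spans of a trace have already been gathered in one place, and it uses a hypothetical `Span` record with an HTTP-style status code and a duration; the 500 ms threshold is an arbitrary placeholder, not a recommended value.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    # Hypothetical, minimal span record for illustration only.
    name: str
    duration_ms: float
    status_code: int


def keep_trace(spans: List[Span], latency_threshold_ms: float = 500.0) -> bool:
    """Keep a completed trace only if it contains a failure or a slow span."""
    if not spans:
        return False
    has_error = any(span.status_code >= 400 for span in spans)
    slowest_ms = max(span.duration_ms for span in spans)
    return has_error or slowest_ms > latency_threshold_ms
```

Traces made up entirely of fast, successful spans are dropped, which matches the observation that most 200 OK requests within SLA do not need to be retained.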
Options for Tail Sampling in OpenSearch:
- Send Spans as Part of Response: Spans are sent as part of the response back to the coordinator node, which can then export these spans to the local collector. The overall impact on performance should be assessed through performance runs.
Pros:
- Simple to achieve.
- Uniform distribution of requests should ensure a similar number of spans per collector/coordinator.
Cons:
- Increased response size due to added span data.
- Managing the lifecycle of a span may introduce some complexity.
- Export Span to Coordinator Node Always: A custom span exporter can be created to directly export spans from the OpenSearch core to the collector running on the coordinator node. This might require opening the gRPC port for inter-node communication.
Pros:
- Spans are not sent as part of the response, reducing response size.
- Easy to identify and propagate the coordinator IP from the request to child spans.
Cons:
- Harder to debug and monitor.
- Increased data network utilization for internode communication.
- Export Span to Local Collector: Spans are initially exported to a local collector before being distributed to a single collector that handles all spans belonging to a particular trace. This could involve multiple levels of collectors.
Pros:
- Utilizes out-of-the-box capabilities from OpenTelemetry (otel).
- Distribution of spans occurs outside the OpenSearch process.
Cons:
- Every collector needs to be aware of the other nodes, introducing some complexity.
- Handling corner cases, such as node failures, requires careful consideration.
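For the local-collector option, the first tier has to forward every span of a trace to the same second-tier collector. A minimal sketch of that routing step, assuming a static list of collector addresses and a plain hash-modulo scheme (a real deployment would more likely use consistent hashing), could look like this. The function name and the addresses are hypothetical.

```python
import hashlib
from typing import Sequence


def pick_collector(trace_id: str, collectors: Sequence[str]) -> str:
    """Route by trace ID so all spans of one trace reach the same collector."""
    if not collectors:
        raise ValueError("at least one collector is required")
    # Use a stable hash; Python's built-in hash() is randomized per process.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(collectors)
    return collectors[index]


# Usage: every local collector applies the same function to the same list.
collectors = ["collector-a:4317", "collector-b:4317", "collector-c:4317"]
target = pick_collector("4bf92f3577b34da6a3ce929d0e0e4736", collectors)
```

The sketch also shows why node failures are a corner case: with hash-modulo routing, removing one entry from the list changes the target for most trace IDs, so in-flight traces can end up split across collectors.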
In conclusion, tail sampling in OpenSearch requires careful consideration, as there is no clear winner among the options discussed. Option 2 (Export Span to Coordinator Node Always) and Option 3 (Export Span to Local Collector) are already configurable in otel. Option 1 (Send Spans as Part of Response) may introduce resource overhead and requires a feasibility test. Looking forward to feedback from the community on these tail sampling options.
Other limiting factors
There are several other limiting factors and considerations when implementing tracing in a distributed system like OpenSearch:
- Overuse of Spans: Each span comes with a cost, both in terms of performance overhead and storage requirements. Therefore, it's crucial to exercise caution when adding spans. Overusing tracing can lead to a significant impact on overall system performance and resource utilization.
- Limiting Horizontally: Sampling techniques play a vital role in limiting the number of traces that are captured and recorded. By intelligently sampling requests and spans, the system can avoid an overwhelming amount of tracing data while still obtaining valuable insights.
- Limiting Vertically with Levels: Implementing span levels can be an effective way to limit the number of spans per trace. By defining levels of detail, the system can control the depth and granularity of the tracing information collected for each request, allowing more focused analysis when needed.
- Max Spans: Enforcing limits on the number of spans per unit of time (e.g., per minute) is another useful approach to prevent excessive tracing. Setting a maximum number of spans ensures that the tracing infrastructure doesn't get overwhelmed during peak usage periods. This may result in partial traces.
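The max-spans limit can be illustrated with a simple fixed-window counter. This is only a sketch under the assumption that spans are counted per process and per time window; the class name `SpanRateLimiter` and its parameters are hypothetical, and the clock is injectable purely to make the behaviour easy to test.

```python
import time
from typing import Callable


class SpanRateLimiter:
    """Allow at most max_spans span creations per fixed time window."""

    def __init__(self, max_spans: int, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._max_spans = max_spans
        self._window_seconds = window_seconds
        self._clock = clock
        self._window_start = clock()
        self._count = 0

    def try_acquire(self) -> bool:
        """Return True if a new span may be created in the current window."""
        now = self._clock()
        if now - self._window_start >= self._window_seconds:
            # A new window begins: reset the counter.
            self._window_start = now
            self._count = 0
        if self._count >= self._max_spans:
            return False
        self._count += 1
        return True
```

Since the limiter rejects individual spans rather than whole traces, a trace that straddles the limit loses some of its spans, which is how the partial traces mentioned above come about.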
By thoughtfully considering these limiting factors and incorporating the appropriate sampling techniques, level definitions, and span limits, the tracing implementation in OpenSearch can strike the right balance between capturing sufficient data for analysis and maintaining a performant and efficient distributed system. I am doing a POC with a couple of approaches and will update the results here.