
[service/internal/graph] Measure telemetry as it is passed between pipeline components #12812


Open: wants to merge 14 commits into main

Conversation

@djaglowski (Member) commented Apr 8, 2025

Depends on #12856

Resolves #12676

This is a reboot of #11311, incorporating metrics defined in the component telemetry RFC and attributes added in #12617.

The basic pattern is:

  • When building any pipeline component that produces data, wrap the "next consumer" with instrumentation that measures the number of items being passed. The wrapped consumer is then passed into the component's constructor.
  • When building any pipeline component that consumes data, wrap the component itself. The resulting wrapped consumer is saved onto the graph node so that it can be retrieved during graph assembly.

TODO:

  • There are no tests that directly validate the presence or content of the metrics. I've done some manual validation and believe things are working correctly. I'll work on unit tests soon but would appreciate feedback on the PR regardless.

Next Steps:

  • This defines but does not implement the "size" metrics. (It both defines and implements the item count metrics.)

@djaglowski djaglowski changed the title Compiling and tests passing [service/internal/graph] Record normalized telemetry as it is passed between pipeline components Apr 8, 2025
@djaglowski djaglowski changed the title [service/internal/graph] Record normalized telemetry as it is passed between pipeline components [service/internal/graph] Measure telemetry as it is passed between pipeline components Apr 8, 2025
@djaglowski djaglowski force-pushed the pipeline-component-metrics branch 2 times, most recently from b6bb02d to 2f83f2b Compare April 8, 2025 19:20

codecov bot commented Apr 8, 2025

Codecov Report

Attention: Patch coverage is 86.89655% with 57 lines in your changes missing coverage. Please review.

Project coverage is 91.56%. Comparing base (d020c90) to head (d882a04).

Files with missing lines | Patch % | Lines
service/internal/graph/connector.go | 63.90% | 32 missing, 16 partials ⚠️
service/internal/graph/exporter.go | 85.71% | 2 missing, 1 partial ⚠️
service/internal/graph/processor.go | 89.65% | 2 missing, 1 partial ⚠️
service/internal/graph/receiver.go | 75.00% | 2 missing, 1 partial ⚠️

❌ Your patch check has failed because the patch coverage (86.89%) is below the target coverage (95.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #12812      +/-   ##
==========================================
- Coverage   91.65%   91.56%   -0.10%     
==========================================
  Files         499      499              
  Lines       27426    27801     +375     
==========================================
+ Hits        25138    25456     +318     
- Misses       1809     1847      +38     
- Partials      479      498      +19     


@jaronoff97 (Contributor) left a comment


Some initial thoughts; overall the implementation seems very sane to me, though. Thank you!

@djaglowski djaglowski force-pushed the pipeline-component-metrics branch from 2f83f2b to 052cffc Compare April 8, 2025 20:36
@djaglowski djaglowski force-pushed the pipeline-component-metrics branch from 052cffc to 362a610 Compare April 9, 2025 16:39
@djaglowski djaglowski force-pushed the pipeline-component-metrics branch 2 times, most recently from fbed573 to b789109 Compare April 9, 2025 20:24
@codeboten (Contributor) left a comment


Thanks for moving this work forward @djaglowski, just a few questions

Comment on lines +62 to +65
obsConsumer := obsconsumer.NewTraces(fanoutconsumer.NewTraces(consumers), tb.ReceiverProducedItems)
n.Component, err = builder.CreateTraces(ctx, set, obsConsumer)
Contributor

Not a blocking comment:

Suggested change:
- obsConsumer := obsconsumer.NewTraces(fanoutconsumer.NewTraces(consumers), tb.ReceiverProducedItems)
- n.Component, err = builder.CreateTraces(ctx, set, obsConsumer)
+ n.Component, err = builder.CreateTraces(ctx, set, obsconsumer.NewTraces(fanoutconsumer.NewTraces(consumers), tb.ReceiverProducedItems))

@djaglowski (Member, Author)

Thanks for the reviews.

#12817 implements a subset of this PR. If we can get that merged in first, I'll rebase to reduce the scope of this one substantially.

@MikeGoldsmith (Member) left a comment


I think this looks really good 👍🏻

As shared in the other PR, I'd really like to see bytes counters too but this is a great start.

github-merge-queue bot pushed a commit that referenced this pull request Apr 15, 2025
Subset of #12812

This internal package defines wrappers around consumers. These are
useful for instrumenting the component graph, so that we can generate
telemetry describing data as it is passed in between components.

Currently, this supports only a single counter metric, but in the near
future it can be enhanced to automatically capture multiple metrics
(e.g. item count & size), and potentially spans and/or logs as well.
@djaglowski djaglowski force-pushed the pipeline-component-metrics branch from b789109 to 4afe6cd Compare April 15, 2025 13:13
@djaglowski djaglowski force-pushed the pipeline-component-metrics branch from 4afe6cd to 5bdd398 Compare April 15, 2025 14:29
@github-actions github-actions bot requested a review from dmathieu April 15, 2025 14:29
@djaglowski djaglowski force-pushed the pipeline-component-metrics branch 4 times, most recently from adec160 to b66a5a8 Compare April 17, 2025 14:47
@djaglowski djaglowski force-pushed the pipeline-component-metrics branch 2 times, most recently from 4bfa6c4 to c03ad2c Compare April 22, 2025 22:22
@djaglowski djaglowski force-pushed the pipeline-component-metrics branch from c03ad2c to 1a1d65a Compare April 22, 2025 22:32
@djaglowski djaglowski closed this Apr 22, 2025
@djaglowski djaglowski reopened this Apr 22, 2025
@djaglowski djaglowski marked this pull request as ready for review April 23, 2025 17:00
@djaglowski djaglowski requested a review from a team as a code owner April 23, 2025 17:00
@djaglowski (Member, Author)

I am pretty stumped on the datadog exporter integration test failure. I don't see anything explicitly indicating that it is related to this change, but it does appear to fail consistently on this PR. The panic points to Prometheus, which made me suspect that the new metric naming format (using a `.` delimiter) could be the culprit. However, I tested this by temporarily switching to the old metric name format and observed that it failed in the same way. (836b2b3)

@mx-psi is there anyone at DataDog that might be able to provide more insight into this test?

Comment on lines +89 to +92
if err != nil {
return fmt.Errorf("failed to create %q processor, in pipeline %q: %w", set.ID, n.pipelineID.String(), err)
}
n.consumer = obsconsumer.NewProfiles(n.Component.(xconsumer.Profiles), tb.ProcessorConsumedItems)
Contributor

This could be done once, after the switch statement.

Comment on lines +124 to +171
receiver.produced.size:
prefix: otelcol.
enabled: false
description: Size of items emitted from the receiver.
unit: "By"
sum:
value_type: int
monotonic: true
processor.consumed.size:
prefix: otelcol.
enabled: false
description: Size of items passed to the processor.
unit: "By"
sum:
value_type: int
monotonic: true
processor.produced.size:
prefix: otelcol.
enabled: false
description: Size of items emitted from the processor.
unit: "By"
sum:
value_type: int
monotonic: true
connector.consumed.size:
prefix: otelcol.
enabled: false
description: Size of items passed to the connector.
unit: "By"
sum:
value_type: int
monotonic: true
connector.produced.size:
prefix: otelcol.
enabled: false
description: Size of items emitted from the connector.
unit: "By"
sum:
value_type: int
monotonic: true
exporter.consumed.size:
prefix: otelcol.
enabled: false
description: Size of items passed to the exporter.
unit: "By"
sum:
value_type: int
monotonic: true
Contributor

if the size metrics aren't used yet, I'd suggest removing these from the PR

@codeboten (Contributor)

I am pretty stumped on the datadog exporter integration test failure. I don't see anything explicitly indicating that it is related to this change, but it does appear to fail consistently on this PR. The panic points to Prometheus, which made me suspect that the new metric naming format (using a `.` delimiter) could be the culprit. However, I tested this by temporarily switching to the old metric name format and observed that it failed in the same way. (836b2b3)

@djaglowski not sure if it's helpful, but I ran into a similar issue back in the PR that was updating the collector to use the otel config package from otel-go and @songy23 commented on the failure: #11611 (comment)

@codeboten (Contributor)

I built the collector using the branch and when running it, this is what I see at localhost:8888/metrics:

An error has occurred while serving metrics:

2 error(s) occurred:
* collected metric "otelcol_processor_batch_metadata_cardinality" { label:{name:"otel_scope_name" value:"go.opentelemetry.io/collector/processor/batchprocessor"} label:{name:"otel_scope_version" value:""} label:{name:"processor" value:"batch"} gauge:{value:1}} was collected before with the same name and label values
* collected metric "otelcol_processor_batch_metadata_cardinality" { label:{name:"otel_scope_name" value:"go.opentelemetry.io/collector/processor/batchprocessor"} label:{name:"otel_scope_version" value:""} label:{name:"processor" value:"batch"} gauge:{value:1}} was collected before with the same name and label values

@codeboten (Contributor)

Testing this further, this appears to be the current behaviour in main; it looks like something broke in the last 20 commits.

@codeboten (Contributor)

Looks like things have been unhappy since this change: dc8e2dd

@dmitryax (Member) left a comment


The PR LGTM. However, I'd like to get an agreement on #12916 before releasing this.

@songy23 (Member) commented Apr 24, 2025

I built the collector using the branch and when running it, this is what I see at localhost:8888/metrics:

I just tested and saw the same; internal metrics are broken at mainline head.

@songy23 (Member) commented Apr 24, 2025

Looks like things have been unhappy since this change: dc8e2dd

Yes, reverting that commit fixes it: http://github.com/open-telemetry/opentelemetry-collector/pull/12917


Successfully merging this pull request may close these issues.

Implement pipeline instrumentation
7 participants