
BugFix: Aggregate neuroncore utilization across core/device pair #1713


Open
wants to merge 9 commits into
base: main

Conversation

Collaborator

@spanaik spanaik commented Jun 3, 2025

Description of the issue

BugFix: The gpuattributes processor currently does not aggregate neuroncore_utilization metrics across cores and Neuron runtimes, which results in inaccurate per-core utilization metrics.

Description of changes

This change aggregates core utilization by taking the maximum utilization reported for the same core/device pair across all runtimes.
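
A minimal sketch of the max aggregation described above, not the processor's actual code. The names `coreKey`, `sample`, and `maxUtilizationPerCore` are hypothetical and chosen only for illustration; the sketch assumes each runtime reports one utilization value per core/device pair.

```go
package main

import "fmt"

// coreKey identifies one NeuronCore on one Neuron device.
// Hypothetical type for this sketch; not taken from the gpuattributes processor.
type coreKey struct {
	Device string
	Core   string
}

// sample is one neuroncore_utilization datapoint reported by a single runtime.
// Hypothetical type for this sketch.
type sample struct {
	Runtime     string
	Key         coreKey
	Utilization float64
}

// maxUtilizationPerCore collapses samples from all runtimes into one value per
// core/device pair by keeping the largest utilization seen for that pair.
func maxUtilizationPerCore(samples []sample) map[coreKey]float64 {
	out := make(map[coreKey]float64, len(samples))
	for _, s := range samples {
		if cur, ok := out[s.Key]; !ok || s.Utilization > cur {
			out[s.Key] = s.Utilization
		}
	}
	return out
}

func main() {
	// Two runtimes report on device 0 / core 0; one runtime reports on core 1.
	samples := []sample{
		{Runtime: "rt-a", Key: coreKey{Device: "0", Core: "0"}, Utilization: 10},
		{Runtime: "rt-b", Key: coreKey{Device: "0", Core: "0"}, Utilization: 75},
		{Runtime: "rt-a", Key: coreKey{Device: "0", Core: "1"}, Utilization: 40},
	}

	agg := maxUtilizationPerCore(samples)
	fmt.Printf("device 0 core 0: %.1f\n", agg[coreKey{Device: "0", Core: "0"}])
	fmt.Printf("device 0 core 1: %.1f\n", agg[coreKey{Device: "0", Core: "1"}])
}
```

Unlike a sum, the max can never exceed the largest value any single runtime reported, so the aggregated result stays in the same range as the inputs.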

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

Deployed a custom agent to check the behavior.

Before:

image

After:

image

Requirements

Before committing the code, please complete the following steps.

  1. Run make fmt and make fmt-sh
  2. Run make lint

@spanaik spanaik requested a review from a team as a code owner June 3, 2025 12:34
movence previously approved these changes Jun 9, 2025
@spanaik spanaik requested review from aditya-purang and movence June 9, 2025 21:30
@@ -48,6 +49,8 @@ const (
 	RuntimeTagOverride = "DEFAULT"
 	NeuronExecutionErrorsAggregatedMetric = containerinsightscommon.NeuronExecutionErrors + "_total"
 	NeuronDeviceHardwareEccEventsAggregatedMetric = containerinsightscommon.NeuronDeviceHardwareEccEvents + "_total"
+	NeuronCoreLabel = "neuroncore"
+	NeuronCorePerDevice = 2
Contributor

@sky333999 sky333999 Jun 10, 2025

nit: mind fixing the indent on the prior line? If not, it'll just happen the next time someone touches this file.
This can be done when we update the go.mod to bring in the receiver changes.

4 participants