
CA-404020: Do not fail when removing a non-existing datasource #6199


Merged

Conversation

psafont
Member

@psafont psafont commented Dec 27, 2024

The latest changes in the metrics daemon change how archived and live metrics
are synchronised, and archived sources are now created less often. This meant
that in some cases, like SR.forget, operations failed: they could query for
existing sources, but the call to remove them failed because the metrics only
exist in live form, not in archived form.

(What happens with the live ones? Do they disappear when the SR is forgotten?)

Signed-off-by: Pau Ruiz Safont [email protected]

Also removes some traversals when reading from XML data

Signed-off-by: Pau Ruiz Safont <[email protected]>
(review thread on the line: List.sort_uniq String.compare ds_names)
Contributor


I think we don't need to sort here? Could we still use Utils.setify ds_names instead?

Member Author


We want to drop setify from as many places as possible in the codebase; it has a cost of O(n^2), while sort_uniq should be faster in most cases. Since we don't need to maintain the order of the elements here, the latter can be used.
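A minimal sketch of the cost difference being discussed. The setify below is a hypothetical stand-in written in the usual quadratic style, not the actual Utils.setify from the xapi codebase:

```ocaml
(* Quadratic dedup that preserves first-occurrence order: each element
   is checked against the accumulator with List.mem, an O(n) scan,
   giving O(n^2) overall. *)
let setify xs =
  List.fold_left (fun acc x -> if List.mem x acc then acc else x :: acc) [] xs
  |> List.rev

(* O(n log n) dedup; the result comes back sorted, so the original
   order is lost, which is acceptable when order does not matter. *)
let dedup xs = List.sort_uniq String.compare xs

let () =
  assert (setify ["b"; "a"; "b"; "a"] = ["b"; "a"]);
  assert (dedup ["b"; "a"; "b"; "a"] = ["a"; "b"])
```

The trade-off is exactly the one named in the thread: sort_uniq wins asymptotically, at the price of reordering the elements.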

Contributor


If we want to avoid O(n^2) behaviour then we could use List.sort instead of List.sort_uniq and look for consecutive equal elements: those are the duplicates that we need to remove.
List.filter_map + List.filter is still O(n^2).
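A sketch of that suggestion; the function name and the generic cmp parameter are illustrative, not from the PR:

```ocaml
(* Sort once (O(n log n)), then drop consecutive equal elements in a
   single linear pass; this is effectively what List.sort_uniq does
   internally. *)
let dedup_consecutive cmp xs =
  let rec drop = function
    | a :: (b :: _ as rest) when cmp a b = 0 -> drop rest
    | a :: rest -> a :: drop rest
    | [] -> []
  in
  drop (List.sort cmp xs)

let () = assert (dedup_consecutive compare [3; 1; 3; 2] = [1; 2; 3])
```

(drop is not tail-recursive; for a real implementation on large lists one would accumulate and reverse instead.)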

@gangj
Contributor

gangj commented Jan 1, 2025

Would you please share how archived and live metrics are synchronized? With this fix, is it possible for some entry to be synchronized from live to archived after an attempt to remove it by rrd_remove_ds? And is this a valid case?

@psafont
Member Author

psafont commented Jan 2, 2025

Would you please share how archived and live metrics are synchronized?

The change is about datasources, not metrics. Archived datasources hold metadata recorded when xapi asks to disable or enable a datasource. This information is not stored in the data structures of the live datasource, and its lifetime is independent: it needs to survive after the live datasource stops reporting, so it can't easily be kept in the live data structure.

@psafont
Member Author

psafont commented Jan 2, 2025

With this fix, is it possible for some entry to be synchronized from live to archived after an attempt to remove it by rrd_remove_ds?

Remove can be called at any time, so data races are theoretically possible. In practice I haven't observed a mechanism that synchronises live datasources with archival / default ones.

Datasources about SRs are in a specially awkward place because they are currently reported as host metrics. This means that once they exist as live datasources, they are never removed from memory, because there's no mechanism for that. There is a mechanism to manage SR datasources independently, and there's a lot of code for it, but it's currently unused. VMs do not have this problem.
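A hypothetical model of the failure mode the PR fixes (the names and types below are illustrative, not the actual rrdd code): a datasource can exist only in the live table, so removing it from the archived table must tolerate a missing key rather than fail:

```ocaml
module StringMap = Map.Make (String)

(* Archived datasources, keyed by name. A live datasource may have no
   archived counterpart at all, e.g. when it was never disabled. *)
let remove_archived_ds name archived =
  (* Map.remove is a no-op when the key is absent: this models the
     "do not fail when removing a non-existing datasource" behaviour. *)
  StringMap.remove name archived

let () =
  let archived = StringMap.(empty |> add "avgqu_sz" ()) in
  (* Removing a datasource that only exists in live form must not raise. *)
  let after = remove_archived_ds "only_live_ds" archived in
  assert (StringMap.cardinal after = 1)
```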

Contributor

@edwintorok edwintorok left a comment


The optimization discussed to avoid List.filter + List.filter_map could be continued in another PR; this PR is already an improvement over what was there before.

@Vincent-lau
Contributor

Vincent-lau commented Jan 2, 2025

The problem here, as @psafont suggested, is that query_possible_dss looks at the live and disabled datasources and picks out the relevant ones to delete. In this particular case it is the avgqu_sz_296626b4 rrd, which turns out to be a live one. Meanwhile, forget_host_ds removes a datasource by name from the disabled/archived datasources, i.e. rrdi.rrd.rrd_dss, ignoring all the live datasources, i.e. rrdi.dss. I am not sure why the behaviour is like this, but if what we mean by forget_host_ds is to forget only the archived datasources, then this PR would be good.

@psafont
Member Author

psafont commented Jan 2, 2025

That's a limitation of the current design, and fixing it needs careful readjusting of all the SR-related metrics: all the names would need to change (do multiple hosts report about the same SR?). I would avoid trying to fix it here, as it would add quite a lot of risk to the change.

@psafont psafont added this pull request to the merge queue Jan 3, 2025
Merged via the queue into xapi-project:master with commit c264219 Jan 3, 2025
15 checks passed
@psafont psafont deleted the private/paus/datasource-confusion branch March 24, 2025 10:10