Skip to content

blueprint_planner background task #8287

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 11 commits into
base: main
Choose a base branch
from
Open

blueprint_planner background task #8287

wants to merge 11 commits into from

Conversation

plotnick
Copy link
Contributor

@plotnick plotnick commented Jun 6, 2025

Fixes #8244. Adds a new blueprint_planner background task that periodically invokes the planer. If the resulting blueprint is different than the current target, it is saved and becomes the new target.

Todo:

  • Tests

Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might have jumped the gun in looking at this but it looks great.

@plotnick plotnick force-pushed the blueprint-planner branch from 0d61d6f to dfd9bba Compare June 7, 2025 17:53
@plotnick plotnick force-pushed the blueprint-planner branch from 8e5379b to 53b7cb0 Compare June 10, 2025 00:05
@plotnick
Copy link
Contributor Author

53b7cb0 adds a working blueprint_planner test, but breaks dns::test::test_silos_external_dns_end_to_end for reasons I don't understand. Anyone have any ideas?

@davepacheco
Copy link
Collaborator

53b7cb0 adds a working blueprint_planner test, but breaks dns::test::test_silos_external_dns_end_to_end for reasons I don't understand. Anyone have any ideas?

The test is failing because it (manually) executed the system's initial blueprint, tried to verify that the internal DNS generation didn't change, but it had changed. One guess is that maybe the planner task is enabled in the test suite environment? (You could check by running cargo xtask omicron-dev run-all and then omdb nexus background-tasks show blueprint_planner against it.) If it were, and if the planner immediately made some non-trivial change (for some other reason I don't know), then maybe the internal DNS changed as part of that. There should also be some evidence in the Nexus log file, though there's a lot of noise to sift through. I can help dig into this tomorrow.

It looks like an omdb test also failed. It looks like that one failed after listing blueprints and then trying to show a specific one. It failed because that blueprint is trying to add a boundary NTP zone. So yeah, I wonder if the planner is enabled in the test suite by default and shouldn't be? That should be controlled by nexus/tests/config.test.toml.

@plotnick
Copy link
Contributor Author

plotnick commented Jun 11, 2025

The test is failing because it (manually) executed the system's initial blueprint, tried to verify that the internal DNS generation didn't change, but it had changed.

So, in able to run the planner in the test context, we need a boundary NTP zone in the initial setup. The presence of such a zone causes the DNS generation number bump on execution (presumably because it's adding the NTP server to external DNS), regardless of whether the planner is enabled. Should I update the test to reflect this change to its environment, or try something else (suggestions welcome)?

@davepacheco
Copy link
Collaborator

The test is failing because it (manually) executed the system's initial blueprint, tried to verify that the internal DNS generation didn't change, but it had changed.

So, in able to run the planner in the test context, we need a boundary NTP zone in the initial setup. The presence of such a zone causes the DNS generation number bump on execution (presumably because it's adding the NTP server to external DNS), regardless of whether the planner is enabled. Should I update the test to reflect this change to its environment, or try something else (suggestions welcome)?

Thanks for digging into that. If I'm understanding correctly, there are two separate issues here:

  1. I expect that enabling the automated planner in the test context is probably going to break a bunch of tests and we may not be able to identify them all ahead of time. Any existing tests that generate and insert blueprints, especially if they also make those blueprints the new target, are probably assuming that they're the only thing that's doing that. They may work if they get lucky but fail randomly based on timing. So I'd be inclined to just leave this off in the test configuration. That should fix the problem in this PR.
  2. The thing you mentioned about not having boundary NTP in the initial blueprint in the test context is still a problem and looks like the root cause of cannot run planner in simulated environments #8221. I hope we can fix this by putting a boundary NTP zone into that initial blueprint and DNS config. Then, running the planner in a test context ought to produce a blueprint with no diffs against the initial one. But all of this is moot if we disable the automated planner task in these tests by default.

@plotnick
Copy link
Contributor Author

Sorry, I think I unclear because I was referring to changes I hadn't pushed up. d1e2e14 correctly wires up the disable_planner switch, so that the background planner task is in fact disabled by default and in the test environment (verified using xtask omicron-dev run-all). But test_silos_external_dns_end_to_end still fails, because its expectations about what's changing/not changing on execution are now violated due to the presense of the new boundary NTP zone introduced in 53b7cb0 (see configure_boundary_ntp).

I have also confirmed (again with omicron-dev) that as of 5a0d82e, this fixes #8221, but does not fix #8222. The error from the latter confuses me but makes me think (possibly incorrectly) that it's related to the failure I'm seeing in test_silos_external_dns_end_to_end.

@davepacheco
Copy link
Collaborator

Ah, thanks for that. The way this works in both a real system and the test environment is that we (well, RSS in the real system, and nexus-test-utils in the test environment) set up both an initial blueprint and (in parallel) an initial DNS configuration. Both get sent in the RSS-to-Nexus handoff. (This is somewhat an artifact of history, since the contents of internal DNS should be derivable from the blueprint. But the DNS config needs to be sent to the DNS servers before we can do the Nexus handoff (so Nexus can find CockroachDB, CockroachDB can find each other, etc.) so it also kind of makes sense to send Nexus what the DNS servers were configured with.) So in a working system, if you were to execute the initial blueprint, there would be no changes to internal DNS because it was constructed to match what the initial blueprint specifies.

It looks like 53bc7cb0 added the boundary NTP zone to the blueprint, but not the internal DNS config. If you look at say start_crdb_impl, start_clickhouse, "start_nexus_internal", etc. all of these things push a new BlueprintZoneConfig onto self.blueprint_zones and also update the DNS config (with self.rack_init_builder.add_XXX_to_dns()). I think if you add a similar thing to configure_boundary_ntp, then the initial DNS should match what the initial blueprint says it will be, and then executing the initial blueprint won't trigger any DNS changes. And then this test won't panic there.

It does look like this test is also hitting #8222, but I think it doesn't actually notice that (which would explain why it's not failing on "main", even though 8222 has been present for some time) and it doesn't affect what it's trying to test.

@plotnick
Copy link
Contributor Author

It looks like 53bc7cb0 added the boundary NTP zone to the blueprint, but not the internal DNS config.

Thank you, that was the bit I was missing ("it's always DNS"). f3f243a adds a dummy DNS entry, which fixes the test.

If I could trouble you one step further: I'm not actually sure what the "right" address is to register with internal DNS for the new boundary NTP server. But since it's not a real server, maybe it doesn't much matter (for now, at least)?

@davepacheco
Copy link
Collaborator

davepacheco commented Jun 11, 2025

So I guess I'd use that (omicron_common::address::NTP_PORT).

Thanks again, updated in f5ad974.

[edit: it seems I accidentally edited your comment instead of replying to a quoted excerpt of it. I have no idea how I managed to do that, but apologize anyway. @plotnick]

@plotnick plotnick force-pushed the blueprint-planner branch from 842d038 to 611f48b Compare June 11, 2025 19:56
@plotnick plotnick marked this pull request as ready for review June 11, 2025 20:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

automate the planner
3 participants