[GPII-3585]: Improve Stackdriver Ruby client to not update resources unnecessarily #231
LGTM. Thanks @natarajaya for this. I must note this seems quite a tedious process (in particular updating the resources) and I think we should have a look at Terraform to manage this as soon as it has all the functionality we need. Alerting policies are there - https://www.terraform.io/docs/providers/google/r/monitoring_alert_policy.html ... what else are we waiting for?
@stepanstipl Looks like only the log-based metrics implementation is missing.
I don't think this should be a priority, but I agree that getting rid of this custom code definitely benefits us long term. Feel free to start working on an alternative implementation :)
Agreed. What are the consequences for us when the API changes? Can we end up in a situation where an alerting policy silently stops working, leaving us without monitoring of critical infrastructure? Or will the malformed alerting policy cause an error... immediately? during the next deployment?
Agreed.
Disagreed :). However, do feel free to add these notes to the Too Hot For October ticket for later discussion.
@@ -68,6 +68,36 @@ def get_alert_policy_identifier(alert_policy)
  return result
end

def compare_alert_policies(stackdriver_alert_policy, alert_policy)
  stackdriver_alert_policy = JSON.parse(stackdriver_alert_policy.to_hash.to_json)
  ["name", "creation_record", "mutated_by", "mutation_record"].each do |attribute|
I am concerned about how tightly coupled this is to the (apparently rapidly-evolving) Stackdriver API. I foresee a lot of build failures caused by Stackdriver's JSON format changing periodically.
Do you think an allowlist-based approach (only compare these keys) would be better than a denylist-based approach (compare all but these keys, as you're doing now)? It might provide us a little insulation from API changes.
Other than this, I'm not really sure what to do. Maybe we should follow Stepan's suggestion and look at moving this functionality back to Terraform in the short-term. Perhaps we should apply this PR (one last band-aid), see how much it improves alerting stability, and then re-evaluate?
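To make the allowlist/denylist trade-off concrete, here is a minimal Ruby sketch. It uses plain hashes rather than the gem's objects; the allowlisted keys, the helper names, and the `validity` field standing in for an upstream addition are assumptions for illustration, not the PR's actual code.

```ruby
# Keys taken from the diff above (server-managed attributes to ignore).
DENYLIST = ["name", "creation_record", "mutated_by", "mutation_record"].freeze
# Hypothetical set of keys we define ourselves and care about.
ALLOWLIST = ["display_name", "combiner", "conditions"].freeze

# Denylist: compare everything except the ignored keys.
def equal_by_denylist?(remote, local)
  remote.reject { |key, _| DENYLIST.include?(key) } ==
    local.reject { |key, _| DENYLIST.include?(key) }
end

# Allowlist: compare only the listed keys.
def equal_by_allowlist?(remote, local)
  remote.select { |key, _| ALLOWLIST.include?(key) } ==
    local.select { |key, _| ALLOWLIST.include?(key) }
end

local = { "display_name" => "High CPU", "combiner" => "OR", "conditions" => [] }
# Simulate the API starting to return a field we never set ("validity" is made up).
remote = local.merge("name" => "projects/example/alertPolicies/1", "validity" => {})

puts equal_by_denylist?(remote, local)   # false: the unknown field counts as a difference
puts equal_by_allowlist?(remote, local)  # true: the unknown field is ignored
```

The flip side is that an allowlist also ignores new fields that do matter, so the list would have to be maintained by hand.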
I don't think we're going to have any issues related to Stackdriver's API changes unless we update google-cloud-monitoring / google-cloud-logging gems.
I opened gpii-ops/exekube#28 to pin them down to their current versions.
Do you think an allowlist-based approach (only compare these keys) would be better than a denylist-based approach
I don't think this matters, because of the previous point.
Perhaps we should apply this PR (one last band-aid), see how much it improves alerting stability, and then re-evaluate?
Agree with this.
To clarify - I don't think it should be the number 1 priority, at least not before we deal with the security stuff for NOVA-FERPA. I would also try to wait till TF has support for all the types of resources we need. But once that happens (and we're sure we're staying with Stackdriver), I would definitely prioritise this.
I think that's a good place to add this.
LGTM, and agreed to keep tracking the resources that are implemented in TF in order to reduce this Ruby code.
@mrtyler Replied in thread. IMO the worst-case scenario at the moment is that the resource comparison code thinks the resources are different and applies them anyway.
@amatas @stepanstipl @mrtyler Thanks for your reviews!
LGTM
This is fine, but here is an example of the kind of breakage I fear could happen with an upstream change:
This implements comparison functions for Stackdriver resources to prevent unnecessary updates.
Unfortunately, this task turned out to be not so simple, because the google-cloud-monitoring gem does not accept its own JSON primitives without some attribute correction. I also had to update all resources to reflect their actual state in Stackdriver.
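The comparison idea can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: plain hashes stand in for the gem's resource objects, and `normalize` and `needs_update?` are hypothetical helper names. The ignored attributes and the JSON round trip mirror the diff hunk in this PR.

```ruby
require "json"

# Attributes set by Stackdriver itself; they never appear in our local definitions.
SERVER_MANAGED = ["name", "creation_record", "mutated_by", "mutation_record"].freeze

# Round-trip through JSON so symbol keys become strings and both sides
# share the same primitive types, then drop the server-managed attributes.
def normalize(resource_hash)
  normalized = JSON.parse(resource_hash.to_json)
  SERVER_MANAGED.each { |attribute| normalized.delete(attribute) }
  normalized
end

# True when the remote state differs from the local definition.
def needs_update?(remote_hash, local_hash)
  normalize(remote_hash) != normalize(local_hash)
end

remote = { name: "projects/example/alertPolicies/123", display_name: "High CPU", enabled: true }
local = { "display_name" => "High CPU", "enabled" => true }

puts needs_update?(remote, local)                            # false: skip the update
puts needs_update?(remote, local.merge("enabled" => false))  # true: apply the change
```

If the comparison is ever wrong in the "different" direction, the resource is simply applied again, which matches the worst case described in the thread.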