Skip to content
This repository was archived by the owner on Sep 21, 2020. It is now read-only.

[GPII-3585]: Improve Stackdriver Ruby client to not update resources unnecessarily #231

Merged
merged 5 commits into from
Dec 14, 2018

Conversation

natarajaya
Copy link
Contributor

This implements comparison functions for Stackdriver resources to prevent unnecessary updates.
Unfortunately this task turned to be not so simple, because google-cloud-monitoring gem does not accept its own json primitives without some attribute correction.
I also had to update all resources to reflect their actual state in Stackdriver.

@natarajaya natarajaya self-assigned this Dec 12, 2018
@stepanstipl
Copy link
Contributor

LGTM. Thanks @natarajaya for this. I must note this seems quite a tedious process (in particular updating the resources) and I think we should have a look at Terraform to manage this as soon as it has all the functionality we need.

Alerting policies are there - https://www.terraform.io/docs/providers/google/r/monitoring_alert_policy.html
Uptime checks merged - hashicorp/terraform-provider-google#2502
Notification channels merged - hashicorp/terraform-provider-google#2452
Monitoring groups (not sure what these are for) - hashicorp/terraform-provider-google#2451

... what else are we waiting for?

@natarajaya
Copy link
Contributor Author

what else are we waiting for?

@stepanstipl Looks like only log-based metrics implementation is missing.

we should have a look at Terraform to manage this as soon as it has all the functionality we need.

I don't think this should be a priority, but agree that getting rid of this custom code definitely benefits us long term. Feel free to start working on alternative implementation :)

@mrtyler
Copy link
Contributor

mrtyler commented Dec 13, 2018

this seems quite a tedious process (in particular updating the resources)

Agreed. What are the consequences for us when the API changes? Can we end up in a situation where an alerting policy silently stops working, leaving us without monitoring of critical infrastructure? Or will the malformed alerting policy cause an error... immediately? during the next deployment?

I don't think this should be a priority, but agree that getting rid of this custom code definitely benefits us long term.

Agreed.

Feel free to start working on alternative implementation :)

Disagreed :). However, do feel free to add these notes to the Too Hot For October ticket for later discussion.

@@ -68,6 +68,36 @@ def get_alert_policy_identifier(alert_policy)
return result
end

def compare_alert_policies(stackdriver_alert_policy, alert_policy)
stackdriver_alert_policy = JSON.parse(stackdriver_alert_policy.to_hash.to_json)
["name", "creation_record", "mutated_by", "mutation_record"]. each do |attribute|
Copy link
Contributor

@mrtyler mrtyler Dec 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am concerned about how tightly coupled this is to the (apparently rapidly-evolving) Stackdriver API. I foresee a lot of build failures caused by Stackdriver's JSON format changing periodically.

Do you think an allowlist-based approach (only compare these keys) would be better than a denylist-based approach (compare all but these keys, as you're doing now)? It might provide us a little insulation from API changes.

Other than this, I'm not really sure what to do. Maybe we should follow Stepan's suggestion and look at moving this functionality back to Terraform in the short-term. Perhaps we should apply this PR (one last band-aid), see how much it improves alerting stability, and then re-evaluate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we're going to have any issues related to Stackdriver's API changes unless we update google-cloud-monitoring / google-cloud-logging gems.

I opened gpii-ops/exekube#28 to pin them down to their current versions.

Do you think an allowlist-based approach (only compare these keys) would be better than a denylist-based approach

Don't think this matters, because of previous point.

Perhaps we should apply this PR (one last band-aid), see how much it improves alerting stability, and then re-evaluate?

Agree with this.

@stepanstipl
Copy link
Contributor

I don't think this should be a priority, but agree that getting rid of this custom code definitely benefits us long term. Feel free to start working on alternative implementation :)

To clarify - I don't think it should be number 1 priority, at least not before we deal with security stuff for NOVA-FERPA, I would also try to wait till TF has support for all the types of resources we need, but once that happens (and we're sure we're staying with Stackdriver) I would definitely prioritise this

Disagreed :). However, do feel free to add these notes to the Too Hot For October ticket for later discussion.

I think that's a good place to add this to.

@amatas
Copy link
Contributor

amatas commented Dec 13, 2018

LGTM, and agreed to keep tracking the resources that are implemented in TF in order to reduce this ruby code.

@natarajaya
Copy link
Contributor Author

What are the consequences for us when the API changes? Can we end up in a situation where an alerting policy silently stops working, leaving us without monitoring of critical infrastructure?

@mrtyler Replied in thread. IMO worst case scenario that we may have at the moment is that resource comparison code going to think that resources are different and apply them anyway.

@natarajaya
Copy link
Contributor Author

@amatas @stepanstipl @mrtyler Thanks for your reviews!

Copy link
Contributor

@mrtyler mrtyler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

This is fine, but here is an example of the kind of breakage I fear could happen with an upstream change:

@natarajaya natarajaya merged commit 97de31f into gpii-ops:master Dec 14, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants