- These notes are part of a broader set of principles
- Practices in this section contribute to service reliability
- See also observability
- This is related to:
-
Configure all infrastructure using declarative code such as Terraform or CloudFormation (see everything as code).
-
Automate monitoring and alerting (see automate everything and observability).
-
Prefer serverless platform as a service (PaaS) over infrastructure as a service (IaaS) (see outsource bottom up).
-
Where serverless is not an option, use ephemeral and immutable infrastructure.
-
Engage your cloud supplier early on in the development process. They have various tools and processes to help you (e.g. AWS Well-Architected Review).
-
Understand cloud supplier SLAs.
-
Make systems self-healing.
- Prefer technologies which are resilient by default.
- Favour global-scoped (e.g. CloudFront or Front Door) or region-scoped services (e.g. S3, Lambda, Azure Functions) to availability-zone (AZ) scoped (e.g. VMs, RDS DBs) or single-instance services (e.g. EC2 instance storage).
- For AZ-scoped services, use redundancy to create required resilience (e.g. AWS Auto Scaling Groups or Azure Scale/Availability Sets), and:
- For stateless components, use active-active configurations across AZs (e.g. running stateless containers across multiple AZs using Amazon Elastic Kubernetes Service).
- For stateful components, e.g. databases, consider active-active configurations across AZs (e.g. Aurora Multi-Master), but be aware of the added complexity that conflict resolution brings with asynchronous replication, and of the potential performance impact where synchronous replication is chosen.
- Consider use of multiple regions (e.g. for AWS eu-west-1 [Dublin] as well as eu-west-2 [London]) as a way to improve availability, though ensure data sovereignty implications are understood and accepted (see below).
- Understand failover (e.g. RDS failover) and failed instance replacement times and engineer to accommodate these.
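One way to engineer for a known failover time is to make sure client retries span the whole failover window. The sketch below is illustrative only: the 60-second window, the function names and the choice of `ConnectionError` as the retryable error are assumptions, not measured values or a prescribed approach. Measure your own failover and instance replacement times and plug them in.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_schedule(failover_seconds: float, base: float = 1.0,
                     factor: float = 2.0, cap: float = 30.0) -> List[float]:
    """Return exponential retry delays whose total wait covers the failover window."""
    if failover_seconds <= 0 or base <= 0 or factor < 1 or cap <= 0:
        raise ValueError("failover_seconds, base and cap must be > 0 and factor >= 1")
    delays: List[float] = []
    delay = base
    # Keep adding (capped) delays until the cumulative wait spans the window.
    while sum(delays) < failover_seconds:
        delays.append(min(delay, cap))
        delay *= factor
    return delays


def call_with_retries(operation: Callable[[], T], delays: List[float],
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation, waiting out each delay after a ConnectionError."""
    for delay in delays:
        try:
            return operation()
        except ConnectionError:
            sleep(delay)
    # Final attempt after the whole window has elapsed; any error propagates.
    return operation()


# Hypothetical example: a failover measured at roughly 60 seconds.
schedule = backoff_schedule(60)
```

In practice you would also add jitter to the delays so that many clients do not retry in lockstep once the failover completes.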
-
Be aware of data sovereignty implications of using any systems hosted outside the UK.
- Make sure your information governance lead is aware and included in decision making.
- Consider SaaS tools the team uses as well as the systems we build.
-
Services should scale automatically up and down.
- If possible, drive scaling based on metrics which matter to users (e.g. response time), but balance this with the benefits of choosing leading indicators (e.g. CPU usage) so that slow scaling does not impact user experience.
- Understand how rapidly demand can spike and ensure scaling can meet these requirements. Balance scaling needs with the desire to avoid over-provisioning, and use pre-warming judiciously where required. Discuss this with the cloud provider well before go-live; they can assist with pre-warming processes (AWS).
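The balance between a user-facing metric and a leading indicator can be sketched as a proportional scaling rule that evaluates both and scales for whichever asks for more capacity. This is a simplified illustration, not the algorithm of any particular cloud autoscaler: the metric names, targets and instance limits are hypothetical.

```python
import math
from typing import Dict, Tuple


def desired_instances(current: int, metrics: Dict[str, Tuple[float, float]],
                      minimum: int = 1, maximum: int = 20) -> int:
    """Propose an instance count from several (observed, target) metric pairs.

    Each metric proposes ceil(current * observed / target); the largest
    proposal wins, so the most demanding metric drives scaling.
    """
    if current < 1:
        raise ValueError("current must be at least 1")
    proposals = [math.ceil(current * observed / target)
                 for observed, target in metrics.values()]
    wanted = max(proposals, default=current)
    # Clamp to the configured bounds to avoid runaway scaling in either direction.
    return max(minimum, min(maximum, wanted))


# Hypothetical readings: CPU (leading indicator) is over target while
# p95 response time (user-facing) is still within target.
print(desired_instances(4, {"cpu_percent": (90, 60), "p95_ms": (300, 400)}))
```

Here the CPU reading triggers a scale-out before response time has degraded, which is the point of including a leading indicator alongside the metric users actually experience.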
-
As a rule of thumb, where you are using inelastic infrastructure, aim for 80% utilisation.
- Don't let utilisation rise far enough that a single instance failing would cause an outage.
- Too high utilisation will cause latency problems. Know what your performance SLOs are to understand how much latency headroom you have.
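The single-instance-failure point can be checked with simple arithmetic. The sketch below assumes load is spread evenly across identical instances and is redistributed evenly when one fails; the function names are illustrative.

```python
def utilisation_after_failure(instances: int, utilisation: float, failed: int = 1) -> float:
    """Utilisation of the surviving instances once `failed` instances are lost."""
    if instances <= failed:
        raise ValueError("at least one instance must survive")
    # Total load (utilisation * instances) is shared by the survivors.
    return utilisation * instances / (instances - failed)


def max_safe_utilisation(instances: int, failed: int = 1, ceiling: float = 1.0) -> float:
    """Highest steady-state utilisation that keeps survivors at or below `ceiling`."""
    if instances <= failed:
        raise ValueError("at least one instance must survive")
    return ceiling * (instances - failed) / instances


# Three instances at 80% cannot absorb the loss of one: survivors would need 120%.
print(utilisation_after_failure(3, 0.8))
# Five instances at 80% leave the survivors at full capacity, with no latency headroom.
print(utilisation_after_failure(5, 0.8))
```

Setting `ceiling` below 1.0 (for example, the utilisation at which latency starts to breach your SLOs) gives a more realistic target than the raw capacity limit.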
-
Keep up to date.
- Services/components need prompt updates to dependencies where security vulnerabilities are found — even if they are not under active development.
- Services which use deprecated or unsupported technologies should be migrated onto alternatives as a priority.
-
Understand and be able to justify vendor lock-in (see outsource from the bottom up).
-
Build in governance as a side effect, e.g.
-
Segregate production and non-production workloads
Production and non-production workloads should be deployed into separate cloud subscriptions (Azure) or accounts (AWS) to enforce clear security boundaries, reduce risk of accidental impact and simplify policy enforcement. This separation enables:
- Tighter access control for production, ensuring only the necessary users and automation have access
- Application of different Azure or AWS policies and guardrails (e.g. cost controls, logging requirements, monitoring sensitivity)
- Easier environment-specific cost tracking (especially in showback/chargeback models)
- Safer testing and change validation, supporting the DevOps approach of "rapid, iterative and incremental change" through controlled progression across environments (e.g. Dev → Int → NFT → Preprod → Prod)
This structure is also aligned with the Cloud Adoption Frameworks for Azure and AWS, which recommend using subscriptions (Azure) or accounts (AWS) as units of governance and risk isolation.
-
Segregate products
Each product should operate within its own set of cloud subscriptions (Azure) or accounts (AWS), rather than being co-located with other products in a large shared environment. This aligns infrastructure with product boundaries, enabling:
- Empowered and autonomous teams (a core principle) to own, operate and iterate on their environments independently, enabling clear ownership and accountability
- Improved cost attribution for budgeting and forecasting, essential for long-living products supported by outcome teams
- Reduced risk of cross-product failure, misconfiguration or conflicting changes, and their blast radius
- Better alignment to Conway’s Law, Team Topologies and Domain-Driven Design, where infrastructure reflects the structure and ownership of the team, accelerating delivery and supporting flow
- Scalable approach to managing the product lifecycle: as each product evolves, is replatformed or retired, its resources can be managed in isolation
By segregating subscriptions or accounts per product, we can reduce friction between teams, improve lifecycle management and support the "you build it, you run it" DevOps approach.
-
Infrastructure must be tagged to identify the service so that unnecessary resources don't go unnoticed.
AWS Config rule to identify EC2 assets not tagged with "CostCenter" and "Owner":
{
  "ConfigRuleName": "RequiredTagsForEC2Instances",
  "Description": "Checks whether the CostCenter and Owner tags are applied to EC2 instances.",
  "Scope": {
    "ComplianceResourceTypes": ["AWS::EC2::Instance"]
  },
  "Source": {
    "Owner": "AWS",
    "SourceIdentifier": "REQUIRED_TAGS"
  },
  "InputParameters": "{\"tag1Key\":\"CostCenter\",\"tag2Key\":\"Owner\"}"
}
Further reading: AWS Config
TO DO: Azure equivalent
- Configure audit tools such as CloudTrail.
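The tagging requirement above can also be checked earlier, for example in a pipeline step that inspects planned resources before they are deployed. The sketch below is a minimal, provider-neutral illustration and not a replacement for the AWS Config rule: the resource identifiers and the shape of the input dictionary are hypothetical, and treating an empty tag value as missing is a choice made here rather than something the managed rule is stated to do.

```python
from typing import Dict, Iterable, List

# Tag keys taken from the AWS Config example above.
REQUIRED_TAGS = ("CostCenter", "Owner")


def missing_tags(tags: Dict[str, str],
                 required: Iterable[str] = REQUIRED_TAGS) -> List[str]:
    """Return the required tag keys that are absent or blank."""
    return [key for key in required if not tags.get(key, "").strip()]


def untagged_resources(resources: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Map each non-compliant resource id to the tag keys it is missing."""
    report: Dict[str, List[str]] = {}
    for resource_id, tags in resources.items():
        missing = missing_tags(tags)
        if missing:
            report[resource_id] = missing
    return report


# Hypothetical inventory keyed by resource id.
inventory = {
    "i-aaa": {"CostCenter": "1234", "Owner": "team-a"},
    "i-bbb": {"Owner": "team-b"},
    "i-ccc": {},
}
print(untagged_resources(inventory))
```

A check like this fails fast in the pipeline, while the cloud-side rule remains the backstop for resources created by other routes.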
-