Reliability
The reliability pillar consists of design principles that focus on ensuring your architecture can recover from service failures and grow its resources on demand. Reliability in the cloud also means that disruptions can be mitigated with relative ease. Sound great? It is. A major factor in the popularity of the public cloud is the ability to dramatically increase your IT reliability without the massive expenditures that would be required in a traditional IT environment.
Here are the design principles around this pillar:
Automate failure recovery as much as possible: Just as automation strengthens security controls, it makes sense to carefully monitor your AWS solutions for issues and automate the appropriate responses. Think about AWS Auto Scaling as an example. If your AWS solution is in danger of running low on resources, AWS can automatically add more; once things settle back down to normal, AWS Auto Scaling can reduce the resources consumed. (A minimal sketch of such a scaling policy appears after this list.) In this specific example, you even get the added benefit of cost optimization, which is one of the upcoming pillars.
Test recovery: We tend to practice and refine our backup strategies, but unfortunately, we rarely test our restore procedures. If your organization falls into this trap, you might face chaos when it comes time to actually restore things. The AWS Cloud allows you to fully test recovery for all kinds of failure scenarios. In fact, if you followed the previous design principle of automating failure recovery, then what you are really testing here is the automation you set up.
Automatically scale horizontally when needed: With AWS, you can scale vertically or horizontally. For example, with a solution built on EC2 virtual machines, you could scale vertically by adding more resources to a single EC2 VM. AWS recommends avoiding this approach when you can. Rather, AWS wants you to scale horizontally, which means adding more small, efficient VMs to handle the increase in demand. Of course, you need to distribute client requests across these different VMs, typically with a load balancer. Once you have all that set up, you gain the benefit of having no single point of failure, which increases reliability even more. (A sketch of this setup follows the list.)
Stop guessing at capacity for IT resources: This design principle is so important that it appeared in the list of general recommendations earlier in this chapter. Notice that it addresses a problem that often occurs in traditional IT environments: engineers are forced to guess at the capacity their solutions need, and at some point the guess proves wrong. Resource starvation sets in, and there is a mad scramble, typically accompanied by a big jump in spending, to fix the issues. In the cloud, with tools like Auto Scaling at your disposal, you do not need to guess at capacity at all. You have the massive power and quantity of AWS resources at your disposal, along with the benefit of AWS’s economies of scale. (Remember that we covered economies of scale with AWS in Chapter 2, “Some Benefits of the AWS Cloud.”)
Manage changes through automation: Yes, it’s automation again. Earlier we established that changes should be small and reversible. To this list we now add the excellent characteristic of automated. Let’s say you know you are going to need additional AWS user accounts with monitoring privileges for the EC2 and Lambda resources running in your solution. You should create a script that automates the creation of these accounts. This is easy to do, thanks to the AWS Command Line Interface (CLI); a sample script appears at the end of this section.
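To make the first principle concrete, here is a minimal sketch of a target-tracking scaling policy created with the AWS CLI. The group name my-web-asg and the 50 percent CPU target are hypothetical values chosen for illustration, not part of the example above:

    # Keep the group's average CPU utilization near 50 percent by
    # adding instances under load and removing them afterward.
    # my-web-asg is a hypothetical Auto Scaling group name.
    aws autoscaling put-scaling-policy \
        --auto-scaling-group-name my-web-asg \
        --policy-name keep-cpu-near-50 \
        --policy-type TargetTrackingScaling \
        --target-tracking-configuration '{
            "PredefinedMetricSpecification":
                {"PredefinedMetricType": "ASGAverageCPUUtilization"},
            "TargetValue": 50.0
        }'

With a policy like this in place, AWS adds capacity when demand threatens to starve the solution and sheds it when things settle back down, with no human in the loop.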
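The horizontal scaling principle usually takes the form of an Auto Scaling group paired with a load balancer. The following sketch, again using the AWS CLI, spreads instances across two subnets (two Availability Zones) and registers them with an Elastic Load Balancing target group; every name, subnet ID, and ARN shown is a placeholder:

    # Launch between 2 and 10 small VMs from a launch template and
    # register them with a target group so client requests are
    # distributed across all of them. All names and IDs are
    # hypothetical placeholders.
    aws autoscaling create-auto-scaling-group \
        --auto-scaling-group-name my-web-asg \
        --launch-template LaunchTemplateName=my-web-template,Version='$Latest' \
        --min-size 2 --max-size 10 \
        --vpc-zone-identifier "subnet-aaaa1111,subnet-bbbb2222" \
        --target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-web-tg/0123456789abcdef

Because no single VM handles all of the traffic, and the group spans more than one Availability Zone, there is no single point of failure.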
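Finally, here is a minimal sketch of the kind of account-creation script described in the last principle. The user name is hypothetical, and the script grants monitoring-style privileges by attaching the AWS managed read-only policies for EC2 and Lambda; your organization might prefer its own custom policies:

    #!/bin/bash
    # Create a user with read-only (monitoring) access to EC2 and Lambda.
    # ec2-lambda-monitor is a hypothetical user name.
    USER_NAME="ec2-lambda-monitor"
    aws iam create-user --user-name "$USER_NAME"
    aws iam attach-user-policy --user-name "$USER_NAME" \
        --policy-arn arn:aws:iam::aws:policy/AmazonEC2ReadOnlyAccess
    aws iam attach-user-policy --user-name "$USER_NAME" \
        --policy-arn arn:aws:iam::aws:policy/AWSLambdaReadOnlyAccess

A script like this keeps the change small, repeatable, and reversible: aws iam detach-user-policy and aws iam delete-user undo it just as automatically.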