Operational Excellence
The overall objective of the operational excellence pillar is to run and monitor systems so that they deliver value toward the business goals of the organization. This pillar focuses on best practices for running and monitoring systems and for continuously improving processes and procedures.
Note
While many of us in technology find the cloud incredibly “cool,” we should never use technology just because it is very clever and exciting. Instead, we should use technology because it assists our organization in achieving the most important business objectives.
This pillar consists of the following important design principles:
Perform operations as code: When you are really humming along with operational excellence, you will be constructing your cloud infrastructure and services as code. The popular acronym for this is IaC, which is short for infrastructure as code. Why the obsession with doing everything as code? Because it reduces human error and keeps your operations consistent and repeatable.
Make frequent, small, and reversible changes to the architecture to improve it: A big part of this pillar is ensuring that your solutions in AWS continue to evolve to help you achieve your business goals. When changes are small and reversible, it is much easier to roll back a change that produces undesirable results for you or your customers.
Refine your operational procedures frequently to improve them: When implementing operational procedures, you should always look for opportunities to improve them. As your workload evolves, ensure that your procedures evolve with it. You should consider scheduling routine “game days” to assess and confirm the effectiveness of all procedures and to ensure that your teams are well acquainted with them.
Anticipate failures and have recovery plans in place: To meet this design goal, you should be engaged in testing, testing, and even more testing. Test failures and test your responses. Test how your teams react to failures, and try to turn unknowns into known facts moving forward. It is much easier to operate in the face of adversity if you have thoroughly tested your failure responses and know that your recovery procedures are rock solid.
Learn from any operational failures in your architecture: You should promote the evolution of your AWS solutions by extracting insights from every operational event and—perhaps even more importantly—every failure. Of course, once you gain this knowledge, it is important to disseminate it to all teams and throughout the entire organization.
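To make a few of these principles concrete, here is a minimal Python sketch of an operation expressed as code: a small change that is applied only if a health check passes and is otherwise rolled back, with every outcome recorded so that it can be reviewed later. It does not call any AWS service. The names Change, deploy, is_healthy, and the instance_count setting are all hypothetical and exist only for illustration; a real workload would more likely rely on an infrastructure as code tool and a deployment pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical in-memory stand-in for an environment's configuration.
Config = Dict[str, int]


@dataclass(frozen=True)
class Change:
    """One small change to a single setting, described as data."""
    name: str
    key: str
    new_value: int

    def apply(self, config: Config) -> Config:
        # Return a modified copy so the previous configuration stays intact.
        updated = dict(config)
        updated[self.key] = self.new_value
        return updated


def deploy(
    config: Config,
    change: Change,
    healthy: Callable[[Config], bool],
    log: List[Tuple[str, str]],
) -> Config:
    """Apply a change, keeping it only if the health check passes."""
    candidate = change.apply(config)
    if healthy(candidate):
        log.append((change.name, "applied"))
        return candidate
    # Roll back by returning the untouched previous configuration.
    log.append((change.name, "rolled back"))
    return config


def is_healthy(config: Config) -> bool:
    # Assumed rule for this sketch: between 1 and 10 instances is healthy.
    return 1 <= config["instance_count"] <= 10


log: List[Tuple[str, str]] = []
current: Config = {"instance_count": 2}
current = deploy(current, Change("scale out", "instance_count", 4), is_healthy, log)
current = deploy(current, Change("scale to zero", "instance_count", 0), is_healthy, log)

print(current)  # {'instance_count': 4}
print(log)      # [('scale out', 'applied'), ('scale to zero', 'rolled back')]
```

Because each change touches a single setting and the previous configuration is never modified, rolling back amounts to keeping the old value. The log gives teams a simple record of what was attempted and what failed.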