At Chaos Gears, we help our customers, companies of all shapes and sizes, utilize the AWS cloud to its full potential so they can focus on evolving their business while AWS does the heavy lifting for them.
One of these customers, a startup in the medical industry, has gained global reach and thus, to serve its clients, operates in multiple AWS Regions (currently ten, with more scheduled to come) spanning many a timezone.
At the center of each region, there's an [EC2](https://aws.amazon.com/ec2/) instance that, by design, occasionally maxes out on the CPU.
When that happens, a member of the operations team runs a set of checks to determine whether the instance is reachable; if it is not, it gets restarted.
Once restarted and back online, which takes a few minutes, the same round of checks recommences.
More often than not, this proves sufficient, and the on-call engineer handling the issue can get back to work, sleep, or whatever else they were doing when the situation arose.
Being a startup lacking the resources to man a follow-the-sun operations team, our customer came to us requesting a simple, adjustable, and cost-effective solution that would relieve their engineers from this operational burden.
This post looks at such a solution, a multi-regional first-line of support.
## Infrastructure as Code
In today's world of agile software development, we treat everything as code, or at least we should be doing that. Hence, the first decision we made was to bet on Cloud Development Kit (CDK), [a multi-language software development framework for modelling cloud infrastructure as reusable components](https://www.youtube.com/watch?v=ZWCvNFUN-sU), as our [Infrastructure as Code (IaC)](https://en.wikipedia.org/wiki/Infrastructure_as_code) tool.
Our customer's software engineers were already familiar with [TypeScript](https://www.typescriptlang.org/) (the language we chose to build out the infra with), which meant they'd comprehend the final solution quickly.
Moreover, we avoided the steep learning curve of mastering a [domain-specific language (DSL)](https://en.wikipedia.org/wiki/Domain-specific_language) and the additional burden of handling an unfamiliar codebase.
The recent introduction of CDK integration with AWS SAM, which is [now in public preview](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-cdk-getting-started.html), [allows developing serverless applications seamlessly within an AWS CDK project](https://aws.amazon.com/blogs/compute/better-together-aws-sam-and-aws-cdk/).
On top of all of that, we could reuse the existing software tooling like linters and apply the industry's coding best practices.
The adage says that "No server is easier to manage than no server," and with that in mind, we turned to AWS Step Functions, [a serverless orchestrator for AWS Lambda functions and other AWS services](https://docs.aws.amazon.com/step-functions/latest/dg/welcome.html).
The challenge at hand was perfect for an event-driven architecture, and we had already envisioned the subsequent steps of the verification process (a URL health check, [a Route 53 health check](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/dns-failover.html), an SSH check, a restart, etc.) as distinct [Lambda](https://aws.amazon.com/lambda/) functions passing the event object between themselves.
We needed the glue, and with AWS Step Functions, we effortlessly combined all those pieces [without worrying about server provisioning, maintenance, retries, and error handling](https://aws.amazon.com/step-functions/?step-functions.sort-by=item.additionalFields.postDateTime&step-functions.sort-order=desc).
We had the backbone figured out, but we still had to decide how to monitor the CPU usage on the EC2 instances and notify the AWS Step Functions state machine of a breach.
It screamed of [Amazon CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) for the metric monitoring bit and [EventBridge (formerly CloudWatch Events)](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html) for a rule routing the alarm event to the target (a state machine in our case).
When the `CPUUtilization` metric for a given instance reaches 100%, a CloudWatch alarm enters the `alarm` state.
This state change gets picked up by an Amazon EventBridge rule that triggers the AWS Step Functions state machine.
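As a sketch of that wiring (using CDK v2-style imports for brevity; the `instanceId` and `stateMachine` values are assumed to come from elsewhere in the app, and the construct IDs are hypothetical), the metric, alarm, and rule combo could look along these lines:

```typescript
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';

// This fragment belongs inside a stack's constructor;
// `instanceId` and `stateMachine` are assumed inputs.
declare const instanceId: string;
declare const stateMachine: sfn.IStateMachine;

// The CPUUtilization metric of the monitored EC2 instance
const metric = new cloudwatch.Metric({
  namespace: 'AWS/EC2',
  metricName: 'CPUUtilization',
  dimensionsMap: { InstanceId: instanceId },
});

// The alarm enters the ALARM state once the instance maxes out on CPU
const alarm = new cloudwatch.Alarm(this, 'CpuAlarm', {
  metric,
  threshold: 100,
  evaluationPeriods: 1,
  comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
});

// An EventBridge rule routes the alarm's state change to the state machine
const rule = new events.Rule(this, 'CpuAlarmRule', {
  eventPattern: {
    source: ['aws.cloudwatch'],
    detailType: ['CloudWatch Alarm State Change'],
    detail: { alarmName: [alarm.alarmName], state: { value: ['ALARM'] } },
  },
});
rule.addTarget(new targets.SfnStateMachine(stateMachine));
```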
Upon receiving the event object from the Amazon EventBridge rule, the state machine orchestrates the following workflow:
1. Three checks run, one after another (a URL check, a Route 53 check, and an SSH check).
2. If all checks succeed during the first run, the execution ends silently (the `All good` step followed by the `End` field).
3. When a check fails, the EC2 instance is restarted, and we recommence from the beginning with a second run.
4. If all checks succeed during the second run, a Slack notification is sent, and the execution ends (the `Slack` step followed by the `End` field).
5. When a check fails during the second run, the OpsGenie alert is created, and the execution ends (the `OpsGenie` step followed by the `End` field).
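The five steps above could be wired up in CDK roughly as follows. This is a sketch under stated assumptions, not the customer's actual definition: the Lambda constructs are assumed to be passed in from the per-function stacks, and the boolean `healthy` flag in the event payload is a hypothetical convention for the checks' verdict.

```typescript
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';

// The Lambda functions are assumed to come from the other stacks
interface StateMachineStackProps extends StackProps {
  fns: { [name: string]: lambda.IFunction }; // urlCheck, route53Check, sshCheck, restart, slack, opsGenie
}

export class StateMachineStack extends Stack {
  constructor(scope: Construct, id: string, props: StateMachineStackProps) {
    super(scope, id, props);

    // Each Lambda invocation forwards its payload to the next state
    const invoke = (stateName: string, fn: lambda.IFunction) =>
      new tasks.LambdaInvoke(this, stateName, { lambdaFunction: fn, outputPath: '$.Payload' });

    // 3-5. After a restart, the checks run once more (state names must be unique)
    const secondRun = invoke('Restart', props.fns.restart)
      .next(invoke('URL check (2nd run)', props.fns.urlCheck))
      .next(invoke('Route 53 check (2nd run)', props.fns.route53Check))
      .next(invoke('SSH check (2nd run)', props.fns.sshCheck))
      .next(new sfn.Choice(this, 'Second run OK?')
        .when(sfn.Condition.booleanEquals('$.healthy', true), invoke('Slack', props.fns.slack))
        .otherwise(invoke('OpsGenie', props.fns.opsGenie)));

    // 1-2. The first run of the three checks, ending silently when all pass
    const definition = invoke('URL check', props.fns.urlCheck)
      .next(invoke('Route 53 check', props.fns.route53Check))
      .next(invoke('SSH check', props.fns.sshCheck))
      .next(new sfn.Choice(this, 'First run OK?')
        .when(sfn.Condition.booleanEquals('$.healthy', true), new sfn.Succeed(this, 'All good'))
        .otherwise(secondRun));

    new sfn.StateMachine(this, 'CpuCheckStateMachine', { definition });
  }
}
```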
Here's the diagram depicting the complete solution:
All of the aforementioned resources, plus the Lambda functions, an S3 bucket for the Lambda code packages, and all the necessary IAM roles and policies, are created and managed by AWS CDK and AWS SAM.
Furthermore, this solution can be deployed effortlessly to multiple regions using [AWS CDK's environments](https://docs.aws.amazon.com/cdk/latest/guide/environments.html).
## Dissecting the code
I clone the repo and enter it:
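For reference, that boils down to (the repository being the one linked at the end of this post):

```shell
# Grab the code from GitHub and enter the project directory
git clone https://github.com/rafalkrol-xyz/aws-blog-cpu-check-customer-story.git
cd aws-blog-cpu-check-customer-story
```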
In the project's root directory, I see [the `tsconfig.json` file responsible for configuring the TypeScript compiler](https://www.typescriptlang.org/docs/handbook/tsconfig-json.html). This configuration serves the entire project since we use TypeScript for both the infrastructure and application layers.
Using a single language for both layers enables and encourages [the DevOps culture](https://www.youtube.com/watch?v=mBU3AJ3j1rg) by making the end-to-end development experience more uniform: you can use familiar tools and frameworks across your entire stack.
Now, let's take apart the `bin/cpu-check-cdk.ts` file, the point of entry to our CDK app, whence all stacks are instantiated:
I. We import all of the necessary dependencies - [in CDK v2, which is now in the developer preview, all of the CDK libraries are consolidated in one package](https://aws.amazon.com/about-aws/whats-new/2021/04/aws-cloud-development-kit-aws-cdk--v2-and-go-cdk-is-now-available-for-developer-preview/)
II. We check whether all of the necessary environment variables have been set
III. We initialize [the CDK app construct](https://docs.aws.amazon.com/cdk/latest/guide/apps.html#apps_construct)
IV. We grab the regions to which to deploy along with corresponding instance IDs to monitor from [CDK's context](https://docs.aws.amazon.com/cdk/latest/guide/context.html)
V. We create a tags object with the app's version and repo's URL taken directly from [the package.json file](https://nodejs.dev/learn/the-package-json-guide)
VI. Finally, we loop through the map of regions and corresponding instance IDs we created in step IV.
In each region, we produce eight stacks: one for every Lambda function, one for the state machine, and one for the metric, alarm, and rule combo.
Thanks to [the basic programming concept of a for loop](https://en.wikipedia.org/wiki/For_loop), we saved ourselves unnecessary duplication by keeping things [DRY](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself).
Nice and easy, and all in one go, regardless of the number of regions to which we would want to deploy, and mind you, [there are 25 available (with six more to come)](https://aws.amazon.com/about-aws/global-infrastructure/).
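Putting steps I through VI together, a minimal sketch of such an entry point might look as follows. The original file is best read in the repo: the context key (`regions`), the environment-variable name, and the `LambdaStack` call shown in the comment are assumptions made for illustration.

```typescript
#!/usr/bin/env node
// I. In CDK v2, all of the CDK libraries live in the single aws-cdk-lib package
import * as cdk from 'aws-cdk-lib';
// V. The version and repository URL come straight from package.json
import { version, repository } from '../package.json';

// II. Fail fast when a required environment variable is missing (name assumed)
if (!process.env.CDK_DEFAULT_ACCOUNT) {
  throw new Error('The CDK_DEFAULT_ACCOUNT environment variable must be set');
}

// III. The CDK app construct, the root of the construct tree
const app = new cdk.App();

// IV. A region-to-instance-IDs map read from CDK context (e.g. cdk.json)
const regions: Record<string, string[]> = app.node.tryGetContext('regions');

// V. A tags object applied to every stack
const tags = { version, repository: repository.url };

// VI. One set of eight stacks per region
for (const [region, instanceIds] of Object.entries(regions)) {
  const env = { account: process.env.CDK_DEFAULT_ACCOUNT, region };
  // ...instantiate the Lambda, state machine, and metric/alarm/rule stacks here,
  // e.g. new LambdaStack(app, `UrlCheck-${region}`, { env, tags, name: 'url-check' });
}
```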
I won't be going through all of the CDK and Lambda files (though [I strongly encourage you to give the code a thorough review](https://github.com/rafalkrol-xyz/aws-blog-cpu-check-customer-story)).
Notwithstanding, let us see how easy it is to define a [stack](https://docs.aws.amazon.com/cdk/latest/guide/stacks.html) class in CDK looking at the `lib/lambda-stack.ts` file:
I. We import the dependencies
You'll have noticed that there's also a helper function called `capitalizeAndRemoveDashes` amongst the CDK libs.
Since CDK uses a general-purpose programming language, we can introduce any custom logic, just as we would in a _regular_ application.
The `lib/helpers.ts` file looks as follows:
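While the actual implementation lives in the repo, a plausible version of this helper (turning a kebab-case name like `url-check` into `UrlCheck`, handy for construct IDs and resource names) could be:

```typescript
// Converts a kebab-case name into PascalCase by capitalizing
// each dash-separated word and joining them together,
// e.g. "url-check" -> "UrlCheck".
export function capitalizeAndRemoveDashes(input: string): string {
  return input
    .split('-')
    .map((word) => word.charAt(0).toUpperCase() + word.slice(1))
    .join('');
}
```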
II. We extend the default stack properties (like [description](https://docs.aws.amazon.com/cdk/api/latest/docs/@aws-cdk_core.Stack.html#description)) with our ones, setting some as mandatory and some as optional.
III. We start a declaration of the `LambdaStack` class with a `lambdaFunction` [read-only property](https://www.typescriptlang.org/docs/handbook/2/classes.html#readonly) and [a constructor](https://www.typescriptlang.org/docs/handbook/2/classes.html#constructors)
IV. We create a resource name out of the mandatory name property that will be passed in during the class initialization
V. We create an IAM role that the Lambda service can assume. We add the `service-role/AWSLambdaBasicExecutionRole` AWS-managed policy to it, and, if provided, a custom user-managed policy.
VI. We initialize [a construct](https://docs.aws.amazon.com/cdk/latest/guide/constructs.html) of the Lambda function using the role defined in step V and stack properties, or arbitrary defaults if stack properties were not provided.
VII. Finally, we expose the Lambda function object as the class's read-only property we defined in step III. (We're also sure to close our brackets to avoid the implosion of the universe.)
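Assembled from steps I through VII, a hedged sketch of such a stack class could look like this. The prop names, the runtime, and the defaults are assumptions; only the `name` prop, the `lambdaFunction` read-only property, and the `service-role/AWSLambdaBasicExecutionRole` policy are taken from the description above.

```typescript
// I. The dependencies, including our own helper
import { Stack, StackProps, Duration } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as iam from 'aws-cdk-lib/aws-iam';
import { capitalizeAndRemoveDashes } from './helpers';

// II. The default stack props extended with our own, mandatory and optional
interface LambdaStackProps extends StackProps {
  name: string;                // mandatory
  policy?: iam.IManagedPolicy; // optional custom user-managed policy
  timeout?: Duration;          // optional overrides (names assumed)
  memorySize?: number;
}

export class LambdaStack extends Stack {
  // III. The function exposed as a read-only property
  public readonly lambdaFunction: lambda.Function;

  constructor(scope: Construct, id: string, props: LambdaStackProps) {
    super(scope, id, props);

    // IV. A resource name derived from the mandatory name prop
    const resourceName = capitalizeAndRemoveDashes(props.name);

    // V. A role assumable by the Lambda service, with the AWS-managed
    // basic execution policy and, if provided, a custom one
    const role = new iam.Role(this, `${resourceName}Role`, {
      assumedBy: new iam.ServicePrincipal('lambda.amazonaws.com'),
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AWSLambdaBasicExecutionRole'),
      ],
    });
    if (props.policy) {
      role.addManagedPolicy(props.policy);
    }

    // VI & VII. The function construct, falling back to arbitrary defaults,
    // assigned to the read-only property from step III
    this.lambdaFunction = new lambda.Function(this, `${resourceName}Function`, {
      runtime: lambda.Runtime.NODEJS_14_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset(`lambda/${props.name}`),
      role,
      timeout: props.timeout ?? Duration.seconds(30),
      memorySize: props.memorySize ?? 128,
    });
  }
}
```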
In this blog post, I showed how we put together a serverless application running AWS Lambda under the baton of AWS Step Functions to relieve our customer's engineers from some of their operational burdens so that they could focus more on evolving their business.
The described approach could be adapted to serve other needs or cover different cases as AWS Step Functions' visual workflows allow for a superquick translation of business requirements to technical ones.
By using AWS CDK as the IaC tool, we were able to write all of the code in TypeScript, which puts us in an excellent position for future improvements.
We avoided the trap of introducing unnecessary complexity, keeping the codebase concise, approachable, and comprehensible to all team members.
Lastly, [please be sure to check out the GitHub repository](https://github.com/rafalkrol-xyz/aws-blog-cpu-check-customer-story), and if you'd like to learn more about cooperating with Chaos Gears, [please visit our website](https://chaosgears.com/).