Don’t believe in what you see, unless you’ve tested it already

For one of our customers, a digital healthcare company that provides a patient engagement solution, we developed an AWS RDS integrity testing tool based on a serverless approach. We used a combination of AWS Step Functions, AWS Lambda, AWS SSM and AWS KMS. This is not solely a story about a technological concept, but also about a way of dealing with such problems in a startup that finds itself under the pressure of other priorities.

It’s all about the data you store

We all live in a world where data has become a currency. Any potential loss or database downtime means loss of income. Which brings us to the key question: how will you, as an organization, ensure the quality and completeness of your data once a failure occurs?

Essentially, we are all aware of failures, but how many of us have ever tested a restoration from previously created AWS RDS snapshots? Not all of us, certainly. Personally, I’ve seen many examples of a “the cloud has magic powers” attitude and heard many people saying: “we do not need to worry about data, we’ve got regular system snapshots done by AWS”.

Once the failure creeps in (say, a developer “accidentally” deletes a tiny part of the data), it’s too late for guessing. I won’t dwell on how it feels when you have to act fast, especially when dealing with a problem you have never tested before. Let’s just say that your heart races like a speeding train.

Better safe than sorry

An old Russian proverb says: “Trust, but verify”. “For our internal and external ISO 27001 audits, we needed a clear report of the integrity of our encrypted backups. Chaos Gears implemented a method to make the integrity of our backups easy to demonstrate to internal and external auditors”.

Up to this point, our client had RDS databases launched in 8 different AWS regions. Keeping the data consistent and being able to restore it from a particular backup in case of an outage is always at the top of our priorities. Unfortunately, in this case, there was no habit of regular internal testing. So, the right time came with an ISO audit.

Keep it simple

The balance between task automation and the time spent on building automated flows is a question that always sparks long debates. Our team decided to avoid developing something that would introduce additional maintenance overhead and, more importantly, something that would inflate our monthly bill just because we wanted to save some time. That’s why we leveraged AWS Step Functions to keep the whole workflow in a single place.

NOTE: For those who have never worked with AWS Step Functions, it is a serverless function orchestrator that makes it easy to sequence AWS Lambda functions and multiple AWS services into business-critical applications.

If costs are fine, then you’re on the right track

When you use AWS Step Functions, you are charged based on the number of state transitions required to execute your application. After the free tier, which includes 4,000 state transitions per month, you pay $0.025 per 1,000 state transitions. Our client was willing to accept an automated approach as long as a short development period came with low running costs.
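To get a feeling for the numbers, the pricing rule above can be turned into a few lines of arithmetic. This is a rough sketch: the function name and the example workload (60 transitions per run, 240 runs per month) are assumptions made up for illustration, and the result covers state transitions only, not Lambda invocations or the restored RDS instances.

```python
# Pricing figures quoted in the article.
FREE_TRANSITIONS_PER_MONTH = 4_000
PRICE_PER_1000_TRANSITIONS_USD = 0.025


def monthly_cost_usd(transitions_per_run: int, runs_per_month: int) -> float:
    """Estimate the monthly state-transition cost of a workflow."""
    total = transitions_per_run * runs_per_month
    # Only transitions above the free tier are billed.
    billable = max(0, total - FREE_TRANSITIONS_PER_MONTH)
    return billable / 1000 * PRICE_PER_1000_TRANSITIONS_USD


if __name__ == "__main__":
    # Hypothetical workload: one test per day in each of 8 regions (240 runs),
    # each run using about 60 transitions because of the polling loop.
    print(f"${monthly_cost_usd(60, 240):.2f} per month")
```

Even with a generous polling loop, a workload of this size stays well under a dollar per month for the orchestration itself.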

Let’s design the workflow

AWS Step Functions is a great way to build and step through a series of AWS services in a matter of minutes. In this case, it also addressed a few concerns that our client specified as highly important from their perspective:

the new solution cannot lose its state,
error and timeout handling must be simple
(NOTE: Step Functions includes built-in error handling and retry conditions that let you set how many times a function is retried before it goes to a failed state),
little time should be spent on building the flow and operating it afterwards,
a human-friendly report must be produced after each test and stored securely,
the solution has to be auditable.
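The built-in retry and timeout handling mentioned in the list above is declared directly on a state in the Amazon States Language. The sketch below builds such a Task state as a Python dictionary and prints it as JSON. The helper name, the placeholder ARN and the concrete numbers (three attempts, 10 second interval, 300 second timeout) are assumptions chosen for illustration, not values from our production flow.

```python
import json


def task_with_retry(resource_arn: str, next_state: str, max_attempts: int = 3) -> dict:
    """Build an Amazon States Language Task state with a timeout and a retry policy."""
    return {
        "Type": "Task",
        "Resource": resource_arn,
        # Fail the attempt if the function does not answer in time.
        "TimeoutSeconds": 300,
        "Retry": [
            {
                "ErrorEquals": ["States.Timeout", "States.TaskFailed"],
                "IntervalSeconds": 10,   # wait before the first retry
                "MaxAttempts": max_attempts,
                "BackoffRate": 2.0,      # double the wait after each retry
            }
        ],
        "Next": next_state,
    }


if __name__ == "__main__":
    # Placeholder ARN: replace REGION, ACCOUNT_ID and the function name.
    state = task_with_retry(
        "arn:aws:lambda:REGION:ACCOUNT_ID:function:check-rds-status",
        next_state="IsRestored",
    )
    print(json.dumps(state, indent=2))
```

Because the policy lives in the state definition, the Lambda code itself does not need any retry logic.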

Below, you’ll find what we’ve developed as an initial version of the flow which, right now, is on a roadmap for future development.

1. Extract the credentials from the event and save them in AWS SSM, encrypted via AWS KMS;
2. Try to restore the RDS instance from the latest snapshot;
3. Check the RDS status during the restoration;
4. Choice step:
   a. If the new RDS instance is not restored yet, jump to the WAIT step,
   b. If the new RDS instance is successfully restored, jump to the next step;
5. WAIT step: wait for n seconds and then go back to point 3;
6. Get the RDS endpoint;
7. Fetch the credentials from AWS SSM based on the input parameter name, then establish a connection with the new RDS endpoint and run the following simple SQL queries:
   a. Get the Postgres version,
   b. Retrieve the table count,
   c. Retrieve the user count;
8. Save the report in an S3 bucket;
9. Parallel step: destroy the previously created RDS instance and send a Slack notification.
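The restore and polling steps above mostly come down to reading two RDS API responses: the snapshot list and the instance description. Here is a minimal sketch of that logic, written against the response shapes returned by the boto3 RDS client calls `describe_db_snapshots` and `describe_db_instances`. The actual AWS calls are left out so the example runs without credentials, and the function names and sample identifiers are hypothetical.

```python
from datetime import datetime, timezone
from typing import Any


def latest_snapshot_id(snapshots_response: dict[str, Any]) -> str:
    """Pick the newest snapshot that is ready to be restored."""
    available = [
        s for s in snapshots_response["DBSnapshots"] if s.get("Status") == "available"
    ]
    if not available:
        raise LookupError("no available snapshot found")
    newest = max(available, key=lambda s: s["SnapshotCreateTime"])
    return newest["DBSnapshotIdentifier"]


def restore_status(instances_response: dict[str, Any]) -> dict[str, Any]:
    """Summarise the restored instance for the Choice step."""
    instance = instances_response["DBInstances"][0]
    restored = instance["DBInstanceStatus"] == "available"
    result: dict[str, Any] = {"restored": restored}
    if restored:
        # The endpoint is only read once the instance is up.
        endpoint = instance["Endpoint"]
        result["endpoint"] = {"host": endpoint["Address"], "port": endpoint["Port"]}
    return result


if __name__ == "__main__":
    # Hand-written sample data shaped like the API responses.
    snapshots = {
        "DBSnapshots": [
            {"DBSnapshotIdentifier": "rds:demo-2024-01-01", "Status": "available",
             "SnapshotCreateTime": datetime(2024, 1, 1, tzinfo=timezone.utc)},
            {"DBSnapshotIdentifier": "rds:demo-2024-01-02", "Status": "available",
             "SnapshotCreateTime": datetime(2024, 1, 2, tzinfo=timezone.utc)},
            {"DBSnapshotIdentifier": "rds:demo-2024-01-03", "Status": "creating"},
        ]
    }
    print(latest_snapshot_id(snapshots))
    print(restore_status({"DBInstances": [{"DBInstanceStatus": "creating"}]}))
```

Returning a plain `restored` flag keeps the Choice step trivial: it only has to compare a boolean, and the WAIT loop continues until the flag flips.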

IMPORTANT: All of the steps mentioned above work as a job with a try-catch approach. In a nutshell, if any of the steps from 1 to 9 fails, the execution is immediately directed to the “Destroying RDS Instance / Slack Notification” step.
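In Amazon States Language terms, this try-catch behaviour maps to a `Catch` block on each Task state that points at the cleanup state. The sketch below assembles a skeleton of the flow as a Python dictionary. All state names, Lambda function names and the placeholder ARN are hypothetical, and it assumes that every Lambda returns its input enriched with its own result (for example a boolean `restored` field). In this sketch the cleanup state itself carries no `Catch`.

```python
import json

CLEANUP = "DestroyAndNotify"
# Placeholder ARN prefix: replace REGION and ACCOUNT_ID.
ARN_PREFIX = "arn:aws:lambda:REGION:ACCOUNT_ID:function:"


def task(function_name: str, next_state: str) -> dict:
    """Task state that falls through to the cleanup state on any error."""
    return {
        "Type": "Task",
        "Resource": ARN_PREFIX + function_name,
        "Catch": [
            {"ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": CLEANUP}
        ],
        "Next": next_state,
    }


def branch(function_name: str) -> dict:
    """Single-task branch for the final Parallel state."""
    return {
        "StartAt": function_name,
        "States": {
            function_name: {
                "Type": "Task",
                "Resource": ARN_PREFIX + function_name,
                "End": True,
            }
        },
    }


def build_definition(wait_seconds: int = 60) -> dict:
    return {
        "StartAt": "SaveCredentials",
        "States": {
            "SaveCredentials": task("save-credentials", "RestoreFromSnapshot"),
            "RestoreFromSnapshot": task("restore-from-snapshot", "CheckStatus"),
            "CheckStatus": task("check-status", "IsRestored"),
            "IsRestored": {
                "Type": "Choice",
                "Choices": [
                    {"Variable": "$.restored", "BooleanEquals": True, "Next": "GetEndpoint"}
                ],
                "Default": "WaitForRestore",
            },
            # Loop back to the status check after waiting.
            "WaitForRestore": {"Type": "Wait", "Seconds": wait_seconds, "Next": "CheckStatus"},
            "GetEndpoint": task("get-endpoint", "RunSqlChecks"),
            "RunSqlChecks": task("run-sql-checks", "SaveReport"),
            "SaveReport": task("save-report", CLEANUP),
            CLEANUP: {
                "Type": "Parallel",
                "Branches": [branch("destroy-instance"), branch("notify-slack")],
                "End": True,
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_definition(), indent=2))
```

Routing both the success path and every error path into the same Parallel state means the temporary RDS instance is destroyed whether or not the test passed, which keeps costs predictable.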

Next steps

Our company is full of enthusiasts who like to take small steps with tangible results. So, here’s what our next priorities look like:

define more complex SQL queries for tests,
stop passing credentials in the event in point 1 and switch to something more reliable,
deal with the Step Functions execution history limit, which is 90 days. After that, you can no longer retrieve or view the execution history. There is no further quota for the number of closed executions that Step Functions retains.
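For the first item, one way to keep the SQL tests easy to extend is to declare them as data and run them in a loop. The sketch below works with any DB-API style cursor. The three queries are assumptions about how the version, table count and user count checks could be written for Postgres (in particular, "user count" is read here as the number of database users), and the names `CHECKS` and `run_checks` are hypothetical.

```python
from typing import Any, Protocol


class Cursor(Protocol):
    """The small part of a DB-API cursor this sketch relies on."""

    def execute(self, sql: str) -> Any: ...
    def fetchone(self) -> Any: ...


# Check name -> query returning a single value. Adding a more complex
# test means adding one entry here.
CHECKS: dict[str, str] = {
    "postgres_version": "SELECT version()",
    "table_count": (
        "SELECT count(*) FROM information_schema.tables "
        "WHERE table_schema = 'public'"
    ),
    "user_count": "SELECT count(*) FROM pg_catalog.pg_user",
}


def run_checks(cursor: Cursor, checks: dict[str, str] = CHECKS) -> dict[str, Any]:
    """Run every check and collect the first column of the first row."""
    report: dict[str, Any] = {}
    for name, sql in checks.items():
        cursor.execute(sql)
        row = cursor.fetchone()
        report[name] = row[0] if row else None
    return report
```

With a driver such as psycopg2, the object returned by `connection.cursor()` fits this interface, and the resulting dictionary can be serialised as the report that is stored in S3.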

Conclusion

Problem solving is all about seeking the simplest solutions, provided they exist. Honestly, I wouldn't treat serverless AWS services as a remedy for every problem. However, speaking from my own experience, they definitely allow us to quickly check an idea against a working solution. With the example depicted above, I was able to provide a working version for tests after just 2 days, and I didn’t have to think about irrelevant setup issues.


