
Don’t believe in what you see, unless you’ve tested it already

AWS RDS integrity testing tool based on a serverless approach

Author
Sebastian Respondek

For one of our customers, a digital healthcare company that provides a patient engagement solution, we’ve developed an AWS RDS integrity testing tool based on a serverless approach. We used a combination of AWS Step Functions, AWS Lambda, AWS SSM and AWS KMS. In short, this is not solely a story about a technological concept but also about a way of dealing with such problems in a startup that finds itself under the pressure of other priorities.

It’s all about the data you store

We all live in a world where data has become a new currency. Any potential loss or database downtime means loss of income. Which brings us to the key question: how will you, as an organization, ensure that, once a failure occurs, your data is still complete and of the expected quality? Essentially, we are all aware that failures happen, but how many of us have ever tested a restoration from previously created AWS RDS snapshots? Not 100% of us, certainly. Personally, I’ve seen many examples of a “completely believing in the magic power of the cloud” attitude and have often heard people say: “we do not need to worry about the data, we’ve got regular system snapshots done by AWS”. Once the failure comes (a developer “accidentally” deletes a tiny part of the data), it’s always too late for guessing. I won’t get into details and dwell on how you feel when you have to act fast, especially when dealing with a problem you have never tested before. Let’s just say that your heart races like a speeding train.

Better safe than sorry


“An old Russian proverb says: ‘Trust, but verify’. For our internal and external ISO 27001 audits, we needed a clear report of the integrity of our encrypted backups. Chaos Gears implemented a method to make the integrity of our backups easy to demonstrate to internal and external auditors.”

Up to this point, our client had RDS databases launched in 8 different AWS regions. Keeping the data consistent and being able to restore it from a particular backup in case of an outage always top the list of our priorities. Unfortunately, in this case there was no habit of regular internal testing. So, the right time came with an ISO audit.

Keep it simple

The balance between task automation and the time spent on building automated flows is a question that always sparks long debates. Our team decided to avoid developing something that would introduce additional maintenance overhead and, more importantly, would inflate our monthly bill, just because we wanted to save some time. That’s why we leveraged AWS Step Functions to keep the whole workflow in a single place.

NOTE: For those who have never worked with AWS Step Functions, it is a serverless function orchestrator that makes it easy to sequence AWS Lambda functions and multiple AWS services into business-critical applications.

If costs are fine, then you’re on the right track

When you use Step Functions, you are charged based on the number of state transitions required to execute your application. After the free tier, which includes 4,000 state transitions per month, you pay $0.025 per 1,000 state transitions. Our client would accept an automated approach only if it combined a short development period with low running costs.
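To make the pricing concrete, here is a minimal sketch of the arithmetic, using only the free tier and the per-transition price quoted above. The function name and the example load (8 regions, one test per day, roughly 60 transitions per run) are assumptions for illustration, not figures from the project. The estimate covers Step Functions only; Lambda invocations and the running time of the restored RDS instance are billed separately.

```python
FREE_TRANSITIONS_PER_MONTH = 4_000   # free tier quoted in the article
PRICE_PER_1000_TRANSITIONS = 0.025   # USD, price quoted in the article


def monthly_cost(transitions_per_run: int, runs_per_month: int) -> float:
    """Estimate the monthly Step Functions charge in USD."""
    total = transitions_per_run * runs_per_month
    billable = max(0, total - FREE_TRANSITIONS_PER_MONTH)
    return billable / 1000 * PRICE_PER_1000_TRANSITIONS


# Hypothetical load: 8 regions, one test per day, about 60 transitions
# per run (most of them spent looping through wait and status checks).
print(f"${monthly_cost(60, 8 * 30):.2f}")  # $0.26
```

Even with a generous polling loop, the orchestration itself stays well below a dollar per month under these assumptions.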

Let’s design the workflow

Step Functions is a great way to build and step through a series of AWS services in a matter of minutes. In this case, it also addressed a few concerns that our client specified as highly important from their perspective:

  • new solution cannot lose its state
  • dealing with errors and timeouts must be done in an easy way
    (NOTE: Step Functions includes built-in error handling and retry conditions that let you set how many times a certain function is retried before it goes to a failed state)
  • not investing much time in building the flow and operating it afterwards
  • receiving a human friendly report after each test and storing it securely
  • the solution has to be auditable
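The built-in retry conditions mentioned in the note are configured per state with the Amazon States Language fields `IntervalSeconds`, `MaxAttempts` and `BackoffRate`. The sketch below only illustrates how those three values translate into waiting times; the helper name `retry_delays` and the example values are assumptions, and optional caps or jitter are ignored.

```python
from __future__ import annotations


def retry_delays(interval_seconds: float = 1, max_attempts: int = 3,
                 backoff_rate: float = 2.0) -> list[float]:
    """Seconds waited before each retry of a failing state.

    The first retry waits interval_seconds; every further retry
    multiplies the previous wait by backoff_rate.
    """
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


# Example retrier as it could appear in a state definition (values assumed).
retrier = {
    "ErrorEquals": ["States.TaskFailed"],
    "IntervalSeconds": 5,
    "MaxAttempts": 3,
    "BackoffRate": 2.0,
}

print(retry_delays(retrier["IntervalSeconds"],
                   retrier["MaxAttempts"],
                   retrier["BackoffRate"]))  # [5.0, 10.0, 20.0]
```

If all retries are used up, the error is handed to the state's error handling, which is where a cleanup step can take over.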

Below, you’ll find what we’ve developed as an initial version of the flow, which is currently on the roadmap for further development.

1. Extract the credentials from the event and save them in AWS SSM. Encryption is done via AWS KMS
2. Try to restore the RDS from the latest snapshot
3. Check RDS status during the restoration
4. Choice step:
a. If new RDS is not restored yet, then jump into the WAIT step
b. If new RDS is successfully restored, jump into the next step
5. WAIT step - wait for n seconds and then go back to point 3
6. Get the RDS endpoint
7. Fetch the credentials from AWS SSM based on the input parameter name, then establish a connection with the new RDS endpoint. Run the following simple SQL queries:
a. Get the Postgres version
b. Retrieve the number of tables
c. Retrieve the number of users
8. Save the report in an S3 bucket
9. Parallel step - destroy the previously created RDS and send Slack notification
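Steps 2, 3 and 6 of the list above could be sketched as small Lambda helpers like the ones below. This is an illustration under assumptions, not the code used in the project: the function names, the event key `test_db` and the returned field names are hypothetical. The `rds` argument is expected to be a boto3 RDS client; the sketch ignores pagination of the snapshot list.

```python
from typing import Any, Dict, List, Optional


def latest_snapshot_id(snapshots: List[Dict[str, Any]]) -> Optional[str]:
    """Pick the newest snapshot that is ready to be restored."""
    available = [s for s in snapshots if s.get("Status") == "available"]
    if not available:
        return None
    newest = max(available, key=lambda s: s["SnapshotCreateTime"])
    return newest["DBSnapshotIdentifier"]


def start_restore(rds, source_db: str, test_db: str) -> str:
    """Step 2: restore a throwaway instance from the latest snapshot."""
    response = rds.describe_db_snapshots(DBInstanceIdentifier=source_db)
    snapshot_id = latest_snapshot_id(response["DBSnapshots"])
    if snapshot_id is None:
        raise RuntimeError(f"no available snapshot for {source_db}")
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=test_db,
        DBSnapshotIdentifier=snapshot_id,
    )
    return snapshot_id


def check_status(rds, test_db: str) -> Dict[str, Any]:
    """Steps 3 and 6: report the status and, once known, the endpoint."""
    response = rds.describe_db_instances(DBInstanceIdentifier=test_db)
    instance = response["DBInstances"][0]
    endpoint = instance.get("Endpoint") or {}  # absent while still creating
    return {
        "status": instance["DBInstanceStatus"],
        "host": endpoint.get("Address"),
        "port": endpoint.get("Port"),
    }


def lambda_handler(event, context):
    # boto3 is imported lazily so the helpers can be tested without AWS.
    import boto3
    return check_status(boto3.client("rds"), event["test_db"])
```

The choice step then only has to compare the returned `status` with `available` to decide between waiting and moving on.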

IMPORTANT: All of the steps mentioned above work as a job with a try-catch approach. In a nutshell, if any of the steps from 1 to 9 fails, the flow is immediately directed to the “Destroying RDS Instance / Slack Notification” step.
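The flow and its try-catch behaviour can be expressed as an Amazon States Language definition. The sketch below builds one as a Python dictionary and prints it as JSON. State names, Lambda function names, the ARN placeholder, the 60-second wait and the overall timeout are all assumptions; it also assumes that each Lambda returns its input together with its own result, so that `$.status` is available to the choice step.

```python
import json

# Placeholder ARN; REGION and ACCOUNT_ID must be replaced with real values.
LAMBDA_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:{}"

# Any error in a task is caught and routed to the cleanup step.
CATCH_ALL = [{"ErrorEquals": ["States.ALL"],
              "ResultPath": "$.error",
              "Next": "Cleanup"}]

# Retry only transient Lambda service errors before giving up.
RETRY_TRANSIENT = [{"ErrorEquals": ["Lambda.ServiceException",
                                    "Lambda.TooManyRequestsException"],
                    "IntervalSeconds": 5, "MaxAttempts": 3, "BackoffRate": 2.0}]


def task(function_name, next_state):
    """A Lambda task with shared retry and catch rules."""
    return {"Type": "Task", "Resource": LAMBDA_ARN.format(function_name),
            "Retry": RETRY_TRANSIENT, "Catch": CATCH_ALL, "Next": next_state}


def branch(function_name):
    """A single-task branch for the parallel cleanup step."""
    return {"StartAt": function_name,
            "States": {function_name: {
                "Type": "Task",
                "Resource": LAMBDA_ARN.format(function_name),
                "End": True}}}


definition = {
    "Comment": "RDS snapshot integrity test (illustrative sketch)",
    "StartAt": "SaveCredentials",
    "TimeoutSeconds": 7200,  # assumed upper bound for the polling loop
    "States": {
        "SaveCredentials": task("save-credentials", "RestoreFromSnapshot"),
        "RestoreFromSnapshot": task("restore-from-snapshot", "CheckStatus"),
        "CheckStatus": task("check-status", "IsRestored"),
        "IsRestored": {
            "Type": "Choice",
            "Choices": [{"Variable": "$.status",
                         "StringEquals": "available",
                         "Next": "GetEndpoint"}],
            "Default": "WaitForRestore"},
        "WaitForRestore": {"Type": "Wait", "Seconds": 60,
                           "Next": "CheckStatus"},
        "GetEndpoint": task("get-endpoint", "RunChecks"),
        "RunChecks": task("run-checks", "SaveReport"),
        "SaveReport": task("save-report", "Cleanup"),
        "Cleanup": {"Type": "Parallel",
                    "Branches": [branch("destroy-instance"),
                                 branch("notify-slack")],
                    "End": True},
    },
}

print(json.dumps(definition, indent=2))
```

Because the catch rule can fire before a test instance exists (for example in the first step), the destroy function in such a design has to tolerate a missing instance.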

Next steps

Basically, our company is full of enthusiasts who like to take small steps but get tangible results. So, here’s what our next priorities look like:

  • define more complex SQL queries for tests
  • stop passing credentials in the event in point 1 and replace it with something more reliable
  • deal with the Step Functions execution history retention limit, which is 90 days. After that, you can no longer retrieve or view the execution history. There is no further quota for the number of closed executions that Step Functions retains.
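As a baseline for the more complex queries in the first item, the three simple checks from the current flow (Postgres version, number of tables, number of users) might be organised as a named set of queries, so that a new test is just one more entry. This is a sketch under assumptions: the names `CHECKS` and `run_checks` are hypothetical, "users" is interpreted as database users, and `cursor` is any Python DB-API cursor connected to the restored instance.

```python
from typing import Any, Dict

# Each entry is one integrity check returning a single value.
CHECKS: Dict[str, str] = {
    "postgres_version": "SELECT version()",
    "table_count": (
        "SELECT count(*) FROM information_schema.tables "
        "WHERE table_type = 'BASE TABLE' "
        "AND table_schema NOT IN ('pg_catalog', 'information_schema')"
    ),
    # Assumption: 'users' means database users, not application users.
    "user_count": "SELECT count(*) FROM pg_catalog.pg_user",
}


def run_checks(cursor) -> Dict[str, Any]:
    """Run every check and collect the first column of its first row."""
    report: Dict[str, Any] = {}
    for name, sql in CHECKS.items():
        cursor.execute(sql)
        report[name] = cursor.fetchone()[0]
    return report
```

The resulting dictionary can be serialised as the report that is stored after each test run.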

Conclusion

Problem solving is all about seeking the simplest solutions, provided they exist. Honestly, I wouldn't treat leveraging serverless AWS services as a remedy for every problem. However, speaking from my own experience, it definitely allows us to quickly test an idea against a working solution. With the example depicted above, I was able to provide a working version for tests after just 2 days, and I didn’t have to think about irrelevant setup issues.

Technology Stack

AWS Step Functions
AWS Lambda
SSM Agent
AWS KMS
Amazon RDS