For one of our customers, a digital healthcare company that provides a patient engagement solution, we’ve developed an AWS RDS integrity testing tool based on a serverless approach. We used a combination of AWS Step Functions, AWS Lambda, AWS SSM and AWS KMS. In short, this is not solely a story about a technological concept but also about a way of dealing with such problems in a startup that finds itself under the pressure of other priorities.
It’s all about the data you store
We all live in a world where data has become a new currency. Any potential loss or database downtime means loss of income. Which brings us to the key question: how will you, as an organization, ensure the quality and completeness of your data once a failure springs up? Essentially, we are all aware of failures, but how many of us have ever tested a restoration from previously created AWS RDS snapshots? Not 100% of us, certainly. Personally, I’ve seen many examples of a “completely believing in the magic power of the cloud” attitude and have often heard people say: “we do not need to worry about the data, we’ve got regular system snapshots done by AWS”. Once the failure comes up (a developer deletes a tiny part of the data “accidentally”), it’s always too late for guessing. I won’t get into details and dwell on how you feel when you have to act fast, especially when dealing with a problem you have never tested before. Let’s just say that your heart races like a speeding train.
Better safe than sorry
Up to this point, our client had RDS databases launched in 8 different AWS regions. Keeping the data consistent and being able to restore it from a particular backup in case of an outage always top the list of our priorities. Unfortunately, in this case there was no habit of regular internal testing. So, the right time came with an ISO audit.
Keep it simple
The balance between task automation and the time spent on building automated flows is a question that always sparks long debates. Our team decided to avoid developing something that would introduce additional maintenance overhead and, more importantly, something that would certainly kill our monthly bill, just because we wanted to save some time. That’s why we leveraged AWS Step Functions to keep the whole workflow in a single place.
NOTE: For those who have never worked with AWS Step Functions, it is a serverless function orchestrator that makes it easy to sequence AWS Lambda functions and multiple AWS services into business-critical applications.
If costs are fine, then you’re on the right track
When you use Step Functions, you are charged based on the number of state transitions required to execute your application. After the free tier, which includes 4,000 state transitions per month, you pay $0.025 per 1,000 state transitions. Our client would accept an automated approach as long as a short development period went hand in hand with low final costs.
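To make the pricing above concrete, here is a rough back-of-the-envelope sketch. The prices come from the paragraph above; the number of transitions per test run and the run frequency are purely illustrative assumptions, not figures from the actual project.

```python
# Prices quoted in the text above.
FREE_TRANSITIONS_PER_MONTH = 4_000
PRICE_PER_1000_TRANSITIONS_USD = 0.025


def monthly_cost(transitions_per_execution: int, executions_per_month: int) -> float:
    """Estimate the monthly Step Functions bill (USD) for state transitions only."""
    total = transitions_per_execution * executions_per_month
    billable = max(0, total - FREE_TRANSITIONS_PER_MONTH)
    return billable / 1000 * PRICE_PER_1000_TRANSITIONS_USD


if __name__ == "__main__":
    # Assumption: one test per day in each of the 8 regions, 12 transitions per test.
    print(monthly_cost(transitions_per_execution=12, executions_per_month=8 * 30))
    # Assumption: a much chattier flow with 50 transitions per test.
    print(monthly_cost(transitions_per_execution=50, executions_per_month=8 * 30))
```

Under these assumptions the first scenario stays inside the free tier. Lambda invocations and the temporary RDS instances are billed separately and are not covered by this estimate.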
Let’s design the workflow
Step Functions is a great way to build and step through a series of AWS services in a matter of minutes. In this case, it also addressed a few concerns that our client specified as highly important from their perspective:
- the new solution cannot lose its state
- dealing with errors and timeouts must be done in an easy way
(NOTE: Step Functions includes built-in retry conditions and error handling that let you set the number of times a certain function is retried before it goes to a failed state)
- not investing much time in building the flow and operating it afterwards
- receiving a human friendly report after each test and storing it securely
- the solution has to be auditable
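The built-in retry and timeout handling mentioned in the list above can be sketched as a single Task state written in the Amazon States Language and built here as a Python dictionary. The state name, the Lambda ARN, and all numeric values are hypothetical placeholders, not the configuration used in the actual flow.

```python
import json

# Hypothetical Lambda ARN, for illustration only.
RESTORE_LAMBDA_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:restore-snapshot"

restore_state = {
    "Type": "Task",
    "Resource": RESTORE_LAMBDA_ARN,
    # Fail the task if it runs longer than 15 minutes (assumed value).
    "TimeoutSeconds": 900,
    "Retry": [
        {
            # States.TaskFailed does not match timeouts, so both are listed.
            "ErrorEquals": ["States.Timeout", "States.TaskFailed"],
            "IntervalSeconds": 30,  # wait before the first retry
            "MaxAttempts": 3,       # give up after three retries
            "BackoffRate": 2.0,     # double the wait after each retry
        }
    ],
    "End": True,
}

definition = {
    "StartAt": "RestoreSnapshot",
    "States": {"RestoreSnapshot": restore_state},
}

if __name__ == "__main__":
    # The JSON output is what would be passed as a state machine definition.
    print(json.dumps(definition, indent=2))
```

Keeping the retry policy in the state definition rather than in the Lambda code means the function itself can stay simple and just raise on failure.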
Below, you’ll find what we’ve developed as an initial version of the flow which, right now, is on a roadmap for future development.
IMPORTANT: All of the steps mentioned above work as a Job with a try-catch approach. In a nutshell, if any of the steps from 1 to 9 fails, the execution is immediately directed to the “Destroying RDS Instance/ Slack Notification” step.
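One way to express this try-catch behaviour is a `Catch` block on every Task state that points to the cleanup state. The sketch below generates such a definition in Python. The step names and ARNs are hypothetical placeholders (the real flow has its own nine steps), and it assumes that the success path also ends in the cleanup step.

```python
import json

# Hypothetical name for the "Destroying RDS Instance/ Slack Notification" step.
CLEANUP_STATE = "DestroyRdsInstanceAndNotify"


def build_definition(step_names, arn_for):
    """Chain the steps in order; any error in any step jumps to the cleanup state."""
    states = {}
    for index, name in enumerate(step_names):
        is_last = index + 1 == len(step_names)
        states[name] = {
            "Type": "Task",
            "Resource": arn_for(name),
            "Catch": [
                {
                    "ErrorEquals": ["States.ALL"],  # catch every error type
                    "ResultPath": "$.error",        # keep error details for the report
                    "Next": CLEANUP_STATE,
                }
            ],
            "Next": CLEANUP_STATE if is_last else step_names[index + 1],
        }
    states[CLEANUP_STATE] = {
        "Type": "Task",
        "Resource": arn_for(CLEANUP_STATE),
        "End": True,
    }
    return {"StartAt": step_names[0], "States": states}


def example_arn(name):
    # Hypothetical account, region and function naming scheme.
    return f"arn:aws:lambda:eu-west-1:123456789012:function:{name}"


if __name__ == "__main__":
    steps = ["RestoreSnapshot", "RunSqlChecks", "StoreReport"]  # placeholder steps
    print(json.dumps(build_definition(steps, example_arn), indent=2))
```

Because the cleanup state is reached on both success and failure, the restored test instance should not be left running after a broken test.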
Basically, our company is full of enthusiasts who like to take small steps but get tangible results. So, here’s what our next priorities look like:
- define more complex SQL queries for tests
- get rid of passing credentials in point 1 and replace it with something more reliable
- deal with the Step Functions execution history limit, which is 90 days. After that, you can no longer retrieve or view the execution history. There is no further quota for the number of closed executions that Step Functions retains.
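As one possible direction for the 90-day limit, the execution history could be exported shortly after each run and archived elsewhere. The sketch below is only an assumption about how that might look, not something the current flow does. It pages through the `get_execution_history` call of a Step Functions client (for example `boto3.client("stepfunctions")`) that is passed in as a parameter, so the function can be exercised without AWS access.

```python
import json


def collect_history(sfn_client, execution_arn):
    """Return all history events of one execution, following pagination tokens."""
    events = []
    token = None
    while True:
        kwargs = {"executionArn": execution_arn, "maxResults": 1000}
        if token:
            kwargs["nextToken"] = token
        page = sfn_client.get_execution_history(**kwargs)
        events.extend(page["events"])
        token = page.get("nextToken")
        if not token:
            return events


def history_to_json(events):
    """Serialize events; timestamps are datetime objects, so fall back to str()."""
    return json.dumps(events, default=str)


# Usage sketch (requires boto3 and AWS credentials; the ARN is a placeholder):
#   import boto3
#   sfn = boto3.client("stepfunctions")
#   body = history_to_json(collect_history(sfn, "<execution ARN>"))
#   # body could then be written to an encrypted S3 bucket for the audit trail.
```

Where the exported history ends up (S3, a log group, or somewhere else) is an open design choice, and the storage target in the usage comment is just one option.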
Problem solving is all about seeking the simplest available solutions, provided they exist. Honestly, I wouldn't treat leveraging serverless AWS services as a remedy for every problem. However, speaking from my own experience, it definitely allows us to quickly validate an idea with a working solution. With the example depicted above, I was able to provide an already working version for tests after just 2 days, and I didn’t have to think about irrelevant setup issues.