November 7, 2021
August 9, 2019

Using the saga pattern with AWS Step Functions

Failures — same story over and over again.

Karol Junde

We’re all aware that everything fails from time to time. No matter what type of failure we have to deal with, its aftermath is generally a pain in the ass. This is especially true for a distributed microservice ecosystem, where a single request has to cross multiple bounded contexts (microservices with their own independent databases). It is doubly true for the serverless approach to application development. In a vast ocean filled with tiny Lambda functions, it’s pretty easy to come across a failure. On top of that, problems appear in the connections between Lambda functions and other AWS services.

For example, DynamoDB sets limits on read and write throughput. When a limit is exceeded, you pay a penalty either in the form of a higher monthly AWS bill or in rejected (throttled) requests. The latter requires some handling, and a counter-reaction if other functions have changed the data in the meantime.
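A rejected request is usually worth retrying before any compensation kicks in. The snippet below is a minimal sketch of retrying with exponential backoff and jitter. `ThrottledError`, `call_with_backoff` and `flaky_write` are hypothetical names made up for this illustration, not part of any AWS SDK, and the fake write stands in for a real call to a data store.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error raised by a data store client."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.05, sleep=time.sleep):
    """Run operation(), retrying on ThrottledError with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts:
                # Out of retries: surface the failure so the caller can compensate.
                raise
            # Full jitter: wait a random time up to base_delay * 2^(attempt - 1).
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))


# Usage: a fake write that is rejected twice before it succeeds.
attempts = {"count": 0}


def flaky_write():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ThrottledError("write capacity exceeded")
    return "saved"


# The sleep function is replaced with a no-op so the example runs instantly.
result = call_with_backoff(flaky_write, sleep=lambda _seconds: None)
print(result, attempts["count"])
```

If the retries run out, the error is raised again on purpose, so that whatever orchestrates the flow can decide to undo the work done so far.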

Build automated workflows — AWS Step Functions

Step Functions lets you define a state machine in the Amazon States Language (ASL), a JSON-based language that describes the states of the state machine as well as the transitions between them. Reading JSON may be painful for some of us, so AWS generates a nice-looking flowchart from our ASL code, which makes the machine easier to visualize, as seen here. We are going to use it as an example in this article:

Step Functions — example
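To give a feeling for what ASL looks like, here is a minimal sketch of a two-state machine, written as a Python dictionary and serialized to JSON. The state names and the Lambda ARN are placeholders invented for this illustration; only the field names (`StartAt`, `States`, `Type`, `Resource`, `Next`) come from ASL itself.

```python
import json

# Hypothetical ARN; replace with the ARN of a real Lambda function.
SAVE_INFO_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:save-deployment-info"

state_machine = {
    "Comment": "Minimal illustrative state machine",
    "StartAt": "SaveDeploymentInfo",
    "States": {
        # A Task state invokes the Lambda function and then moves on to "Done".
        "SaveDeploymentInfo": {
            "Type": "Task",
            "Resource": SAVE_INFO_ARN,
            "Next": "Done",
        },
        # A Succeed state ends the execution successfully.
        "Done": {"Type": "Succeed"},
    },
}

# The JSON document is what Step Functions would receive as the definition.
asl_json = json.dumps(state_machine, indent=2)
print(asl_json)
```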

The Step Functions service gives us the opportunity to build activity flows (not to be confused with state machines) that are really helpful when a transaction-like request has to be implemented. What I have in mind can be explained by Wikipedia’s definition of an “atomic transaction”:

“An atomic transaction is an indivisible and irreducible series of database operations such that either all occur, or nothing occurs. A guarantee of atomicity prevents updates to the database occurring only partially, which can cause greater problems than rejecting the whole series outright.”

As mentioned, these activity flows consist of small steps (Lambda functions, Wait states, Choice states, etc.) that together form one common logic flow. If such a flow puts something into S3 and simultaneously saves some metadata (as in our example), then completing only half of the operations successfully is not an option. Loss of consistency is completely unacceptable. That’s the gap the saga pattern fills.

Saga Patterns — dealing with long-lived transactions

I found this definition of a saga: “A saga is a sequence of local transactions where each transaction updates data within a single service. An external request corresponding to the system operation initiates the first transaction, and then each subsequent step is triggered by the completion of the previous one.”

A note from me: a saga is a failure-handling pattern, so when a failure occurs during one of those long-lived flows, we apply the corresponding compensating actions to return to the state the system was in when the saga/activity flow started.
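Stripped of any AWS specifics, the idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the way Step Functions implements it: `run_saga` and the four step functions are hypothetical, and an in-memory set and dictionary stand in for the S3 bucket and the DynamoDB table.

```python
def run_saga(steps, context):
    """Run (action, compensation) pairs in order; undo completed steps on failure."""
    completed = []
    try:
        for action, compensation in steps:
            action(context)
            completed.append(compensation)
    except Exception:
        # Compensate in reverse order, newest change first.
        for compensation in reversed(completed):
            compensation(context)
        return False
    return True


# In-memory stand-ins for a bucket and a metadata table.
bucket, table = set(), {}


def create_path(ctx):
    bucket.add(ctx["path"])


def remove_path(ctx):
    bucket.discard(ctx["path"])


def save_info(ctx):
    raise RuntimeError("simulated write failure")


def delete_info(ctx):
    table.pop(ctx["id"], None)


ok = run_saga(
    [(create_path, remove_path), (save_info, delete_info)],
    {"id": "d-1", "path": "deployments/d-1/"},
)
print(ok, bucket, table)
```

The second step fails, so the path created by the first step is removed again and the system ends up where it started. Note that this sketch only compensates steps that finished; a step that can fail halfway needs a compensation that is safe to run against partial work.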

At Chaos Gears we tend to use two scenarios, a sequential one and a parallel one, depending on the needs.

Saga Patterns — sequential scenario
Saga Patterns — parallel step scenario

The diagram at the beginning of this article covers the parallel scenario. It contains a step with two Lambda functions (DynamoDBFallback, BucketPathFallback), which are the compensating mirror images of CreateDeploymentS3Path and SaveDeploymentInfoDynamoDB. Never mind the names, which have been changed for the sake of this article. I hope you, dear readers, have already got the point: whenever you code a Lambda function that is going to change state as part of a workflow, always think about how to keep things consistent in case of failure. I don’t have to remind you that idempotence is a must-have in such a scenario.
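What idempotence means for a compensating function can be shown with a small sketch. The handler below is hypothetical: it borrows the name DynamoDBFallback from the diagram, but the event shape (`deployment_id`) and the in-memory `DEPLOYMENTS` dictionary are assumptions made for this illustration, and a real handler would talk to the actual table.

```python
# In-memory stand-in for a metadata table; a real handler would call a data store.
DEPLOYMENTS = {"d-1": {"path": "deployments/d-1/"}}


def dynamodb_fallback_handler(event, context=None):
    """Hypothetical compensating handler: remove the metadata saved for a deployment."""
    deployment_id = event["deployment_id"]
    # pop() with a default makes a repeated call a no-op instead of an error.
    removed = DEPLOYMENTS.pop(deployment_id, None)
    return {"deployment_id": deployment_id, "removed": removed is not None}


# Calling the handler twice leaves the same end state as calling it once.
first = dynamodb_fallback_handler({"deployment_id": "d-1"})
second = dynamodb_fallback_handler({"deployment_id": "d-1"})
print(first["removed"], second["removed"], DEPLOYMENTS)
```

The second call finds nothing to delete and simply reports that, which is what you want when a workflow retries a fallback or runs it for a step that never wrote anything.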

In case of failure
In case of success

Those of you who follow our blog should already know that at Chaos Gears we rely on the Serverless Framework when launching serverless environments. For us, the massive benefit of this framework lies in its plugins.

Basically, you don’t have to code everything from scratch. Just remember that the devil is in the details, so be cautious.

One of the plugins handles AWS Step Functions configuration. Believe me, it eases the pain of building complex flows. I prefer reading YAML to reading JSON, as YAML is more human-readable.

Implementation — save your priceless time

The flow behind the diagram shown at the beginning of the article is built around states of the “Parallel” type, which let you invoke several Lambda functions simultaneously. Whenever one of them fails, the whole step is treated as failed, and the fallback procedures (compensating transactions) are launched.
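As a rough sketch of such a flow, the definition below is written as a Python dictionary that maps one-to-one onto ASL; the same structure can be expressed in YAML. The four Lambda names are the ones used in this article, while the state names `CreateDeployment`, `Fallback`, `Succeeded` and `Failed`, the ARN prefix and the error name are assumptions made for this illustration.

```python
import json

# Hypothetical ARN prefix; the function names follow the states named in the article.
ARN = "arn:aws:lambda:eu-west-1:123456789012:function:"


def task(function_name):
    """Build a single-state branch that invokes one Lambda function."""
    return {
        "StartAt": function_name,
        "States": {
            function_name: {"Type": "Task", "Resource": ARN + function_name, "End": True}
        },
    }


definition = {
    "StartAt": "CreateDeployment",
    "States": {
        # Both branches run simultaneously; if either fails, the whole state fails.
        "CreateDeployment": {
            "Type": "Parallel",
            "Branches": [task("CreateDeploymentS3Path"), task("SaveDeploymentInfoDynamoDB")],
            # Any error is caught and routed to the compensating step.
            "Catch": [{"ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "Fallback"}],
            "Next": "Succeeded",
        },
        # The compensating transactions also run in parallel.
        "Fallback": {
            "Type": "Parallel",
            "Branches": [task("BucketPathFallback"), task("DynamoDBFallback")],
            "Next": "Failed",
        },
        "Succeeded": {"Type": "Succeed"},
        "Failed": {"Type": "Fail", "Error": "DeploymentFailed", "Cause": "Compensated after a failed step"},
    },
}

print(json.dumps(definition, indent=2))
```

Because both fallbacks run no matter which branch failed, each of them has to cope with the case where its counterpart never changed anything.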

Where to go next

Establishing consistency and maintaining it across services and their databases is the main challenge you face when you design and develop serverless architectures. It’s almost impossible to handle that task without the saga pattern. But let’s make something clear: AWS Step Functions won’t solve all of your problems and won’t fit every serverless scenario. However, the service offers a pleasant way to simplify the complexities of dealing with a long-lived transaction across distributed components.

