Disaster recovery - still not a universal vaccine in the XXI century

Disasters usually happen at an inconvenient time, with no warning. Invest your time!

Author
Karol Junde

No fear of the outage

We know that on any given day part of our infrastructure may simply go down, and somehow we have got used to that feeling and adapted to live with it. Just take a look at the outages of the global leaders, described under the following links:

https://www.zdnet.com/article/microsofts-march-3-azure-east-us-outage-what-went-wrong-or-right/

https://www.zdnet.com/article/google-details-catastrophic-cloud-outage-events-promises-to-do-better-next-time/

https://aws.amazon.com/message/41926/

https://www.crn.com/news/cloud/ibm-blames-massive-cloud-outage-on-third-party-network-provider

And the most recent one from OVH:

https://www.searchenginejournal.com/ovh-data-center-fire-darkens-thousands-of-sites-worldwide/398485/#close

One conclusion comes to mind: they all fail from time to time. Sometimes it’s a whole region, other times just a particular service, like the Amazon Kinesis outage described below:

https://aws.amazon.com/message/11201/

Nothing will change in this area, even though the aforementioned cloud providers are multiplying their efforts to predict potential outages and constantly raising the quality of their services. All those precautions won’t prevent mistakes from happening. Providers are going to make them, and so are we, their customers. Yet, though we know it all too well, we keep acting like fools and complaining, putting all the blame for unsuccessful or partially inefficient businesses on providers and service outages. My first assumption is that we tend to forget that our environments evolve from day one, and what we considered adequate and sufficient at the start is adequate no longer.

Face some facts

Cloud providers give you equipment. Some of them even share the responsibility with their customers when it comes to certain parts of the infrastructure. What does it mean for you? The answer is simple. By picking your favourite vendor, you are presented with a variety of services, from simple backup tools to full disaster recovery templates just waiting to be implemented, that help you protect yourself from failures. The difference between those providers comes down to the number of out-of-the-box services. Of course, if you have enough time and resources you can code your own tools, but let’s leave that option out for now. For me, reinventing the wheel always felt like a waste of time, but you are free to have your own opinion.

Where should I start with my provider?

Based on my experience, I’ll use AWS as an example, but I’m sure you’ll find similar counterparts provided by your selected “business cloud holders”.

Once you accept that anything can go wrong with your environment, try to answer a few questions:

  1. Does my business require zero downtime?
  2. Does my business allow a disaster recovery scenario in which the data is restored after the outage?
  3. Does my business allow a disaster recovery scenario in which resources are deployed after the outage? (a tighter recovery time objective, RTO)
  4. Does my business allow only a disaster recovery scenario in which already deployed resources are started and scaled after the outage? (an even tighter RTO)
  5. Does my business require me to have emergency resources already implemented and running, so I can scale them up when experiencing an outage?
  6. Have I ever walked through the entire disaster recovery plan?
  7. Do I have an up-to-date, documented recovery plan for each of my cloud services?
  8. Do I have an up-to-date, documented recovery plan for my entire region?
  9. Do I have an up-to-date, documented recovery plan for my entire Availability Zone or Data Center if necessary?
  10. Do I constantly improve operational excellence through regular chaos engineering outage simulations?

AWS-based definitions:

A Region is a physical location around the world where AWS clusters data centers. AWS calls each group of logical data centers an Availability Zone (AZ). Each AWS Region consists of multiple isolated and physically separate AZs within a geographic area.

Availability Zones are distinct locations within an AWS Region that are engineered to be isolated from failures in other Availability Zones. They provide inexpensive, low-latency network connectivity to other Availability Zones in the same AWS Region. Each Region is completely independent.

By answering and rating these questions you will get a clear picture of where on the operational excellence map you are right now, and whether you will sleep well when an outage occurs.

To make it clear, the idea behind this article is not to judge the cloud providers but rather to advise you on what your team should take into consideration.

Dream big, but start with small steps

Every company that uses AWS services, from small startups to big enterprises, must consider a region outage as a potential problem. However, preparing for such failures first requires us to be prepared for outages of particular services. Start with simple steps. Prepare a list of AWS services you’ve either implemented or are about to implement. Your workloads’ data will require a backup strategy that either runs periodically or is part of a continuous job (for example in a pilot light scenario, which I’ll cover in the next section). How often you run your backup will determine your achievable recovery point. If some parts of your cloud environment don’t have such a strategy, list those blind spots and add the necessary actions to your roadmap. Some AWS services may not have an out-of-the-box protection or backup feature, so your team might have to take on some development overhead to prepare a customised solution.
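To make the link between backup frequency and recovery point concrete, here is a minimal sketch in Python. The function name, the 6-hour schedule and the 1-hour target are illustrative assumptions, not values taken from any AWS service.

```python
from datetime import datetime, timedelta
from typing import Sequence


def data_loss_at(outage: datetime, backups: Sequence[datetime]) -> timedelta:
    """Data lost if we restore from the latest backup taken before the outage."""
    earlier = [b for b in backups if b <= outage]
    if not earlier:
        raise ValueError("no backup exists before the outage")
    return outage - max(earlier)


# Hypothetical schedule: one backup every 6 hours, starting at midnight.
backups = [datetime(2021, 3, 10, hour) for hour in (0, 6, 12, 18)]
outage = datetime(2021, 3, 10, 17, 45)

# The last backup before the outage ran at 12:00, so 5h45m of writes are gone.
lost = data_loss_at(outage, backups)

# Assumed business target: lose at most 1 hour of data.
meets_target = lost <= timedelta(hours=1)
```

With periodic backups the worst case is always close to the full interval between two runs, so a 6-hour schedule cannot satisfy a 1-hour recovery point, however reliable the backups are.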

If any of the following services is a part of your infrastructure:

  • Amazon EC2 instance
  • Amazon EBS
  • Amazon RDS
  • Amazon DynamoDB tables
  • Amazon EFS
  • AWS Storage Gateway volumes
  • Amazon FSx for Windows File Server and Amazon FSx for Lustre

Then definitely check out the AWS Backup service (https://aws.amazon.com/backup/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc). It enables centralized and automated data protection by taking care of the backups, even across regions. By having a backup in another region, you set yourself on the right path toward realizing the big dream of being prepared for a regional outage.
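As a rough illustration of a cross-region setup, the sketch below builds a backup plan document in Python with one daily rule and a copy action to a vault in a second region. The plan name, vault names, account ID, schedule and retention are all placeholders I made up for the example. The dictionary follows the shape I would expect to pass to boto3’s create_backup_plan call, but treat it as a starting point and verify it against the current API reference.

```python
import json


def build_backup_plan(name: str, source_vault: str, dr_vault_arn: str) -> dict:
    """Backup plan with one daily rule that also copies every recovery
    point to a vault in another region."""
    return {
        "BackupPlanName": name,
        "Rules": [
            {
                "RuleName": "daily-with-cross-region-copy",
                "TargetBackupVaultName": source_vault,
                # AWS cron format: every day at 03:00 UTC (assumed schedule).
                "ScheduleExpression": "cron(0 3 * * ? *)",
                "Lifecycle": {"DeleteAfterDays": 35},
                "CopyActions": [
                    {
                        "DestinationBackupVaultArn": dr_vault_arn,
                        "Lifecycle": {"DeleteAfterDays": 35},
                    }
                ],
            }
        ],
    }


plan = build_backup_plan(
    name="core-workload-plan",      # hypothetical plan name
    source_vault="primary-vault",   # hypothetical vault in the main region
    # Placeholder ARN of a vault in the backup region.
    dr_vault_arn="arn:aws:backup:eu-west-1:111122223333:backup-vault:dr-vault",
)
print(json.dumps(plan, indent=2))
```

With boto3 installed and credentials configured, a document like this could be submitted with boto3.client("backup").create_backup_plan(BackupPlan=plan). The resources to protect are attached separately through a backup selection.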

Let’s see how your company can deal with that.

Low cost “Pilot light” scenario - a good first step

Each financial department meticulously protects its company from additional costs, especially those which don’t generate income. With the pilot light approach, the data is replicated from one region to another and a copy of your core workload infrastructure is provisioned. The only services that are always on in the “backup” region are databases and object storage. Other services, like the EC2 instances, are preconfigured with the application code, mandatory configuration and dependencies, but they are kept turned off and only launched when a failover or test is invoked.

If you compare this approach to the backup and restore one, you’ll see it has a preprovisioned infrastructure all in place, waiting to be launched and scaled out for the traffic needs. Of course, spinning up the whole environment takes time, but unlike restoration from backups, where different unexpected issues might come up, the risk of recovery failure is pretty much limited, and you still save money by minimizing active resources, which should keep your CFO satisfied.

Higher costs = “Warm standby” scenario - a shorter recovery time

For a great number of companies a “pilot light” scenario has literally no value for their businesses. If you’re a big enterprise or a mature, recognizable startup with millions of users, you want your recovery time after downtime to be as short as possible. This is the point where a “warm standby” scenario comes into play. Generally, it relies on a fully functional, scaled-down copy of your production environment set up in another AWS region. Contrary to the pilot light, the workload is always on in a warm standby scenario, which gives a solid base for performing tests and strengthens the team’s confidence in its ability to recover from failure.

For those who can’t spot a difference between the pilot light and warm standby scenarios, here’s how I see them. The first one cannot process any requests without launching instances and non-core parts of your infrastructure, whereas the second one is capable of dealing with the traffic, but at a reduced capacity level (it requires scaling up). Of course, you shouldn’t be surprised that you will be charged extra for that “ready to go” backup infrastructure, which for many is still a mental block. Nonetheless, if you happen to be in the group of mature companies, you really should consider or even start the next, more complex scenario - “multi-site active/active”.

Highest costs = Highest protection - “Multi-site active/active” scenario

Essentially, it’s an extension of the warm standby case. In this particular scenario, users are able to access workloads in all of the regions they are deployed in. Although it sounds amazing, it’s the most complex and most expensive scenario. But it has a huge benefit, as the recovery time is reduced to near zero for most disasters.

NOTE: I used the word “most” deliberately, because recovery from issues like data corruption generally relies on backups, which obviously introduces a delayed recovery point.
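The four approaches mentioned in this article (backup and restore, pilot light, warm standby and multi-site active/active) can be put side by side in a small Python sketch. The field names and the 1 to 4 cost ranking are my own simplification for illustration; they only express the ordering described above, not real prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    data_always_on: bool       # databases / object storage live in the second region
    compute_running: bool      # application instances already running there
    serves_live_traffic: bool  # users are routed there during normal operation
    relative_cost: int         # 1 = cheapest, 4 = most expensive (ordering only)


STRATEGIES = [
    Strategy("backup and restore", False, False, False, 1),
    Strategy("pilot light", True, False, False, 2),
    Strategy("warm standby", True, True, False, 3),
    Strategy("multi-site active/active", True, True, True, 4),
]


def cheapest_strategy(needs_live_data: bool,
                      needs_running_compute: bool,
                      needs_zero_downtime: bool) -> Strategy:
    """Return the cheapest strategy that satisfies the stated requirements."""
    for s in sorted(STRATEGIES, key=lambda s: s.relative_cost):
        if needs_live_data and not s.data_always_on:
            continue
        if needs_running_compute and not s.compute_running:
            continue
        if needs_zero_downtime and not s.serves_live_traffic:
            continue
        return s
    raise LookupError("no strategy satisfies the requirements")


# A business that needs replicated data but can wait for instances to start:
choice = cheapest_strategy(needs_live_data=True,
                           needs_running_compute=False,
                           needs_zero_downtime=False)
print(choice.name)
```

The requirement flags correspond loosely to the first five questions from the list at the beginning of the article, so the answers you gave there can be fed straight into such a helper.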

It’s really worth mentioning that in the active/active model there is no such thing as failover, simply because your workload is being served from more than one region. Does it mean that there shouldn’t be any regular disaster recovery tests planned? Not at all. To avoid situations where all work goes to waste, your team has to focus on how the workload reacts to the loss of a whole AWS region.

You have to ask yourself: is traffic properly routed away from the failed AWS region to another available (active) one that is meant to handle the rerouted traffic? By selecting the active/active scenario, you commit to maintaining a near-zero recovery time from a region outage, together with proper traffic rerouting. This is the place where most of your trials, improvements and time will have to be spent.
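One possible way to express such rerouting is latency-based DNS records with health checks attached. The Python sketch below builds a change batch of that kind and adds a deliberately simplified model of the expected outcome, which can serve as the assertion in a recovery test. The domain, endpoints, health check IDs and latency figures are made-up placeholders, and the routing model is my own approximation, not Route 53’s actual algorithm.

```python
def latency_record(name: str, region: str, target: str, health_check_id: str) -> dict:
    """One latency-based record with a health check attached, so DNS can
    stop returning it while the check fails and another record is healthy."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "CNAME",
            "SetIdentifier": f"{region}-endpoint",
            "Region": region,
            "TTL": 60,
            "ResourceRecords": [{"Value": target}],
            "HealthCheckId": health_check_id,
        },
    }


# Hypothetical endpoints and placeholder health check IDs.
ENDPOINTS = {
    "eu-west-1": ("app-eu.example.com", "11111111-1111-1111-1111-111111111111"),
    "us-east-1": ("app-us.example.com", "22222222-2222-2222-2222-222222222222"),
}

change_batch = {
    "Comment": "latency routing with health checks for active/active",
    "Changes": [
        latency_record("app.example.com.", region, target, check_id)
        for region, (target, check_id) in ENDPOINTS.items()
    ],
}


def expected_region(latency_ms: dict, healthy: set) -> str:
    """Simplified model for a DR test: the lowest-latency healthy region wins."""
    candidates = {r: ms for r, ms in latency_ms.items() if r in healthy}
    if not candidates:
        raise RuntimeError("no healthy region left to route to")
    return min(candidates, key=candidates.get)


latency_ms = {"eu-west-1": 20, "us-east-1": 95}  # made-up client measurements
normal = expected_region(latency_ms, healthy={"eu-west-1", "us-east-1"})
during_outage = expected_region(latency_ms, healthy={"us-east-1"})
```

With boto3, a batch of this shape could be applied through boto3.client("route53").change_resource_record_sets(HostedZoneId=..., ChangeBatch=change_batch), where the hosted zone ID is yours to supply. The short TTL is a design choice: the longer resolvers cache the old answer, the longer clients keep hitting the failed region.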

Even if you have a well-designed active/active infrastructure, you will still be exposed to the biggest risk there is: human error. We all make mistakes and nothing can change that, so instead of treating them as negligible, put tests designed to catch such errors at the top of your priority list.

Don’t forget about human errors

Essentially, the vast majority of our customers are more concerned about human errors than about service or region outages. By human errors I mean, for example, unwanted deletion or modification of an S3 object, or “accidental” dropping of an RDS table. In the first case, you should protect yourself by leveraging object versioning, which protects your data in S3 from the consequences of deletion or modification actions. Basically, it works by retaining the original version of the object from before the action. It’s worth adding that if you are using S3 replication (and you definitely should) to back the data up to your backup region, then by default, which is not that obvious, when an object is deleted in the source bucket, Amazon S3 adds a delete marker only in the source bucket. In other words, if you delete an S3 object from your bucket by accident, you’ll still have the backed-up object.
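The idea behind versioning and delete markers is easier to see in a toy model. The Python class below is an in-memory illustration only: it is not the S3 API, and the class and method names are invented for this example.

```python
class VersionedBucket:
    """Toy in-memory model of a versioning-enabled bucket (not the S3 API)."""

    _DELETE_MARKER = object()

    def __init__(self):
        self._versions = {}  # key -> list of versions, oldest first

    def put(self, key, body):
        # Overwriting never destroys data: a new version is stacked on top.
        self._versions.setdefault(key, []).append(body)

    def delete(self, key):
        # A plain delete only adds a delete marker; older versions stay intact.
        self._versions.setdefault(key, []).append(self._DELETE_MARKER)

    def get(self, key):
        versions = self._versions.get(key, [])
        if not versions or versions[-1] is self._DELETE_MARKER:
            raise KeyError(key)
        return versions[-1]

    def undelete(self, key):
        """Remove a delete marker sitting on top, exposing the previous version."""
        versions = self._versions.get(key, [])
        if versions and versions[-1] is self._DELETE_MARKER:
            versions.pop()

    def history(self, key):
        """All retained object versions, oldest first, without delete markers."""
        return [v for v in self._versions.get(key, []) if v is not self._DELETE_MARKER]


bucket = VersionedBucket()
bucket.put("report.csv", b"original")
bucket.put("report.csv", b"bad edit")
bucket.delete("report.csv")    # the "accident"
bucket.undelete("report.csv")  # recovery: drop the marker
current = bucket.get("report.csv")
retained = bucket.history("report.csv")
```

After the recovery step the latest version is visible again and the original is still in the history, which is what lets you roll back both the unwanted deletion and the unwanted modification.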

What should be looked at more closely when using AWS?

We walked through the cases of building disaster recovery scenarios. Now, I’ll share some points that may serve as good topics for internal brainstorming:

1. Follow the IaC (Infrastructure as Code) approach to keep changes auditable and to avoid configuration inconsistencies in multi-region deployments. Besides the well-known Terraform and CloudFormation, there’s a new player in town, CDK, that allows defining IaC with familiar programming languages. It’s a pretty cool tool, but some might regard it as a bleeding-edge solution.

2. Amazon Route 53 supports geoproximity, failover and latency-based policies for routing your customers’ requests in multi-region deployments.

3. Databases and their global features (DynamoDB global tables, Aurora global database): the selected scenario, active/passive (a/p) or active/active (a/a), has a direct impact on your READs/WRITEs design:

    a. with a/p - WRITEs occur only in the primary Region, same as READs

    b. with a/a:

        i. READs - the common scenario is local read, which means that data is served from the region closest to the customer

        ii. WRITEs:

              1. write global - writes are routed to a single region, and in case of failure another region is promoted to be the primary one - a good example is Aurora global database

              2. write local - similarly to READs, WRITEs are routed to the closest region - a good example is DynamoDB global tables
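The read and write patterns from point 3 can be sketched as a small routing function in Python. This is a model of the decision only, with no database calls. The function names are invented, and “closest” is simplified to “the client’s own region if it is healthy”, both of which are assumptions of this example.

```python
from typing import Sequence


def closest_region(client_region: str, healthy: Sequence[str]) -> str:
    """Simplification: the client's own region if healthy, else the first healthy one."""
    return client_region if client_region in healthy else healthy[0]


def route(operation: str, client_region: str, healthy: Sequence[str],
          primary: str, write_mode: str) -> str:
    """Return the region that should serve a request in an a/a setup.

    write_mode "global": all writes go to a single primary region.
    write_mode "local":  writes go to the region closest to the client.
    """
    if operation not in ("read", "write"):
        raise ValueError(f"unknown operation: {operation}")
    if write_mode not in ("global", "local"):
        raise ValueError(f"unknown write mode: {write_mode}")
    if not healthy:
        raise RuntimeError("no healthy region available")

    # Local read, and local write when the database supports it.
    if operation == "read" or write_mode == "local":
        return closest_region(client_region, healthy)

    # Write global: keep the primary while it is healthy,
    # otherwise treat the first healthy region as the promoted primary.
    return primary if primary in healthy else healthy[0]


regions = ["eu-west-1", "us-east-1"]
read_target = route("read", "us-east-1", regions, "eu-west-1", "global")
global_write = route("write", "us-east-1", regions, "eu-west-1", "global")
local_write = route("write", "us-east-1", regions, "eu-west-1", "local")
after_failure = route("write", "us-east-1", ["us-east-1"], "eu-west-1", "global")
```

In the write global case a client far from the primary pays cross-region latency on every write, while in the write local case the application has to tolerate concurrent writes to the same item in different regions. That trade-off is the design decision worth brainstorming.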

Good intentions are not enough - it’s all about well-defined internal habits

Unfortunately, being aware of the danger and having an already provisioned backup region in any of the aforementioned scenarios won’t make you sleep better. The evolution of your cloud environment, through your teams’ internal development or changes in dependencies among application components, introduces new points which might not recover properly when an outage happens. Therefore, to build operational excellence you have to define and implement internal habits of periodic tests. Right now, I’m deliberately not using chaos engineering terminology, just to avoid opening a Pandora’s box for sceptics. All I’m saying is that when you work with a regularly changing cloud environment, you simply can’t expect to get successful failovers without well-defined procedures, an internal culture and, last but not least, regular tests. Remember, disasters usually happen at an inconvenient time, with no warning. Invest your time. You won’t ever regret it.

Technology Stack

AWS CDK
Amazon RDS
Amazon Route 53