Snapshot copy to another region
Before reading
- EBS: Elastic Block Store
- DR: Disaster Recovery
- Encrypted snapshots are not covered in this article
Tech Stack
Continuing this series of automation articles, the tech stack for this tool is the following:
- AWS Lambda
- Python using Boto3
- AWS CloudWatch Events
I will not go into granular detail about each of these services, since that is outside the scope of the article. I do, however, want to point out that CloudWatch Events was not mentioned in my previous article, so I am including it here because it plays an even bigger role when replicating these snapshots to other regions.
As for the main functions that are going to be used from Boto3, we have: client.copy_snapshot(), client.describe_snapshots(), client.create_tags(), and client.delete_snapshot().
Goal
Snapshots are point-in-time backups of EBS volumes. They are very useful because they help restore data from a volume in case of a failure or outage.
Now, what would happen if the main region in which you are hosting your application goes down? This is very unlikely to happen, but it is best practice to have snapshots copied to another region as a DR solution, just in case your main region were to go down. In situations like this, it is really important to have a strategy in which data is replicated and easily recoverable.
Hence, it becomes almost essential to have an automation tool in place to do this sort of replication without spending too much time and manpower on it.
[Figures: Before and After diagrams]
Logic
Before starting, there is a very important limit that needs to be understood. Because copying snapshots from one region to another is costly for AWS, there is a hard limit on how many copies can run at once: at the time of writing, only 5 snapshot copies could be in progress to a single destination region (check the current EC2 quotas, as this number may have changed). Therefore, CloudWatch Events will play a big role, invoking the AWS Lambda function multiple times until all the snapshots are copied.
The following are the steps for this automation tool to run. Each of these steps will be its own separate Lambda function. It is best to separate the logic into multiple workloads to have better control of the overall automation.
Step 1: Snapshot Tagging
Tagging is a key component because it allows finer control over which snapshots are copied. Therefore, we are going to have a tag with Key: ‘DR’ and an initial value of ‘false’. This just means that the snapshot has not been copied yet. In addition, there has to be some logic behind this execution. Hence, the code will only tag snapshots that have not been tagged in the last 24 hours.
Highlights:
- days: a variable that will depend on your use case.
- describe_snapshots: will pull all the snapshots. A very useful method from Boto3 EC2.
- create_tags: will create the tag for the snapshots. Another method from Boto3 EC2.
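To make the tagging step concrete, here is a minimal sketch rather than the code from the repository. It assumes "not tagged yet" means a snapshot started within the last `days` days that carries no DR tag; the helper names, the `us-east-1` region, and the batch size of 500 are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

DAYS = 1  # look-back window in days; assumption, adjust to your use case


def needs_dr_tag(snapshot, now, days=DAYS):
    """Return True if the snapshot is recent and carries no DR tag yet."""
    tag_keys = {t["Key"] for t in snapshot.get("Tags", [])}
    is_recent = snapshot["StartTime"] >= now - timedelta(days=days)
    return is_recent and "DR" not in tag_keys


def tag_new_snapshots(ec2, now=None, days=DAYS):
    """Tag recent, untagged snapshots with DR=false and return their IDs."""
    now = now or datetime.now(timezone.utc)
    to_tag = []
    # Paginate so accounts with many snapshots are handled completely.
    paginator = ec2.get_paginator("describe_snapshots")
    for page in paginator.paginate(OwnerIds=["self"]):
        for snap in page["Snapshots"]:
            if needs_dr_tag(snap, now, days):
                to_tag.append(snap["SnapshotId"])
    # Tag in batches to stay well below the per-call resource limit.
    for start in range(0, len(to_tag), 500):
        ec2.create_tags(
            Resources=to_tag[start:start + 500],
            Tags=[{"Key": "DR", "Value": "false"}],
        )
    return to_tag


def lambda_handler(event, context):
    import boto3  # imported here so the helpers can be tested without AWS

    ec2 = boto3.client("ec2", region_name="us-east-1")  # hypothetical source region
    return {"tagged": tag_new_snapshots(ec2)}
```

Passing the client in as a parameter keeps the logic testable with a stub, and `OwnerIds=["self"]` keeps public snapshots owned by other accounts out of the listing.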
Step 2: Snapshot Copy
This is the bread-and-butter function of the automation piece. It will copy all the snapshots that have been tagged as DR: false to the secondary region. As mentioned before, this function will have to handle the limit imposed by AWS on sending snapshots to another region.
Highlights:
- source: region | client: the region and Boto3 client that the snapshots are being copied from.
- destination: region | client: the region and Boto3 client that the snapshots are being copied to.
- copy_snapshot: will replicate the designated snapshot from one region to the other specified region. Again, another useful function from EC2 Boto3.
- DR:true tag change: the tags that initially had Key DR and Value false are changed to Key DR and Value true. This allows the code to skip these snapshots the next time it runs, as they have already been replicated.
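The copy step described above can be sketched as follows. This is an illustration under stated assumptions, not the repository code: the function name, the region names, and the per-run budget `MAX_COPIES` are hypothetical. Note that `copy_snapshot` is called on the destination-region client, with `SourceRegion` pointing back at the source.

```python
MAX_COPIES = 5  # per-run budget; assumption based on the limit discussed in this article


def copy_pending_snapshots(ec2_src, ec2_dst, source_region, limit=MAX_COPIES):
    """Copy up to `limit` snapshots tagged DR=false and flip their tag to true.

    Returns a dict mapping each source snapshot ID to its copy's ID.
    """
    response = ec2_src.describe_snapshots(
        OwnerIds=["self"],
        Filters=[
            {"Name": "tag:DR", "Values": ["false"]},
            {"Name": "status", "Values": ["completed"]},
        ],
    )
    copied = {}
    for snap in response["Snapshots"][:limit]:
        source_id = snap["SnapshotId"]
        # The copy request is issued in the destination region.
        result = ec2_dst.copy_snapshot(
            SourceRegion=source_region,
            SourceSnapshotId=source_id,
            Description="DR copy of " + source_id,
        )
        # create_tags overwrites the value of an existing key.
        ec2_src.create_tags(
            Resources=[source_id],
            Tags=[{"Key": "DR", "Value": "true"}],
        )
        copied[source_id] = result["SnapshotId"]
    return copied


def lambda_handler(event, context):
    import boto3  # imported here so the helper can be tested without AWS

    source_region, dest_region = "us-east-1", "us-west-2"  # hypothetical regions
    ec2_src = boto3.client("ec2", region_name=source_region)
    ec2_dst = boto3.client("ec2", region_name=dest_region)
    return copy_pending_snapshots(ec2_src, ec2_dst, source_region)
```

The tag is flipped only after `copy_snapshot` returns, so a request that raises leaves the snapshot at DR: false and it is picked up again on the next scheduled run. If copies from a previous run are still in progress, AWS may still reject new requests; a fuller version would catch that error and stop early.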
Snapshot Delete
This is an optional step. It can simulate a lifecycle policy (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/snapshot-lifecycle.html). This function is particularly useful because it deletes older snapshots that might not be needed. If you had replicated 3 snapshots of the same volume that have different and incremental changes, this function can help you delete the oldest of these snapshots.
Highlights:
- policyDates: a variable that sets how many days of snapshots to keep before deleting. For example, today (May 31st) minus 7 days would be May 24th, so all snapshots from before May 24th would be deleted. This is obviously subject to change according to the business needs behind the policy.
- delete_snapshot: will delete the snapshots.
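A minimal sketch of the retention logic, assuming a 7-day policy as in the example above. The names `POLICY_DAYS`, `cutoff_date`, and `delete_old_snapshots`, the optional `filters` parameter, and the `us-west-2` region are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

POLICY_DAYS = 7  # retention window in days; assumption, set per business needs


def cutoff_date(now, policy_days=POLICY_DAYS):
    """Snapshots started before this moment are eligible for deletion."""
    return now - timedelta(days=policy_days)


def delete_old_snapshots(ec2, now=None, policy_days=POLICY_DAYS, filters=None):
    """Delete owned snapshots older than the cutoff and return their IDs."""
    now = now or datetime.now(timezone.utc)
    cutoff = cutoff_date(now, policy_days)
    kwargs = {"OwnerIds": ["self"]}
    if filters:
        kwargs["Filters"] = filters  # e.g. narrow the scope to tagged snapshots
    # Collect first, delete afterwards, so listing is not affected by deletions.
    expired = []
    paginator = ec2.get_paginator("describe_snapshots")
    for page in paginator.paginate(**kwargs):
        for snap in page["Snapshots"]:
            if snap["StartTime"] < cutoff:
                expired.append(snap["SnapshotId"])
    for snapshot_id in expired:
        # May raise if the snapshot is still in use, e.g. by a registered AMI.
        ec2.delete_snapshot(SnapshotId=snapshot_id)
    return expired


def lambda_handler(event, context):
    import boto3  # imported here so the helpers can be tested without AWS

    ec2 = boto3.client("ec2", region_name="us-west-2")  # hypothetical DR region
    return {"deleted": delete_old_snapshots(ec2)}
```

Without `filters`, this considers every snapshot the account owns in that region, so it is worth restricting the scope with a tag filter before running it against real data.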
Step 3: CloudWatch Events
Setting up a CloudWatch Event will be crucial for replicating all the snapshots successfully. As only 5 snapshots can be copied at a time, the copying function will have to be invoked multiple times until there are no more DR:false tags left. This is where CloudWatch Events will come in handy. Here are the steps to set up an event:
1. Go to CloudWatch Events and Create a New Rule
2. Select the frequency as a cron expression (https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/ScheduledEvents.html) and select the target (the AWS Lambda function)
3. Give a name, description, and create rule
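The same console steps can also be scripted with Boto3. The sketch below is one possible way to do it, not part of the original tool: the rule name, the 15-minute rate, the statement ID, and the target ID are all assumptions chosen for illustration.

```python
def schedule_copy_function(events, lambda_client, function_arn,
                           rule_name="snapshot-copy-schedule",
                           schedule="rate(15 minutes)"):
    """Create a scheduled rule that invokes the copy function; return the rule ARN."""
    # Step 1 and 3 of the console flow: create the rule with a name and description.
    rule_arn = events.put_rule(
        Name=rule_name,
        ScheduleExpression=schedule,  # a cron(...) expression also works here
        State="ENABLED",
        Description="Invoke the snapshot copy function on a schedule",
    )["RuleArn"]
    # Allow CloudWatch Events to invoke the function.
    # This call raises if the same StatementId was already added on a previous run.
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=rule_name + "-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )
    # Step 2 of the console flow: point the rule at the Lambda function.
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "snapshot-copy", "Arn": function_arn}],
    )
    return rule_arn


def main(function_arn):
    import boto3  # imported here so the helper can be tested without AWS

    events = boto3.client("events")
    lambda_client = boto3.client("lambda")
    return schedule_copy_function(events, lambda_client, function_arn)
```

The console adds the invoke permission automatically when a Lambda target is selected; when scripting the rule, the `add_permission` call has to be made explicitly or the rule fires without being able to invoke the function.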
Reference
AWS Lambda: https://docs.aws.amazon.com/lambda/latest/dg/welcome.html
AWS CloudWatch Events: https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html
Boto3: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ec2.html
Github: https://github.com/edreinoso/aws_devops/tree/master/snapshot-copy/python/