Codementor Events

Backup your AWS Redshift cluster properly and sleep better!

Published Aug 24, 2018

"Everything fails, all the time." These are the famous words of the one and only Werner Vogels, and anyone working in the IT industry has to agree with Amazon's CTO and Vice President. Just think back: when was the last time you had an incident in production?

When was the last time your customers experienced an outage using your service? When was the last time something failed and caused a loss in revenue? Things break for various reasons, and we all know it. It's difficult and expensive to keep application uptime close to 100%. What we do is try hard to eliminate any possible failures, and keep backup plans so that the failures that do happen are resolved promptly and with minimal impact on our customers and revenue.

In this article, I will focus on how to properly back up your AWS Redshift cluster. If you use AWS Redshift, big data is probably a large part of your business, and any failure or data loss without a proper backup could have a big impact on it. Since we know that everything fails, all the time, let's limit the possible impact of a Redshift cluster failure with a proper backup solution!

What is this AWS Redshift you ask?

Let's start with a brief intro to AWS Redshift so we are all on the same page before digging deeper into the backup strategies. Amazon Redshift is a fast, fully managed data warehouse. You can load up to petabytes of data into Amazon Redshift clusters and then analyze it using standard SQL and your existing Business Intelligence tools. Redshift is a very cost-effective solution for analyzing your data and, as Amazon says, it will cost you "less than a tenth the cost of traditional solutions".

Backing up your Redshift Clusters

Since Redshift is fully managed by Amazon, you get automated backups by default, out of the box: they are enabled when you create your Redshift cluster. Amazon will always attempt to maintain at least three copies of your data - the original and a replica on the compute nodes, plus a backup in Amazon S3 (Simple Storage Service). Managing your automated backup settings from the AWS Console is very simple, so we won't dwell on this type of backup. Going forward, we will talk about backing up your Redshift cluster manually - why and how would you do that?

Why manual snapshots if we have automated out of the box?

If for any reason you delete your Redshift cluster, all of the automatically taken snapshots will be deleted as well. That means that if the cluster is deleted - for example, by human error - all your data will be gone. That's the main reason you should be taking regular manual snapshots too. Manual snapshots are not removed automatically by Amazon, and they persist even if the whole cluster is deleted. If the cluster was deleted by mistake, you can always restore it from the most recent manual snapshot. Since there is a default limit of 20 manual snapshots per cluster, you also need to remove older snapshots to make room for new ones.

So how does one create the manual snapshot?

That's a good question! It's very easy to go to the AWS Redshift console and take a manual snapshot by clicking one button, but that's not very practical to do on a daily basis. Instead, it would be great to have a script which does it for us. It would be even better if this script could run serverless - meaning that rather than deploying it on a virtual machine we would need to manage, we would just utilize AWS Lambda to run it. It would also be very good if this script removed old snapshots and notified us in case any error occurs. Well, I have great news - such a script exists and you are free to use it for your needs!

Where can I find the script and how to start using it?

Another great question, you are on fire! The script can be found on GitHub. I will now explain the functionality of the script as well as how to get started using it (the usage is also explained in the Readme file for your convenience). I will not go into much detail on setting up the various AWS services, but will rather link to the documentation for those. Also, feel free to comment below if anything is not clear with the setup and I will gladly try to assist!

Creating manual snapshots

The function for creating manual snapshots (redshift_manual_snap) loops through all Redshift automated snapshots and sorts them by the most recent date. A manual snapshot is then taken from the latest automated snapshot of each cluster in the AWS region where the function is deployed.
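
To make the selection step concrete, here is a minimal sketch of the per-cluster logic in plain Python. The function name is illustrative (not the script's actual code); the dictionary keys mirror the entries that boto3's `describe_cluster_snapshots` call returns.

```python
# Illustrative sketch, not the actual script: pick the newest automated
# snapshot for each cluster from describe_cluster_snapshots() entries.
def latest_automated_per_cluster(snapshots):
    latest = {}
    for snap in snapshots:
        cluster = snap["ClusterIdentifier"]
        best = latest.get(cluster)
        # Keep whichever snapshot was created most recently per cluster.
        if best is None or snap["SnapshotCreateTime"] > best["SnapshotCreateTime"]:
            latest[cluster] = snap
    return latest

# The real work would then happen per cluster with something like:
#   redshift.copy_cluster_snapshot(SourceSnapshotIdentifier=...,
#                                  TargetSnapshotIdentifier=...)
# which copies an automated snapshot into a manual one.
```

The comparison works on whatever `SnapshotCreateTime` holds, so the same logic applies to the `datetime` objects boto3 returns.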

Removing manual snapshots

The function for removing manual snapshots (redshift_snapshot_remover) loops through all Redshift manual snapshots and compares each snapshot's creation date with the retention period, which is set as an environment variable (see the section below). If certain snapshots need to persist, the function can be adjusted to exclude them. Alternatively, there is an environment variable called max_back which specifies the maximum time to look back for old snapshots: if you need to keep snapshots taken months ago, set max_back to, for example, 30, and the function will not remove any snapshot taken more than 30 days ago.
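
The deletion rule boils down to a single age check. Here is a hedged sketch of that decision (the function name is mine, not the script's): a snapshot is only removed when it is older than the retention period but still inside the max_back window, which is what protects legacy snapshots.

```python
from datetime import datetime, timedelta

# Illustrative helper, not the actual script: decide whether a manual
# snapshot should be deleted given the two settings described above.
def should_delete(create_time, now, ret_period_days, max_back_days):
    age = now - create_time
    # Delete only if older than the retention period, but NOT older than
    # the max_back look-back window (those are treated as legacy keepers).
    return timedelta(days=ret_period_days) < age <= timedelta(days=max_back_days)
```

With ret_period=7 and max_back=30, a 14-day-old snapshot is removed, a 4-day-old one is kept, and an 84-day-old legacy snapshot is left untouched.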

SNS notifications on failures

The function that pushes a message to AWS SNS (Simple Notification Service), notify_devops, publishes a message to an SNS topic. The purpose of this function is to notify the team about any failure in taking a snapshot. If you need help creating the SNS topic in AWS, the Getting Started page is a great resource. The SNS topic ARN then needs to be defined as an environment variable when creating the Lambda function. It is recommended to subscribe an email distribution list to this SNS topic so the correct team or service is notified about failures.
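
The publish call itself takes a topic ARN, a subject, and a message body. The sketch below (a hypothetical helper, not the script's notify_devops) just assembles those arguments; the resulting dictionary would be passed straight through to `boto3.client("sns").publish(**params)`.

```python
# Illustrative sketch: build the keyword arguments for sns.publish()
# when a snapshot of a given cluster failed with a given error.
def build_failure_notification(cluster_id, error, topic_arn):
    return {
        "TopicArn": topic_arn,
        "Subject": "Redshift snapshot failure: " + cluster_id,
        "Message": "Taking a manual snapshot of " + cluster_id
                   + " failed: " + str(error),
    }

# Usage (requires AWS credentials, so only shown as a comment):
#   boto3.client("sns").publish(**build_failure_notification(cid, err, arn))
```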

Logging

When this Lambda function runs, all logs are written to Amazon CloudWatch Logs. No setup is needed for this; the only requirement is to have Lambda configured with the correct role/permissions (see below). After the first run, a new Log Group will be created automatically (something like /aws/lambda/YOURFUNCTIONNAME) and logs will be written on each run of the function.

Initial Setup

Lambda Function

This Lambda function should be set up in the AWS region where the Redshift cluster exists and where you wish to take the manual snapshots. If you are completely new to AWS Lambda, take a look at the Getting Started tutorials for a quick introduction. The function should be set up with the Python 2.7 runtime, and the Handler should be the main function, lambda_function.lambda_handler. In order for this function to work, a proper IAM role needs to be attached - this IAM role needs access to Redshift as well as to CloudWatch Logs. Since the function needs to read and write data, it's recommended for the role to use the AWS managed policies AmazonRedshiftFullAccess and CloudWatchLogsFullAccess - if you need help creating this IAM role, please consult the AWS documentation or add a comment below. It's also recommended to adjust the Lambda timeout based on the environment and the number and size of your Redshift clusters, but 30 seconds should be fine for most cases.
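
For orientation, this is the general shape of the entry point Lambda invokes when the Handler is set to lambda_function.lambda_handler. It is a hypothetical skeleton of the overall flow, not the script's real internals:

```python
# Hypothetical skeleton of lambda_function.py - the real script's
# internals differ; this only shows the handler signature and flow.
def lambda_handler(event, context):
    failed_clusters = []
    # 1. Copy the latest automated snapshot of each cluster to a manual one.
    # 2. Remove manual snapshots older than ret_period (but newer than max_back).
    # 3. On any error, append the cluster to failed_clusters and publish
    #    the details to the SNS topic via notify_devops.
    return {"failed_clusters": failed_clusters}
```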

Triggers

Amazon takes automated Redshift cluster snapshots multiple times per day, usually every 8 hours or following every 5 GB of data change. This script is designed to take a manual copy of the latest automated snapshot for each cluster and WILL NOT take a copy of EACH automated snapshot. Since the script takes one manual snapshot per run, it is recommended to set up the Lambda trigger as a CloudWatch Events schedule: for example, cron(0 4 * * ? *) will invoke the function every day at 4 AM GMT.

Environment variables

The script requires 3 environment variables to be set. They define the retention period, the SNS topic ARN, and the maximum time to look back for any manual snapshots - the last one is there to avoid deleting old legacy snapshots which may still be needed in the future. An example of the variables' setup can be seen here:

{
  "ret_period":"7",
  "sns_topic":"arn:aws:sns:us-east-1:123456789101:topic_name",
  "max_back":"25"  
}
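
Inside a Lambda function these values arrive through os.environ, and everything there is a string, so the numeric settings have to be cast. A small illustrative helper (not part of the script) makes that explicit:

```python
import os

# Illustrative helper: read the three required settings from Lambda
# environment variables, casting the numeric ones from strings to ints.
def read_config():
    return {
        "ret_period": int(os.environ["ret_period"]),
        "sns_topic": os.environ["sns_topic"],
        "max_back": int(os.environ["max_back"]),
    }
```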

Conclusion

If you made it all the way to this point, I salute you 😇 I have personally used this script, or slightly different flavors of it, on a few different projects. The good thing is that once you do the initial setup, you never need to worry about your manual snapshots again. If anything goes wrong for any reason, AWS SNS will send you a notification and you can troubleshoot the error. I hope this helps some of you achieve your Disaster Recovery goals!

Discover and read more posts from Petr Hecko