Postmortem: Migrating from MongoDB to DynamoDB
Introduction
DynamoDB, a relatively new arrival to the NoSQL party, celebrated its third anniversary earlier this year. We have now seen it deployed in mature products, like the portfolio of online games at TinyCo and our own app store optimization solution at Gummicube. It's pay-as-you-go and extremely scalable, with essentially zero administration overhead. However, it does have some unusual limitations in schema design.
I completed a series of migrations from MongoDB to DynamoDB earlier this year and encountered both roadblocks and successes. Here's a postmortem on what went down; I hope you'll find this write-up useful.
Background: Why Migrate?
Our web application is written in Meteor.js, which uses MongoDB by default, and we had been storing all of our application data that way since day one. However, as our service grew and more data was collected, it became obvious that we really have two types of data.
On one hand, we have the kind of data that directly powers our web application - things like users and the content they create on the system. On the other, we have a huge collection of search data with billions of entries. The two types of data are accessed differently, and it makes more sense to move the massive search result data to a different database that scales easily and can be optimized separately from the rest of the data.
Success: Reduced System Administration
DynamoDB is a fully managed database, so we no longer have to deal with setting up monitoring tools, handling scaling, applying system and security updates... the list goes on. Less time managing servers means more time writing code and building the product.
Success: Archiving Old Data
We know that recent results are accessed far more often than old ones. Eventually, very old data that the application no longer queries against (say, data over two years old) can be taken off DynamoDB and moved to separate, slower storage like Amazon S3 or Glacier.
So instead of one giant table, we made a separate table for every single month. For example, we have the following table names:
Keyword_Search_Result_2014-12
Keyword_Search_Result_2015-01
Keyword_Search_Result_2015-02
...
Then we added logic at the application layer to query the correct table. Granted, this could have been done in MongoDB as well, by separating the search results into different collections based on the month they were recorded. But the setup in DynamoDB has a few more advantages, as we will see.
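A minimal sketch of what that routing logic can look like (the helper name here is made up for illustration, not our actual code):

// Picks the monthly table for a given date.
function monthlyTableName(date) {
  var year = date.getUTCFullYear();
  var month = ('0' + (date.getUTCMonth() + 1)).slice(-2); // zero-pad
  return 'Keyword_Search_Result_' + year + '-' + month;
}

monthlyTableName(new Date(Date.UTC(2015, 0, 15))); // "Keyword_Search_Result_2015-01"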
Once the tables are set up, we can set the read and write throughput on each table according to how often it is accessed. You can use the consumed read and write capacity metrics in the AWS web console to help you determine the throughput you need for things to run smoothly.
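Adjusting a table's throughput is a single UpdateTable call. Here is a rough sketch with the AWS SDK for Node.js - the capacity numbers are purely illustrative, not our production settings:

var AWS = require('aws-sdk');
var dynamodb = new AWS.DynamoDB({ region: 'us-east-1' });

// Dial an old month's table down to a trickle; the current month's
// table keeps a much higher provisioned throughput.
dynamodb.updateTable({
  TableName: 'Keyword_Search_Result_2014-12',
  ProvisionedThroughput: { ReadCapacityUnits: 1, WriteCapacityUnits: 1 }
}, function (err, data) {
  if (err) console.error(err);
});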
Utilize Burst Throughput
According to the DynamoDB documentation, "DynamoDB currently reserves up to 5 minutes (300 seconds) of unused read and write capacity". This allows us to turn the capacity on the historical monthly tables down to a very low level, since access to that historical data tends to come in infrequent bursts.
The official documentation advises against "designing your application so that it depends on burst capacity being available at all times". For us, though, an occasional slowdown when accessing historical data is acceptable for the end-user experience, so we were able to benefit from burst throughput here.
DynamoDB Auto-Scaling
There are also some open source projects that aim to provide an auto-scaling solution for DynamoDB. For example, we can lower the capacity in the evenings, since fewer clients are online, then bring it back up during the day. You can check out Dynamic DynamoDB on the AWS official blog for more details.
But keep in mind - you CANNOT decrease capacity on a table more than 4 times in a given calendar day.
Issue: Handling Multi-Key Queries
One of the first limitations you will run into with DynamoDB is indexing. You can either have a hash key alone, or a hash key + range key combination. In short, there is no support for an index across multiple arbitrary keys.
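For reference, here is roughly what a hash + range key primary index looks like when creating a table with the AWS SDK for Node.js (the attribute names mirror the examples below; the types and throughput values are illustrative):

var AWS = require('aws-sdk');
var dynamodb = new AWS.DynamoDB({ region: 'us-east-1' });

// The primary index is at most a hash key plus a range key.
dynamodb.createTable({
  TableName: 'Keyword_Search_Result_2015-01',
  AttributeDefinitions: [
    { AttributeName: 'keyword',    AttributeType: 'S' },
    { AttributeName: 'created_on', AttributeType: 'S' }
  ],
  KeySchema: [
    { AttributeName: 'keyword',    KeyType: 'HASH' },  // hash (partition) key
    { AttributeName: 'created_on', KeyType: 'RANGE' }  // range (sort) key
  ],
  ProvisionedThroughput: { ReadCapacityUnits: 5, WriteCapacityUnits: 5 }
}, function (err, data) {
  if (err) console.error(err);
});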
Our billions of search result entries look something like this:
{
keyword : "codementor",
platform : "iPhone",
country : "US",
results : "...",
created_on : "2015-01-01T10:00:00Z"
}
And our most common query on this data in MongoDB matches against keyword, platform, country, and created_on. For example: give me the search results for the keyword "codementor" on iPhone in the US, on January 1st, 2015. How the heck do we model this in DynamoDB?
Solution: Combine Fields, Split Out Tables
Even though we do need to query using all four fields, these four fields are not created equal. country and platform are basically ENUMs, with a small, limited set of supported values such as "iPhone" and "iPad", or "US" and "CA".
We decided to merge country and keyword together into one field, since the application never queries search results with just a country but no keyword, or just a keyword and no country. We then split out the platform at the table level, since there are not very many platforms out there (sorry, BlackBerry). Having more tables also gives us more control over capacity planning for different data, as I mentioned earlier.
In the end, we ended up with table names like:
Keyword_Search_Result_iPhone_2015_01
Keyword_Search_Result_iPad_2015_02
...
and each search result looks something like this:
{
keyword : "codementor__US",
platform : "iPhone",
results : "...",
created_on : "2015-01-01T10:00:00Z"
}
...
{
keyword : "codementor__CA",
platform : "iPad",
results : "...",
created_on : "2015-01-02T10:00:00Z"
}
NOTE: I still ended up leaving the platform field in the data, even though the platform can be inferred from the table it comes from. Remember, de-normalization can be your friend in NoSQL.
The "unpacking" of the country code and the selection of table are then tackled at our data access layer of the codebase. By the time it gets to the rest of the code, we were dealing with the same JSON object as we were back with MongoDB.
A word of caution: this particular solution worked for us because it matched our use case and our product. It's not meant as a guideline, but more as an example of how you may need to think outside the box to get DynamoDB to work the way you need it to.
Issue: No Native Date Object
Another glaring issue is the fact that DynamoDB has no native Date or DateTime type. Common practice is to convert dates into Unix timestamps and store a number instead:
{
keyword : "codementor__US",
platform : "iPhone",
results : "...",
created_on : 1437764250
}
And we just always convert dates to timestamps at the application layer before querying DynamoDB. In my particular case, I can even make created_on a range key, so I get sorting and whatnot for free. Problem solved, right?
Gotcha: Querying Dates
Turns out - not quite. Our application typically queries for results on a particular day, which means I actually need to do a Query instead of a GetItem command. For this MongoDB find statement:
.find({
"keyword" : "codementor__US",
"created_date" : new Date(2015, 0, 1)
});
I need to run this Query against DynamoDB (showing just the key conditions):
"keyword" : {
  "AttributeValueList" : [ { "S" : "codementor__US" } ],
  "ComparisonOperator" : "EQ"
},
"created_on" : {
  "AttributeValueList" : [
    { "N" : "1420070400" },
    { "N" : "1420156799" }
  ],
  "ComparisonOperator" : "BETWEEN"
}
where 1420070400 and 1420156799 are the Unix timestamps for 2015-01-01T00:00:00Z and 2015-01-01T23:59:59Z, respectively.
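In application code, that whole-day range query looks roughly like this with the AWS SDK DocumentClient (a sketch; the table name and values are illustrative):

var AWS = require('aws-sdk');
var docClient = new AWS.DynamoDB.DocumentClient({ region: 'us-east-1' });

// Query one keyword's results for 2015-01-01 (UTC) using a timestamp range.
var dayStart = Date.UTC(2015, 0, 1) / 1000;     // 1420070400
var dayEnd   = Date.UTC(2015, 0, 2) / 1000 - 1; // 1420156799

docClient.query({
  TableName: 'Keyword_Search_Result_iPhone_2015_01',
  KeyConditionExpression: 'keyword = :k AND created_on BETWEEN :start AND :end',
  ExpressionAttributeValues: {
    ':k': 'codementor__US',
    ':start': dayStart,
    ':end': dayEnd
  }
}, function (err, data) {
  if (err) console.error(err);
  else console.log(data.Items);
});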
Yes, it can be queried, but there are two problems.
First, Query commands are much slower than a straight GetItem call, where you supply the exact hash key and range key to match.
Second, DynamoDB offers BatchGetItem, which is useful for fetching search results for multiple keywords at once - something that happens often in our system - but it only works with exact-key GetItem lookups, not range Queries. The overhead of each individual API request to DynamoDB really adds up given the number of keywords our application requests. We needed a different solution.
Solution: Store Date as String
After confirming with our use cases that we only ever need to store one search result per day, we decided to store just the formatted date as a string:
{
keyword : "codementor__US",
platform : "iPhone",
results : "...",
created_on : "2015-01-01"
}
Now, we can get our data more quickly with a GetItem call:
"keyword" : {
"S" : "codementor__US",
},
"created_date" : {
"S" : "2015-01-01"),
}
...which allows us to fetch results in batches as well. Now things are running blazingly fast. There's also an added bonus - the data is human-readable in the DynamoDB web console, which saves developers some time.
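With exact keys, batching several keywords into one request is straightforward. A sketch with the DocumentClient (the extra keywords are just illustrative; BatchGetItem accepts up to 100 keys per request):

var AWS = require('aws-sdk');
var docClient = new AWS.DynamoDB.DocumentClient({ region: 'us-east-1' });

// Fetch the 2015-01-01 results for several keywords in one round trip.
docClient.batchGet({
  RequestItems: {
    'Keyword_Search_Result_iPhone_2015_01': {
      Keys: [
        { keyword: 'codementor__US', created_on: '2015-01-01' },
        { keyword: 'meteor__US',     created_on: '2015-01-01' },
        { keyword: 'mongodb__CA',    created_on: '2015-01-01' }
      ]
    }
  }
}, function (err, data) {
  if (err) console.error(err);
  else console.log(data.Responses['Keyword_Search_Result_iPhone_2015_01']);
});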
I'm not sure whether storing strings uses a few more bytes than storing numbers, but if it does, it hasn't made a noticeable difference even across our billions of records.
Summary
DynamoDB looks a lot like many other NoSQL solutions, but there are some significant design limitations in exchange for its zero maintenance and effortless scaling.
You may have to get creative in designing your tables and break out of old paradigms to get the job done. But now that it's done, I have to admit that I enjoy spending less time maintaining and configuring our MongoDB cluster, because that frees up my time to focus on actually building our product!