Design for Scalability with NoSQL
There was a time when technology companies had humble beginnings and the opportunity to scale gradually as they grew toward greatness.
But in this era of unicorns, a technology startup can attract millions of users in no time. This highlights the need to design solutions for scalability from the start.
In this post, I will talk about some key challenges and solution approaches for designing and implementing a scalable solution.
Overview
The recipe for a scalable solution includes many ingredients: compute infrastructure, memory, storage, technology selection, design and architecture, all the way down to implementation details.
However, in this post, I will mainly focus on one of the most important aspects: data storage and retrieval. I will also show how we can take advantage of NoSQL datastores to simplify our business logic and reduce application code complexity.
Let's take the example of a social networking platform like Facebook that requires a highly scalable solution to accommodate millions of users and a rich set of features. We need to build a system where users can create profiles, connect with friends, create posts, receive notifications to stay updated with happenings in the network, and find nearby friends or connections.
Data Storage and Retrieval
Traditionally, most people used RDBMS (Relational Database Management Systems) for almost every situation. But in the current technological landscape, we have more than a few options to store and retrieve data efficiently. Now the challenge is picking the right data storage option for a solution.
There are different categories of NoSQL databases, including Key-Value Stores, Document Stores, Column Family Stores, Graph Stores, and more. Every NoSQL database has use cases where it performs best, so it is very important to select the right database for the job.
We have many features in our example social network platform, and every feature has its own data storage and retrieval needs. Instead of selecting a single database for the whole system, we will use the right database for each job.
Technology Selection and Data Modeling
So the first step is selecting the right database and data model for each area of the social network. For this, we need to understand the true nature of the data involved in each scenario and then evaluate the data storage and access patterns to define the most suitable data model.
Let's define the nature of data for each feature, map it to a respective category of the NoSQL database, and then define the data model that is most suitable for each social network scenario.
User Profile
The user profile consists of mandatory and optional information about a user. Some users prefer to provide more information than others, so the schema should be flexible enough to accommodate this.
The basic profile is well structured, like: name, age, gender, email, and phone number, while the extended profile is more descriptive and less structured, like: hobbies, short description, favorite movies, goals, achievements, etc.
Normally, the basic profile is searchable so that other users can search for a specific user by email, phone, or name and narrow down the results using age and gender filters.
However, the extended profile is just a blob of dumb information displayed when someone views the user profile.
Based on the above parameters, a user profile can be viewed as a document that is searchable on some of its fields. The other fields are just informational, and in most cases the document is accessed as a whole.
We can use a document store like MongoDB or Azure Cosmos DB (MongoDB API or SQL API). A sample profile document (just a few fields) will look like:
{
"_id" : "A1B2C3D4",
"fullname" : "Jhon Doe",
"email" : "jhon@example.com",
"gender" : "male",
"Birthday" : "23/06/1983",
"country" : "Australia",
"hobbies" : [ "Reading", "Cooking”],
"about_me" : "I am a software engineer living in western Australia and ...",
"favorite_movies" : [
{"name" : "Transformer", "category" : "Scify", "rating" : "good"},
{"name" : "Titanic", "category" : "Romance", "rating" : "good"}
]
}
So when we need to display a user's profile page, we do not need any joins or complex processing. We just fetch the profile document from the users collection in a single call to the datastore and present it in a formatted template. The query will look something like:
db.users.findOne( { "email": "jhon@example.com" } )
The above design is workable, but we have to create an index on all fields that are searchable in the document so that we can get decent performance when someone searches for users by providing filters (e.g., list all male users living in Australia between the ages of 18 and 22).
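As a rough sketch in Python with pymongo (the connection string, the database name social, and the exact field choices are assumptions), creating those indexes could look like:
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
db = client.social  # hypothetical database name

# Exact lookups by email (unique per user)
db.users.create_index([("email", ASCENDING)], unique=True)

# Compound index to support filtered searches (e.g., gender + country + birthday)
db.users.create_index([("gender", ASCENDING), ("country", ASCENDING), ("birthday", ASCENDING)])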
We can also introduce a new database into this equation to support fast and flexible searching based on multiple fields. For this purpose, we can use Elasticsearch and maintain an index of all searchable fields.
When a new profile document is inserted or updated in MongoDB, we also send the selected fields to Elasticsearch to maintain a high-performance index. When someone searches for a user with given criteria, we first hit Elasticsearch to quickly look up matching users and then fetch the detailed profile documents from MongoDB.
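A minimal sketch of this two-step pattern, assuming an Elasticsearch index named users and the Python client for Elasticsearch 8.x (the endpoint and field selection are illustrative):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

def index_profile(profile):
    # Mirror only the searchable fields of a profile into Elasticsearch
    fields = ("fullname", "email", "gender", "country", "birthday")
    searchable = {k: profile[k] for k in fields if k in profile}
    es.index(index="users", id=profile["_id"], document=searchable)

def find_user_ids(gender, country):
    # Step 1: find matching user ids in Elasticsearch
    result = es.search(index="users", query={
        "bool": {"filter": [
            {"match": {"gender": gender}},
            {"match": {"country": country}},
        ]}
    })
    return [hit["_id"] for hit in result["hits"]["hits"]]

# Step 2: fetch the full profile documents from MongoDB
# profiles = db.users.find({"_id": {"$in": find_user_ids("male", "Australia")}})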
Friends and Connections
Friends and connections in a social network are the textbook example of a graph-based structure, so we can quickly decide to go with a graph database.
For this purpose, we can use a solution like Neo4j, OrientDB, or Azure Cosmos DB (Graph API). In this case, users are stored as nodes or vertices and connections are made using relations or edges.
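As an illustrative sketch, vertices and edges could be created like this (Python with the gremlinpython driver; the endpoint is assumed, and the graph must allow user-supplied vertex ids, as the friends-of-friends query below also assumes):
from gremlin_python.driver import client

gc = client.Client("ws://localhost:8182/gremlin", "g")  # assumed endpoint

# Users become 'user' vertices; connections become 'friend' edges
gc.submit("g.addV('user').property(id, 'Jhon')").all().result()
gc.submit("g.addV('user').property(id, 'Nina')").all().result()
gc.submit("g.V('Jhon').addE('friend').to(g.V('Nina'))").all().result()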
This will help us write simple and efficient queries for complex relationship scenarios. For example, we can find all "friends of friends" for the user Jhon with a simple Gremlin query:
g.V('Jhon').outE('friend').inV().hasLabel('user').outE('friend').inV().hasLabel('user')
A graph database is not a good fit for every scenario, but when the nature of the data is graph-like, nothing is simpler or more efficient.
Creating Posts
Posts created by users are also document-like, with titles, descriptions, and maybe some images or videos (images and videos should be stored in blob storage or a CDN, while their URLs become part of the post document).
Then we also need to allow comments and reactions (like, dislike) on these posts.
If we evaluate the posts' use case and access patterns, we can identify the following requirements (not exhaustive, but fine for our example):
- Posts will grow exponentially as users and engagement increase
- A good post can attract 100s or 1000s of comments
- A popular post can attract 1000s of reactions (likes, dislikes)
- Users interact with a post while it is new, but as it gets older, its access frequency decreases
To start with, we can use a document store like MongoDB or Azure Cosmos DB (MongoDB API or SQL API) for posts. It is possible to put comments in the same document as the post, but since comments on a post can grow into the 1000s, this becomes problematic.
Document stores impose limits on maximum document size (MongoDB, for example, caps a document at 16 MB). Besides, we do not need to show all of a post's comments along with the post every time.
We can just show the post and fetch the comments only when the user explicitly wants to view them (like Facebook, Twitter, and others do).
Also, if we look closely, comments are similar to posts (comments can have text, images, likes, dislikes, and sub-comments). So we can treat comments as posts and use a parent_id field to link them: if parent_id is null, it's a post; if it contains the identifier of a post, it's a comment on that post.
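A minimal sketch of the two document shapes (pymongo again, reusing the db handle from earlier; the field names are illustrative):
import datetime

# A post: parent_id is null (None)
db.posts.insert_one({
    "post_id": "abc123",
    "parent_id": None,
    "author": "jhon@example.com",
    "title": "A day at the golf club",
    "text": "Had a great day at the golf club!",
    "image_urls": ["https://cdn.example.com/img/42.jpg"],  # media lives in blob storage / CDN
    "created_at": datetime.datetime.utcnow(),
})

# A comment: parent_id points at the post it belongs to
db.posts.insert_one({
    "post_id": "def456",
    "parent_id": "abc123",
    "author": "nina@example.com",
    "text": "Looks fun!",
    "created_at": datetime.datetime.utcnow(),
})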
Furthermore, considering the continuously growing data, we can partition the data based on time.
Every few days or weeks, we switch to a new partition, and our system can scale indefinitely.
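One simple (assumed) scheme derives the partition name from a post's creation time, e.g., one MongoDB collection per month:
import datetime

def posts_collection_for(ts):
    # Route a post to a monthly partition, e.g., posts_2018_04
    return db[f"posts_{ts.year}_{ts.month:02d}"]

# usage: posts_collection_for(post["created_at"]).insert_one(post)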
Please note this is just a simple example; there is more complexity involved in this scenario, but that is out of scope for this high-level discussion.
For example, if we create time-based partitions, then new posts will land in newly created partitions and older posts in old partitions.
With time, activity on old posts decreases, so older partitions are accessed less frequently while new partitions get more hits. This creates hotspots in the newer partitions, which is not ideal.
There are solutions to these problems, but we can discuss them in another article.
We can store a summary of comments and reactions in the post document (or a separate document), because we usually just need to show summarized numbers (1,000 comments, 5,234 likes, and 2,342 dislikes).
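Keeping those summary numbers current is then a cheap atomic increment on the post document (a sketch with pymongo; comment_count and like_count are assumed field names):
# Called whenever a new comment arrives for post abc123
db.posts.update_one({"post_id": "abc123"}, {"$inc": {"comment_count": 1}})

# Called whenever a new 'like' reaction arrives
db.posts.update_one({"post_id": "abc123"}, {"$inc": {"like_count": 1}})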
The details of each comment and reaction go in separate documents that we retrieve only when the user asks for them. Our queries stay simple and efficient:
// Get complete document of a specific post
db.posts.findOne( { "post_id": "abc123" } )
// Get all comments of a specific post (when the user requests to see comments)
db.posts.find( { "parent_id": "abc123" } )
Notifications
Notifications are an essential part of a social network and keep users engaged. Notifications are generated when there is activity in the network, such as someone liking a photo, creating a post, or commenting on a post.
Let's try to understand the notification requirements:
- The number of notifications generated in the network increases as users and engagement increase
- The number of notifications generated per second can grow to 1,000s or a lot more
- A single notification doesn't contain a lot of data
- Notifications are relevant for some time, then become irrelevant (I want to be notified when someone comments on my post, but that notification is of no use after a month or a year)
- A single action can generate one notification or one million notifications. For example, when a user who has two friends performs some activity, we just need to notify those two friends (two notifications). But when Elon Musk performs some activity, we need to notify the millions of fans who follow him.
- Notifications do not need to be searchable; we just need to access them in time order.
Taking the above requirements into consideration, we can use a column family datastore like Cassandra. With Cassandra, we get insanely fast writes, and if we use the correct partitioning scheme, we can retrieve the latest notifications for a user amazingly fast.
Even if we receive millions of notifications every second, we can linearly scale our datastore by adding more nodes (Cassandra provides active-active, highly available clusters using consistent hashing).
Also, we do not need to purge old or useless notifications explicitly: we can set a TTL on a record so that it is purged automatically (Cassandra's compaction process takes care of it implicitly).
From a data model perspective, we can partition the data on the unique user identifier and use a descending timestamp as the clustering key. This ensures that all notifications for a user are stored in a single partition, implicitly in the order we need. Our data model will look something like:
CREATE TABLE notifications (
user_id text,
message text,
creation_time timestamp,
PRIMARY KEY(user_id, creation_time)
) WITH CLUSTERING ORDER BY (creation_time DESC);
// Adding a new notification to the table with a TTL (expires after 90,000 seconds, about 25 hours)
INSERT INTO notifications (user_id, message, creation_time) VALUES ('usr123', 'Jhon commented on your post', '2018-04-01 20:05-0700') USING TTL 90000;
Please note that the above data model is just an example; an actual notification model would have more fields, like: who generated the notification, what action was performed, and other related information.
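Reading the latest notifications then becomes a cheap single-partition query. A sketch with the DataStax Python driver (the contact point and the keyspace name social are assumptions):
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])  # assumed contact point
session = cluster.connect("social")  # hypothetical keyspace

# The partition key pins the query to a single partition, and the clustering
# order returns the newest notifications first with no sorting step
rows = session.execute(
    "SELECT message, creation_time FROM notifications WHERE user_id = %s LIMIT 20",
    ("usr123",),
)
for row in rows:
    print(row.creation_time, row.message)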
Nearby Friends
Let's talk about the last feature of our example social network: users can check in at different locations and see whether a friend or connection is nearby (i.e., someone who also checked in at that location recently).
Here we need to define locations and track when a user checks in at a location. This information is only valid for some time and should be purged when the user checks out, or maybe after an hour. Also, we need to find all the people who are both in a user's friend list and have checked in recently at a given location.
An interesting way to solve this is with a data structure store like Redis. Many people use Redis as a key-value cache (since it is an in-memory datastore), but Redis can do a lot more than that.
In addition to key-value pairs, Redis can store lists (usable as queues), sets, sorted sets (usable as priority queues), hashes, and more.
In this case, we can create a SET in Redis for each location, and whenever a user checks in at that location, we add the user's identifier to that SET.
We also keep another SET for each user that contains all friends and connections of that user. Whenever we need to find nearby friends, we just take the intersection of both sets, and it gives us the friends nearby.
// Jhon checks in at the Golf Club (added to both the active SET and the
// main SET so the check-in is visible immediately)
SADD locations:GolfClub:active jhon@example.com
SADD locations:GolfClub jhon@example.com
// Nina checks in at the Golf Club
SADD locations:GolfClub:active nina@example.com
SADD locations:GolfClub nina@example.com
// Nina wants to check if any of her connections are nearby
SINTER users:nina@example.com:connections locations:GolfClub
In the above example, users:nina@example.com:connections is a Redis SET containing Nina's friends and connections.
We also need to make sure that old check-ins are purged, which is why we maintain an active SET (i.e., locations:GolfClub:active) and an old SET (i.e., locations:GolfClub:old) alongside the main SET.
Now we can use a small job that runs every 30 minutes, renames active to old, and then rebuilds the main SET (i.e., locations:GolfClub) as the union of the active and old SETs. This keeps roughly the last hour's check-ins in the main SET. The job runs the following commands on Redis:
// Rename the active SET to old (overwriting the previous old SET)
RENAME locations:GolfClub:active locations:GolfClub:old
// Rebuild the main SET as the union of the (now empty) active SET and the old SET
SUNIONSTORE locations:GolfClub locations:GolfClub:active locations:GolfClub:old
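The same job as a sketch in Python with redis-py (connection details assumed), with a guard for the case where nobody checked in during the last interval, since RENAME fails when the source key does not exist:
import redis

r = redis.Redis()  # assumed local Redis instance

def rotate_checkins(location):
    # Runs every ~30 minutes; keeps roughly the last hour of check-ins in the main SET
    active = f"locations:{location}:active"
    old = f"locations:{location}:old"
    main = f"locations:{location}"
    if r.exists(active):  # RENAME raises an error if the source key is missing
        r.rename(active, old)
    r.sunionstore(main, [active, old])

rotate_checkins("GolfClub")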
By using a datastore that naturally supports our data model, we avoid complex calculations in the application layer.
Conclusion
The selection of the right datastore not only helps with scalability and performance, but can also simplify complex computational scenarios. It is very important to evaluate the true nature of your data (for each feature or scenario) and then select the right datastore that natively supports it.
Always evaluate and understand the data access patterns in your application before you define the data model. And, last but not least, data partitioning is the key to horizontal scalability in NoSQL datastores.