Data Architecture Design Basics
What this study covers
Where does the user authentication data go? What about our regularly read content? What about images and regularly served files?
This study is meant to help a data architect decide where each piece of data in a project should go, based on the strengths and weaknesses of each DB, and to simplify the decision process when designing a new data architecture.
What this study does not cover
How much will this cost? Can I find developers that can implement my design? How much data can this design handle? How can I monitor the efficiency of my choice?
This study does not cover the cost, ease of adoption, or size of the data in the chosen databases.
If your project is small, it is best to stick to a single database, as they are all capable of handling up to 10^5 records very efficiently. Once this threshold is exceeded, you should start looking for ways to make the architecture more efficient.
Monitoring is not addressed in this study as it is a topic of its own and each implementation requires separate considerations. Just keep in mind that as long as you set up proper indices, the DB is rarely the first culprit for efficiency problems.
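As a minimal sketch of what that looks like in MySQL (using the hypothetical User and Preference tables from the test scenarios further down), index the columns that show up in your WHERE and JOIN clauses. eg:
CREATE INDEX idx_user_email ON User (Email);
CREATE INDEX idx_preference_userid ON Preference (UserID);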
Performance tests in existing research
I have based my decisions on the following papers and articles:
2017: Microsoft Azure article “NoSQL vs SQL” (link)
2014: “Performance Evaluation of NoSQL Databases: A Case Study” (link)
John Klein, Ian Gorton, Neil Ernst, Patrick Donohoe (Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA, USA); Kim Pham, Chrisjan Matser (Telemedicine and Advanced Technology Research Center, US Army Medical Research and Materiel Command, Frederick, MD, USA)
2013: “A performance comparison of SQL and NoSQL databases” (link)
Yishan Li and Sathiamoorthy Manoharan (Department of Computer Science, University of Auckland, New Zealand)
General considerations for data architecture design
- At the design phase, measurable and immeasurable metrics impact the choices we make to optimize the efficiency of the services.
Measurable metrics are:
a. The number of calls that each API makes to the DB
b. The number of complex queries in the APIs
c. The number of writes and reads that each API makes
Immeasurable metrics are:
a. The frequency with which each API is called
b. Whether the load on the DB will increase gradually or in spikes
c. The impact of data-locking operations on the efficiency of the system
These can only be uncovered through monitoring, and even then they remain unreliable because they depend on averaged data that changes daily.
- Where is most of the delay going to come from: the DB, the webserver, or 3rd-party APIs?
3rd-party API calls will be slower than DB or webserver functions and can only be optimized in the sense that they can be queued and called asynchronously.
- Reports slow down the entire system significantly, since they are often complicated to extract for analysis and consume a large share of the system's resources. Reporting data will always require special handling to limit its impact on overall system performance, and it is almost exclusively read-only.
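One common way to give reporting data that special handling, sketched here with hypothetical table and column names, is to pre-aggregate it into a dedicated summary table on a schedule so that report reads never touch the transactional tables. eg:
-- Hypothetical reporting table, refreshed nightly by a scheduled job
CREATE TABLE DailySignupReport (
ReportDate DATE PRIMARY KEY
,SignupCount INT NOT NULL
);
INSERT INTO DailySignupReport (ReportDate, SignupCount)
SELECT DATE(CreationDate), COUNT(*)
FROM User
GROUP BY DATE(CreationDate)
ON DUPLICATE KEY UPDATE SignupCount = VALUES(SignupCount);
Report queries then read only from the summary table, keeping their cost off the transactional workload.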
Luckily, we no longer have to choose just one database to house all of our data. The main types are SQL and NoSQL databases, but there is a great degree of variance among the available products. Add to that the different deployment options for each DB, such as master-slave setups, replication, sharding, and Hadoop-like services, and you have a plethora of options for deciding which data goes where.
Flavors of SQL: MySQL, Postgres, MariaDB, Oracle, SQL Server
Flavors of NoSQL: Couchbase, CouchDB, MongoDB, Cassandra, Azure DocumentDB and HBase
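As one example of those deployment options, a MySQL read replica only needs to be pointed at the primary and started. This is a minimal sketch, assuming GTID-based replication is enabled; the host and credentials below are placeholders. eg:
-- Run on the replica
CHANGE MASTER TO
MASTER_HOST='primary.example.com'
,MASTER_USER='repl'
,MASTER_PASSWORD='replica-password'
,MASTER_AUTO_POSITION=1;
START SLAVE;
Read-heavy queries and reports can then be directed at the replica while writes stay on the primary.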
Scale
Strictly speaking, you will have to consider all your DB options very carefully and almost all rules go out the window when it comes to very large data sets. This table is just a starting strategy to set a baseline for your design.
The important part is to monitor performance and be able to compare and test different configurations, so that when efficiency starts to drop you have reasoned alternatives to all-out panic. I recommend having no more than 5 basic load tests that measure the latency of your system on its core features under significant stress, such as many users trying to log in at the same time or many users trying to acquire a lock on the same record at the same time.
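For the lock-contention test specifically, a minimal sketch (using the hypothetical User table and an arbitrary record ID) is to run the same statements from two sessions at once; the second session blocks until the first commits or its lock wait timeout expires. eg:
-- Session 1: acquire and hold the row lock
START TRANSACTION;
SELECT * FROM User WHERE ID=9001 FOR UPDATE;
-- Session 2: same statements; blocks until session 1 issues COMMIT or ROLLBACK
START TRANSACTION;
SELECT * FROM User WHERE ID=9001 FOR UPDATE;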
Questions about the measurable metrics
Answer the following questions for your own project. I will answer them in a later section for my own project.
- What is the percentage of requests that run complex queries?
- What is the percentage of requests that run simple queries?
- What is the percentage of requests that lock tables?
- What is the percentage of requests that insert/update/delete from a single table?
- What is the percentage of requests that insert/update/delete from multiple tables?
DB Scenarios to Test
The following test cases should be run on a test environment that is similar to your expected production environment. The examples are written for MySQL.
The seed data should be as large as your expected number of records for the current architecture.
Lastly, the tests need to run through your backend code and not just directly against the DB, for more realistic results.
- Query for 1 record in a single table. eg:
SELECT * FROM User LIMIT 1;
- Insert 1 record into a single table. eg:
INSERT INTO User (Name, Email) VALUES ('Tom', 'tom@hardy.com');
- Query multiple tables in a single query for 1 record. eg:
SELECT *
FROM User u
JOIN Preference p ON p.UserID=u.ID
JOIN UserImage ui ON ui.UserID=u.ID
JOIN Image i ON i.UserID=u.ID
JOIN Block b ON b.PreferenceID=p.ID
LIMIT 1;
- Query for 10,000 records. eg:
SELECT * FROM User LIMIT 10000;
- Query for 10,000 records from 4 or more tables. eg:
SELECT *
FROM User u
JOIN Preference p ON p.UserID=u.ID
JOIN UserImage ui ON ui.UserID=u.ID
JOIN Block b ON b.PreferenceID=p.ID
LIMIT 10000;
- Query multiple tables in a single query with indexed and unindexed criteria for 1 record (see the EXPLAIN sketch after this list for checking which criteria actually use an index). eg:
SELECT *
FROM User u
JOIN Preference p ON p.UserID=u.ID
JOIN UserImage ui ON ui.UserID=u.ID
JOIN Image i ON i.UserID=u.ID
JOIN Block b ON b.PreferenceID=p.ID
WHERE u.Email='tom@hardy.com' -- indexed
AND p.InterestID=5 -- indexed
AND p.CreationDate='2016-10-14' -- unindexed
AND b.State=2; -- unindexed
- Insert into multiple tables. eg:
INSERT INTO User (Name, Email) VALUES ('Tom', 'tom@hardy.com');
INSERT INTO Preference (Name, UserID, State) VALUES ('Groceries', 12, 1);
INSERT INTO Image (Name, Source) VALUES ('Taurus', 'http://img.google.com/qn104m.jpg');
- Query multiple tables for 1 record and insert into other tables. eg:
SELECT *
FROM User u
JOIN Preference p ON p.UserID=u.ID
JOIN UserImage ui ON ui.UserID=u.ID
WHERE u.ID=13;
INSERT INTO Preference (Name, UserID, State) VALUES ('Pets', 13, 1);
- Select and lock multiple tables for 1 record and insert into one of them before releasing the lock. eg:
START TRANSACTION;
SELECT *
FROM User u
JOIN Preference p ON p.UserID=u.ID
JOIN UserImage ui ON ui.UserID=u.ID
WHERE u.ID=9001
FOR UPDATE;
INSERT INTO Preference (Name, UserID, State) VALUES ('Pets', 9001, 1);
COMMIT;
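For the scenario with indexed and unindexed criteria above, you can verify which criteria actually hit an index on your own schema by prefixing the same query with EXPLAIN; the possible_keys and key columns of the output show which indices the optimizer considered and chose. eg:
EXPLAIN SELECT *
FROM User u
JOIN Preference p ON p.UserID=u.ID
WHERE u.Email='tom@hardy.com' -- indexed criterion
AND p.CreationDate='2016-10-14'; -- unindexed criterion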
Use Case
A project is expected to start with 10^7 records, without a data spike upon its launch. After that, the rate of increase is expected to average 220,000 records per day for the first year.
The following is an obfuscated version of the full API list for this project and the tables that each API uses, grouped by Select, Update, and Insert statements. The percentage next to each API is its expected frequency of use.
- API1 (Used by all APIs) (24%)
Insert: TableF
- API2 (Used by most APIs) (17%)
Select: TableG, TableH
Update: TableH
- API3 (11%)
Select: TableL, TableM, TableJ, TableN, TableO, TableP, TableI, TableQ, TableR, TableE, TableS, TableT, TableU, TableV
Insert: TableW
- API4 (9%)
Update: TableA, TableB, TableC
Insert: TableD
- API5 (5%)
Select: TableE, TableI, TableK
- API6 (4%)
Insert: TableE
- API7 (4%)
Update: TableG
Insert: TableG
- API8 (4%)
Select: TableL, TableM, TableJ, TableN, TableO, TableP, TableI, TableQ, TableR, TableE, TableS, TableT, TableU, TableV
- API9 (4%)
Select: TableL, TableO, TableY
Insert: TableO
- API10 (4%)
Select: TableG, TableX, TableH
Update: TableG, TableX
Insert: TableG, TableX, TableH
- API11 (2%)
Select: TableI, TableJ
- API12 (2%)
Select: TableL, TableM, TableJ, TableN, TableO, TableP, TableI, TableQ, TableR, TableE, TableS, TableT, TableU, TableV, TableK
- API13 (2%)
Insert: TableM
- API14 (2%)
Select: TableL, TableO, TableY, TableQ, TableR, TableE
Insert: TableL
- API15 (2%)
Update: TableG
- API16 (2%)
Update: TableX
- API17 (1%)
Select: TableQ, TableJ, TableI, TableX
- API18 (1%)
Select: TableJ
Update: TableJ, TableG
Insert: TableJ
- API19 (0%)
Update: TableG
- API20 (0%)
Update: TableG
Analysis
The number of records in this DB starts at 10^7, with an increase of 220,000 per day for the first year:
10,000,000 + (365 x 220,000) = 90,300,000 = ~10^8 records in 1 year
The expected API frequencies for this project yield 135 writes for every 333 reads.
The following tables are the ones that will be written to. The number is the percentage of API requests:
TableF: 24%
TableH: 21%
TableG: 19%
TableW: 11%
TableX: 10%
TableA: 9%
TableB: 9%
TableC: 9%
TableD: 9%
TableE: 4%
TableO: 4%
TableJ: 2%
TableL: 2%
TableM: 2%
The following tables are the ones that will be read from:
TableI: 25%
TableE: 24%
TableL: 23%
TableG: 21%
TableH: 21%
TableJ: 20%
TableQ: 20%
TableR: 19%
TableM: 17%
TableN: 17%
TableP: 17%
TableS: 17%
TableT: 17%
TableU: 17%
TableK: 7%
TableY: 6%
TableX: 5%
Recommendations
This number is well below 10^9 records, which makes this a large DB but not a very large one. Our rate of increase is not expected to spike beyond what an SQL DB with memcache plus a NoSQL DB can handle for the first year, so the recommendation is an SQL DB with memcache, plus Couchbase.
We should set up special handling for the 2 most written-to tables (TableF at 24% and TableH at 21%), as anything above 20% should be separated from read tables when possible.
About the author
Abdallah Chamas is a software architect with 11 years of experience as of this writing (April 2017). He is a Scrum Master and a backend developer using Java, PHP, MySQL, Microsoft SQL Server, and Oracle. He has been a core developer of ApstrataDB and of several user-oriented technology startups.