My Thoughts on Synthetic Data
The Idea
Perform an analysis gauging whether synthesizing data offers an improvement over relying on a limited amount of "organic" data.
Motivation
I am quite skeptical of the effectiveness of synthetically generated data, since a predictive model can only be as good as the dataset used to train it. That skepticism sparked a desire to put these intuitions to an objective test.
Prerequisite Knowledge
Readers are expected to have an intermediate understanding of machine learning concepts and should already be familiar with the following topics in order to get the most out of this article:
- Basic statistics knowledge, such as the meaning of the term “standard deviation”
- Familiarity with Neural Networks, SVM, and Decision Trees (if you’re only familiar with one or two of these, that’ll probably be okay)
- An understanding of basic machine learning terminology, such as the meaning of "train/test/validate split"
Background on Synthetic Data
The two common methods of generating synthetic data are:
- Drawing values according to some distribution or collection of distributions
- Agent-based modelling
In this study, we are going to be inspecting the first category. To help solidify this idea, let's start with an example!
Imagine you are trying to determine whether an animal is a mouse, a frog, or a pigeon given its measured size and weight, but you only have a dataset with measurements for two of each animal. Unfortunately, we won't be able to train a very good model with such a small dataset!
An answer to this problem is synthesizing more data by estimating the distributions of these features. Let's start with the frog for example (referencing this Wikipedia article and only considering adult frogs).
The first feature, their length (7.5cm ± 1.5cm on average), could be generated by drawing values from a normal distribution with a mean of 7.5 and a standard deviation of 1.5. A similar technique could be used to generate their weight. However, the information we have does not include the typical range of their weight, just that the average is 22.7g. One idea would be to use an arbitrary standard deviation of 10% of the mean (2.27g). Unfortunately, that is pure speculation and so is likely inaccurate.
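As a quick, hypothetical sketch, drawing such values with NumPy might look like the following (the 10% standard deviation for weight is the speculative assumption discussed above, and the sample count is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_samples = 100  # arbitrary number of synthetic frogs

# Length: mean 7.5 cm, standard deviation 1.5 cm (from the cited figures)
lengths = rng.normal(loc=7.5, scale=1.5, size=n_samples)

# Weight: mean 22.7 g with the assumed 10% standard deviation (2.27 g)
weights = rng.normal(loc=22.7, scale=2.27, size=n_samples)

synthetic_frogs = np.column_stack([lengths, weights])
```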
Given the availability of reliable information about these characteristics, and since it is quite easy to distinguish between these species based on these features, this would likely be enough to train a good model. However, when you move to more nuanced systems with more features that you know less about, synthesizing useful data becomes significantly more difficult.
The Data
This analysis uses the same idea as the analogy discussed above. We will create a few datasets with 10 features. These datasets will contain two different classification categories with an equal number of samples for each category.
"Organic" Data
Each category will follow certain normal distributions for each of its features. For example, the first feature will have a mean of 1500 and a standard deviation of 360 for samples from the first category and a mean of 1300 and a standard deviation of 290 for samples from the second category. The distributions for the rest of the features can be found below:
Feature | Category 1 Mean | Category 1 Std. Dev. | Category 2 Mean | Category 2 Std. Dev. |
---|---|---|---|---|
1 | 1500 | 360 | 1300 | 290 |
2 | 5924 | 1200 | 5924 | 1500 |
3 | 9.2 | 3 | 12.1 | 3.8 |
4 | 29 | 12 | 68 | 15 |
5 | 18 | 4.2 | 18.5 | 4.0 |
6 | 62.4 | 11.1 | 73.2 | 13 |
7 | 0.42 | 0.02 | 0.42 | 0.03 |
8 | 635 | 76 | 840 | 89 |
9 | 0.052 | 0.021 | 0.063 | 0.027 |
10 | 87.01 | 25 | 87.03 | 25 |
That table is pretty dense but it can be summed up as having:
- Four features that are virtually indistinguishable between classes,
- Four features that will have significant overlap but should be distinguishable in some cases, and
- Two features that will only have some overlap and should usually be distinguishable.
Two of these datasets will be created: one 1000-sample dataset reserved as the validation set, and another 1000-sample dataset used for training/testing.
This creates a dataset which makes classification just tough enough.
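As a rough sketch of how such a dataset could be generated (the real code for this project is in the repository linked later in this article; the function and variable names here are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# (mean, std. dev.) for each of the 10 features, per category (taken from the table above)
params_cat1 = [(1500, 360), (5924, 1200), (9.2, 3), (29, 12), (18, 4.2),
               (62.4, 11.1), (0.42, 0.02), (635, 76), (0.052, 0.021), (87.01, 25)]
params_cat2 = [(1300, 290), (5924, 1500), (12.1, 3.8), (68, 15), (18.5, 4.0),
               (73.2, 13), (0.42, 0.03), (840, 89), (0.063, 0.027), (87.03, 25)]

def make_organic_dataset(n_per_class):
    """Draw n_per_class samples for each category from its normal distributions."""
    X1 = np.column_stack([rng.normal(m, s, n_per_class) for m, s in params_cat1])
    X2 = np.column_stack([rng.normal(m, s, n_per_class) for m, s in params_cat2])
    X = np.vstack([X1, X2])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y

# One 1000-sample set for training/testing and another 1000-sample set reserved for validation
X_train_test, y_train_test = make_organic_dataset(500)
X_val, y_val = make_organic_dataset(500)
```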
Synthetic Data
Now's where it starts getting interesting! The synthetic data will follow one of two custom distributions. The first is what I'll refer to as the "Spikes Distribution." This distribution will only allow the synthetic features to take on a handful of discrete values with certain probabilities for each value. For example, if the original distribution had a mean of 3 and a standard deviation of 1, you might have a spike at 2 (27%), 3 (46%), and 4 (27%).
The second custom distribution is what I'll refer to as the "Plateaus Distribution." This distribution is simply a piecewise-uniform distribution. The probability of each plateau is derived from the probability density of the normal distribution at the center of that plateau. Any number of spikes or plateaus can be used, and as you add more, the distribution will more closely resemble the normal distribution.
To help with clarity, the following illustration depicts these distributions:
(Note: the graph of the spikes distribution is not a Probability Density Function)
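To make these two distributions more concrete, here is a rough sketch of how sampling from them could be implemented. The placement of the spikes/plateaus (evenly spaced within one standard deviation of the mean) and the plateau widths are my own assumptions, chosen so that the three-spike case roughly reproduces the 2/3/4 example above:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=7)

def sample_spikes(mean, std, n_spikes, size):
    """Spikes: a handful of discrete values, each weighted by the normal density at that value."""
    centers = np.linspace(mean - std, mean + std, n_spikes)  # placement is an assumption
    probs = norm.pdf(centers, loc=mean, scale=std)
    probs /= probs.sum()
    return rng.choice(centers, size=size, p=probs)

def sample_plateaus(mean, std, n_plateaus, size):
    """Plateaus: contiguous uniform blocks, each weighted by the normal density at its center."""
    centers = np.linspace(mean - std, mean + std, n_plateaus)  # assumes n_plateaus >= 2
    probs = norm.pdf(centers, loc=mean, scale=std)
    probs /= probs.sum()
    chosen = rng.choice(centers, size=size, p=probs)
    half_width = (centers[1] - centers[0]) / 2  # blocks tile the covered range
    return chosen + rng.uniform(-half_width, half_width, size=size)

# The mean-3, std-1 example from the text: three spikes land at 2, 3, and 4
# with probabilities of roughly 27%, 45%, and 27%
spike_samples = sample_spikes(3, 1, n_spikes=3, size=800)
plateau_samples = sample_plateaus(3, 1, n_plateaus=3, size=800)
```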
The process of synthesizing data for this problem will make one very important assumption which errs in favor of making the synthetic data more closely resemble the "organic" data: that the true mean and standard deviation of each feature/category pair are known. In reality, if the synthetic data were too far off from these values, it could severely impact the accuracy of the trained model.
Okay, but why use these distributions? How do they map to reality?
I'm glad you asked! In a limited dataset, you may be able to notice that for a certain category, a feature takes on a handful of values. Imagine these values were:
(50, 75, 54, 49, 24, 58, 49, 64, 43, 36)
Or if we sort the list:
(24, 36, 43, 49, 49, 50, 54, 58, 64, 75)
In order to generate data for this feature, you could split it into three sections: the smallest 20%, the middle 60%, and the largest 20%. You could then compute each section's mean and standard deviation: about (30, 6.0), (50.5, 4.6), and (69.5, 5.5), respectively. If a section's standard deviation is fairly low, say about 10% or less of its mean, you could treat that section as a spike at its mean. Otherwise, you could treat it as a plateau centered at the section's mean with a width of twice the section's standard deviation.
Or, in other words, they do a decent job of emulating imperfect data synthesis.
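A minimal sketch of that 20/60/20 heuristic (the helper name and the exact threshold handling are mine; the 10% rule of thumb comes from the text above):

```python
import numpy as np

def summarize_sections(values, spike_threshold=0.10):
    """Split a small sample into its smallest 20%, middle 60%, and largest 20%,
    then decide whether each section looks more like a spike or a plateau."""
    values = np.sort(np.asarray(values, dtype=float))
    k = max(1, len(values) // 5)
    sections = [values[:k], values[k:len(values) - k], values[len(values) - k:]]
    summary = []
    for section in sections:
        mean, std = section.mean(), section.std()
        if std <= spike_threshold * abs(mean):
            summary.append(("spike", mean))  # low spread: a spike at the section mean
        else:
            summary.append(("plateau", mean - std, mean + std))  # width of 2 * std, centered on the mean
    return summary

print(summarize_sections([50, 75, 54, 49, 24, 58, 49, 64, 43, 36]))
# Section (mean, std. dev.) pairs come out near (30, 6.0), (50.5, 4.6), (69.5, 5.5);
# the first section's spread is about 20% of its mean, so it becomes a plateau,
# while the other two fall under the 10% threshold and become spikes.
```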
Two 800-sample datasets will be created using these distributions — one using the spikes and another using the plateaus. The models will be trained using four different datasets in order to compare the usefulness of each:
- Full - The full 1000-sample organic dataset (used to get an idea of an upper limit)
- Real - Only 20% (200 samples) of the organic training/testing dataset (simulating the situation without synthetic data)
- Spikes - The "Real" dataset combined with the 800-sample spikes dataset (1000 samples total)
- Plateaus - The "Real" dataset combined with the 800-sample plateaus dataset (1000 samples total)
Now for the exciting part!
The Learning
In order to test the strength of each of these datasets, three different machine learning techniques were used: Multi-Layer Perceptrons (MLPs), Support Vector Machines (SVMs), and Decision Trees. To assist the learning, since certain features are much larger in magnitude than others, feature scaling was leveraged to normalize the data. The hyperparameters of the various models were tuned using grid search to maximize the probability of arriving at the strongest set of hyperparameters.
In all, 24 different models were trained across eight different datasets in order to get an idea of what effect the synthetic data had on learning. Feel free to check out the code here!
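For illustration, a hedged sketch of what that setup could look like with scikit-learn follows; the parameter grids are placeholders rather than the grids actually searched:

```python
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical parameter grids; the grids actually searched for this study may differ
models = {
    "MLP": (MLPClassifier(max_iter=2000), {"clf__hidden_layer_sizes": [(32,), (64, 32)]}),
    "SVM": (SVC(), {"clf__C": [0.1, 1, 10], "clf__kernel": ["rbf", "linear"]}),
    "Decision Tree": (DecisionTreeClassifier(), {"clf__max_depth": [3, 5, 10, None]}),
}

def evaluate(X, y, X_val, y_val):
    """Grid-search each model on a train/test split of X (which could be the 'Full',
    'Real', or a Real-plus-synthetic dataset), then report validation accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
    for name, (estimator, grid) in models.items():
        # Feature scaling lives inside the pipeline so it is fit only on the training folds
        pipe = Pipeline([("scale", StandardScaler()), ("clf", estimator)])
        search = GridSearchCV(pipe, grid, cv=5).fit(X_tr, y_tr)
        print(name, "test acc:", search.score(X_te, y_te), "val acc:", search.score(X_val, y_val))
```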
The Results
Drumroll Please...
After many hours tuning hyperparameters and scribbling down accuracy measurements, some counter-intuitive results emerged! The full set of results can be found in the tables below:
MLP
Full | Real | Spike 9 | Plateau 9 | Spike 5 | Plateau 5 | Spike 3 | Plateau 3 |
---|---|---|---|---|---|---|---|
97.0 | 95.3 | 97.8 | 95.8 | 97.7 | 98.0 | 95.3 | 97.4 |
SVM
Full | Real | Spike 9 | Plateau 9 | Spike 5 | Plateau 5 | Spike 3 | Plateau 3 |
---|---|---|---|---|---|---|---|
97.3 | 96.3 | 97.2 | 97.5 | 97.2 | 97.1 | 95.9 | 97.3 |
Decision Tree
Full | Real | Spike 9 | Plateau 9 | Spike 5 | Plateau 5 | Spike 3 | Plateau 3 |
---|---|---|---|---|---|---|---|
94.8 | 91.1 | 90.4 | 94.5 | 91.7 | 94.7 | 85.1 | 93.2 |
In these tables, "Spike 9" or "Plateau 9" refer to the distribution and the number of spikes/plateaus used. The values in the cells are the resulting accuracies of the models that were trained/tested using the corresponding train/test data and then validated on the validation data. Remember also that the "Full" category should be a theoretical ceiling on the accuracy, and the "Real" category is the baseline we could achieve without synthetic data.
An important note is that the training/testing accuracies for (nearly) every trial were significantly higher than the validation accuracies. For example, even though the MLP scored 97.7% on Spike 5, it scored 100% and 99% respectively on the train/test data for that same trial. This can lead to significant over-estimation of a model's effectiveness when it is used in the real world. The full set of these measurements can be found on the GitHub for this project.
Let's take a closer look at these results.
First, let's look at inter-model trends (i.e. the effects of the type of synthetic dataset across all machine learning techniques). It seems that adding more spikes/plateaus did not necessarily help learning. You can generally see an improvement between 3 and 5 spikes/plateaus, but that improvement either flattens out or slightly dips between 5 and 9.
To me, this seems counter-intuitive. I expected to see a steady improvement as more spikes/plateaus were added, since adding more brings the distribution closer to the normal distribution that generated the "organic" data.
Now, let's look at intra-model trends (i.e. the effects of various synthetic datasets on specific machine learning techniques). For the MLPs, it seemed to be hit or miss as to whether spikes or plateaus would lead to better performance. For the SVMs, both spikes and plateaus seemed to perform about equally well. For the Decision Trees however, plateaus are a clear winner.
Overall though, the synthetic datasets, particularly the plateaus, delivered clear improvements over the "Real" baseline in most trials!
Future Work
An important caveat is that the results from this article, while useful in their own right, are still fairly speculative. A few more angles must be analyzed before any definitive conclusions can be drawn.
One assumption that was made is that each class has only one "type", but this is not always true in the real world. For example, a Doberman and a Chihuahua are both dogs, but their weight distributions would look very different.
Further, this is essentially just a single type of dataset. Another facet that should be considered is running similar experiments on datasets with feature spaces of different dimensions. That could mean having 15 features instead of 10, or datasets that emulate images, for example.
I do plan on continuing this research to expand the scope of the study, so stay tuned!
About the Author
Eric earned his Bachelor's degree in software engineering and his Master's degree in Machine Learning. He is currently working as a Machine Learning engineer in Toronto, Canada. He has worked on problems related to NLP, Computer Vision, and business intelligence systems using LSTMs, CNNs, decision tree ensembles, SVMs, and many more!
If you want to learn more about him, explore his site.