
A Heuristic for Multiple Times Speed-up of Model Training

Published May 08, 2021

The first question that comes to mind is HOW? The answer is simple: reduce the number of datapoints. The more interesting question is how the datapoints are reduced, and our answer combines ideas from several fields, namely discrete mathematics, discrete optimization, and network architecture…

For data scientists, the idea of reducing datapoints (technically known as training set selection) seems to directly oppose the machine learning (ML) principle of generalization. Surprisingly, some of the most seminal scientific work on it was done two decades ago [1, 2], largely forced by the limited processing power of the time. Now that processing power has led to the success of ML, big data has turned the wheel in the opposite direction: we are finding it difficult to process/train on the plentiful data within feasible computational resources. A very simple way to reduce the datapoints is to select them randomly (technically known as simple random sampling (SRS)). But doing so can discard critical datapoints, which would otherwise have been support vectors (if the classification algorithm is SVM) or, more generally, part of the final hypothesis.
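To make that risk concrete, here is a small illustrative experiment (on a synthetic scikit-learn dataset, not the datasets used later) showing how SRS can throw away points that would have been support vectors of the SVM trained on the full data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# synthetic data, purely for illustration
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# support vectors of the model trained on all datapoints
full_sv = set(map(int, SVC(kernel="linear").fit(X, y).support_))

# SRS: keep a random 20% of the datapoints
rng = np.random.default_rng(0)
kept = set(map(int, rng.choice(len(X), size=len(X) // 5, replace=False)))

lost = full_sv - kept
print(f"{len(lost)} of {len(full_sv)} full-data support vectors discarded by SRS")
```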

Furthermore, balancing class sizes is important when training many types of classification algorithms. Class balancing comes as a byproduct of our method, whereas SRS in its naive form does not address it.

The following plot of prediction accuracies exemplifies the advantage of our method: models are trained after two sampling methods, namely reservoir sampling and SRS, and after our datapoint reduction/selection method, the graph shedding heuristic (GSH).
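For reference, a minimal sketch of reservoir sampling, one of the two baselines in the plot: it keeps a uniform random sample of k datapoints from a stream of unknown length.

```python
import random

def reservoir_sample(stream, k, seed=0):
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)          # uniform over [0, i], inclusive
            if j < k:
                reservoir[j] = item        # replace with decreasing probability
    return reservoir

print(reservoir_sample(range(100000), k=5))
```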

With the sampling methods shown to be ineffective, we present the technical outline of the entire procedure, which we will decipher step by step.

The procedure starts by clustering the datapoints (Step 1 in the flowchart), which is done for two major reasons (a minimal sketch follows the list):

  • The Big-O complexity of the remaining steps is reduced from the number of datapoints to the number of clusters.
  • Selecting via clusters gives a buffer zone of a few extra points, adding to the generalization of the procedure.
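A minimal sketch of Step 1, assuming mini-batch k-means as the clustering algorithm; the number of clusters and the dataset are illustrative choices, not the article's settings.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=20000, n_features=50, random_state=0)

n_clusters = 200   # hypothetical; far fewer clusters than datapoints
kmeans = MiniBatchKMeans(n_clusters=n_clusters, random_state=0).fit(X)

# Subsequent steps operate on cluster centroids rather than on individual
# datapoints, which is what shrinks the Big-O complexity of Steps 2 and 3.
centroids = kmeans.cluster_centers_
labels_per_cluster = [y[kmeans.labels_ == c] for c in range(n_clusters)]
print(centroids.shape)   # (200, 50)
```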

Step 2, the graph knitting step, is the most crucial: here the particular manner of reduction/selection that we have been hinting at is finally unfolded. In this step we construct a knowledge graph of the classification patterns of the data, with two key considerations,

  • Edges are sufficiently formed across all the classification patterns,
  • While controlling the skewness during the graph construction.

The first consideration is important because classes tend to cluster in real data (technically, the cluster assumption or the low-density separation assumption), hence we need to make sure that a situation like the one below does not arise,

This situation can be addressed by exclusively searching for neighbors of the opposite class; however, that can introduce artificial skewness into the procedure. For instance, the situation below is likely to occur,

Both situations are detailed in Parts I to III of Algorithm 1 in the original scientific article and are summarized in the high-level algorithm below. A rough code sketch of the graph-knitting idea follows.
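The sketch below is not the article's Algorithm 1; it only illustrates the two considerations under simplifying assumptions: each node is a cluster, each cluster is connected to its nearest cluster of the opposite class, and a cap on node degree serves as a crude stand-in for skewness control. All names and parameters are illustrative.

```python
import numpy as np
import networkx as nx

def knit_graph(centroids, cluster_classes, max_degree=5):
    """Connect each cluster to its nearest opposite-class cluster."""
    G = nx.Graph()
    G.add_nodes_from(range(len(centroids)))
    for i, (c, cls) in enumerate(zip(centroids, cluster_classes)):
        # candidate neighbours: clusters of the opposite class only
        others = [j for j, c2 in enumerate(cluster_classes) if c2 != cls]
        dists = [np.linalg.norm(c - centroids[j]) for j in others]
        for idx in np.argsort(dists):
            j = others[idx]
            if G.degree[j] < max_degree:   # crude skewness control
                G.add_edge(i, j, weight=dists[idx])
                break
    return G
```

Capping the degree is only one possible way to curb skewness; the article handles both considerations differently in Parts I to III of Algorithm 1.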

Once the classification patterns are fully captured, the edges are weighted (using a novel edge-weight scheme) to highlight the patterns, which can then be selectively chosen for training. This leads to Step 3, the GSH itself,
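The exact edge-weight scheme is given in the preprint; the following sketch only illustrates the shedding idea under an assumed weighting, where edges are ranked by weight and only the datapoints belonging to the endpoint clusters of the top-ranked edges are retained for training. The names (shed_graph, cluster_members, keep_fraction) are illustrative.

```python
def shed_graph(G, cluster_members, keep_fraction=0.2):
    """Return indices of datapoints kept for training (illustrative GSH step)."""
    # rank edges by weight (assumed here: smaller weight = stronger pattern)
    ranked = sorted(G.edges(data="weight"), key=lambda e: e[2])
    n_keep = max(1, int(keep_fraction * G.number_of_edges()))
    selected = set()
    for u, v, _ in ranked[:n_keep]:
        selected.update((u, v))
    # reduced training set = indices of datapoints in the selected clusters
    return [idx for c in selected for idx in cluster_members[c]]
```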

The Findings

We applied Steps 1 to 3 to reduce data from two sources,

  • An unstructured text dataset, available at doi:10.5281/zenodo.3355823.
  • An image dataset, available at doi:10.5281/zenodo.3356917.

Note that feature engineering is key for both data types: word2vec embeddings for the text dataset and VGG-16 features for the image dataset (a sketch of the latter follows).
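As one concrete example, a minimal sketch of the image-side feature extraction, assuming the Keras VGG16 backbone without its classification head; flattening its 7×7×512 convolutional output yields the 25,088 features reported in the last table below.

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

# VGG-16 without its classification head; for 224x224 inputs the convolutional
# output is 7x7x512, which flattens to 25,088 features per image
backbone = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

def image_features(images):
    # images: array of shape (n, 224, 224, 3) with pixel values in [0, 255]
    x = preprocess_input(np.asarray(images, dtype="float32"))
    return backbone.predict(x).reshape(len(images), -1)   # shape (n, 25088)
```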

For the first dataset, with CNN classifiers from the TensorFlow library, clear scaling is evident for GSH,

| Fraction of #datapoints | # features | Training time, TensorFlow (s) | Training time, GSH (s / speed-up) | Prediction accuracy, TensorFlow (%) | Prediction accuracy, GSH (%) |
| --- | --- | --- | --- | --- | --- |
| 0.2 | 10,621 | 2,603.82 | 1,201.11 / 2.2× | 98.54 | 98.45 |
| 0.5 | 15,613 | 11,851.98 | 1,664.38 / 7.1× | 99.13 | 99.13 |

For the SMO implementation of SVM from the LIBSVM library, a similar scaling is observed, along with a slight improvement in prediction accuracy,

| word2vec input dimension | Training time, LIBSVM (s) | Training time, GSH (s / speed-up) | Prediction accuracy, LIBSVM (%) | Prediction accuracy, GSH (%) |
| --- | --- | --- | --- | --- |
| 500 | 100.88 | 12.06 / 8.3× | 98.04 | 98.79 |
| 1,000 | 201.33 | 21.64 / 9.3× | 98.12 | 98.91 |

For the second dataset, with the VGG-16 feature generator coupled with the LIBSVM classifier, clear scaling is again evident,

| # features | Training time, LIBSVM (s) | Training time, GSH (s / speed-up) | Prediction accuracy, LIBSVM (%) | Prediction accuracy, GSH (%) |
| --- | --- | --- | --- | --- |
| 25,088 | 328.39 | 70.51 / 4.7× | 99.16 | 99.16 |

So far, we have applied graph algorithms to gain speed-ups of up to 10 times; however, applying the remaining aforementioned fields, namely discrete optimization and network architecture, yields speed-ups of roughly 40 times. That is the topic of the concluding post in this two-part blog series. Details on the algorithms, the edge-weight scheme, the test set-ups, etc. can be found in the preprint, available at https://arxiv.org/abs/1907.10421. Code is available at www.gstech.xyz/projects/gr_heu.

References

  1. Levy AY, Fikes RE, Sagiv Y (1997) Speeding up inferences using relevance reasoning: A formalism and algorithms. Artif. Intell. 97:83-136. doi:10.1016/S0004-3702(97)00049-0.
  2. Blum AL, Langley P (1997) Selection of relevant features and examples in machine learning. Artif. Intell. 97:245-271. doi:10.1016/S0004-3702(97)00063-5.
  3. Chang CC, Lin CJ (2011) LIBSVM: A Library for support vector machines. ACM Trans. Intell. Syst. Technol. 2:27:1-27:27. doi:10.1145/1961189.1961199.
  4. Muja M, Lowe DG (2014) Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence 36:2227-2240. doi:10.1109/TPAMI.2014.2321376.
  5. Chollet F and others (2015) Keras. https://keras.io. Accessed on 27 July 2019.
  6. Abadi M and others (2015) TensorFlow: Large-scale machine learning on heterogeneous systems. https://www.tensorflow.org/. Accessed on 27 July 2019.