Basic Performance Optimization Techniques for R
In the last few decades, demand for computing power has steadily increased as data volumes have grown and models have become more complex.
Minimizing the time needed for calculations has therefore become an important task, and it is not uncommon for the execution time of an R program to be measured in hours or even days.
As the volume of data to be analyzed increases, the execution time can become prohibitively long, and data scientists often get stuck with these bottlenecks.
When this happens, and if they don't know much about performance optimization in R, they will probably settle for a reduced amount of data, which can hinder their analysis.
However, fear not: R programs can be slow, but well-written R programs are usually fast enough. In this post, I will show you a couple of basic techniques you can use to make your R implementations faster.
If you would like to understand more about why R can be slow sometimes and how to deal with it, as well as more advanced topics on performance optimization than what I'll show here, you can read my recent book, "R Programming by Example." If you do read it, be sure to send me any comments or questions you may have! I appreciate them.
Should I optimize for performance?
First of all, it's a well-known fact that an inefficient algorithm implemented efficiently can be orders of magnitude slower than an efficient algorithm implemented inefficiently.
This means that most of the time algorithm selection is much more important than implementation optimization. Having said that, let's assume you selected or designed a good enough algorithm.
Now, before you do any performance optimization on your implementation, ask yourself, "Should I invest the time to optimize the implementation?" Performance optimization can be a very costly activity. Your time and your client's resources are valuable, and they are probably better spent on something else.
Let's say that for some reason you must make your implementation faster. The first thing you must decide now is how fast is fast enough.
Is your algorithm required to simply finish within hours instead of days, or do you need it to come down to microsecond levels? Is this an absolute requirement, or should you simply do the best job you can within a specific timeframe?
These are important questions that you must consider before optimizing your implementation, and sometimes the solution is not even optimization.
It's not rare for clients to prefer spending more money on some type of cloud resource to tackle the performance problem rather than spending your valuable time optimizing an algorithm's performance — especially if you can provide more business value doing something else.
What are Simple Moving Averages?
The algorithm we will work with is called Simple Moving Average (SMA). It's a very well-known tool for doing technical analysis of time-series, especially for financial markets and trading.
The idea behind SMA is that you will compute an average at each point in time by looking back at a predefined number of periods. For example, let's say you're looking at a minute-by-minute time-series, and you are going to compute an SMA(30). That means that at each observation in your time-series, you will take the observations that correspond to the previous 30 minutes, and will save the average of those 30 observations as the SMA(30) value for that point in time.
[Figure: subsets used for the SMA(3) and SMA(4) non-NA values of a monotone time-series]
SMAs have three basic properties.
First, SMAs are also a time-series, and they have the same number of observations as the original time-series.
Second, the first n - 1 elements in an SMA(n) are NA, because at each of those points in time there are not enough observations backwards to compute the required average.
Third, note that every time we move one time-unit forward, we add one observation at the front of the current subset, and we remove one observation from its back, as the image above shows.
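To make these properties concrete, here is a tiny illustration with made-up numbers (this is not part of the post's data or benchmarks): an SMA(3) over a six-point series has the same length as the series, and its first two values are NA.
prices <- c(10, 11, 12, 13, 14, 15)
# SMA(3): average the current observation and the two before it
c(NA, NA, mean(prices[1:3]), mean(prices[2:4]), mean(prices[3:5]), mean(prices[4:6]))
#> [1] NA NA 11 12 13 14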
Our first, very inefficient, attempt at an SMA
Before we look at our first implementation, a few words about the data we will use. The data comes from a very rough simulation for Bitcoin and Litecoin prices.
The actual data file can be downloaded here, and it's a tabular structure where each observation is a price point with the following variables: timestamp (string in YYYY-MM-DD-HH-mm format), price_usd (float), name (string with either "Bitcoin" or "Litecoin"), and symbol (string with either "BTC" or "LTC").
All of our SMA implementations for this post will receive the following parameters: period to specify how many observations to use for the SMA, symbol to denote the asset we want to make the calculations for (either "BTC" for Bitcoin or "LTC" for Litecoin), and the actual data, which is assumed to have the structure defined in the previous paragraph.
Furthermore, all implementations in this post make two assumptions about data: the timestamp column is in increasing order, and there are no gaps in the time-series, meaning that for every minute we actually have price data.
This allows us to skip any ordering procedures and to avoid checking whether the SMA should contain NA's internally when no data is available. Both of these assumptions hold for the simulated data linked above.
Now, let's turn to the actual implementations. Our first, and very inefficient, SMA implementation is sma_1(), and it is programmed in a style that may look fine in other languages, like Fortran or C++, but not in R.
First, let's define what some of the objects represent. The end integer denotes the end, or right extreme, of the SMA interval under consideration. The position integer, which is the same as end every time we start the loop, denotes the actual observation we're currently using within the subset of observations for any given average. The sma real will contain the actual SMA for the end position. The n_accumulated integer keeps track of the number of observations we have accumulated for a given subset, and is used to know if we have enough of them (meaning period of them). Finally, the period_prices data frame contains a single column to store the prices for the current SMA calculation.
Second, note that since we have mixed data for Bitcoin and Litecoin prices, we must check whether the observation at the current end corresponds to the symbol we're interested in. If it doesn't, we simply ignore that iteration, but if it does, then we accumulate period_prices going backwards until the number of accumulated prices is equal to the period, or the current position is equal to 1, meaning that we're at the beginning of the time-series. Finally, note that the position integer is decremented regardless of whether the observation was useful or not, so that we don't get stuck.
Third, and finally, after the while-loop is finished, we know that we either have accumulated a number of prices equal to the period we are interested in, or we encountered the beginning of the time-series. In the first case, we compute the mean of those prices by iterating over the period_prices data frame and assign it as the sma value for the current end position. In the second case, we simply record an NA, since we were unable to compute the full average.
sma_1 <- function(period, symbol, data) {
    result <- data.frame(sma=numeric())
    for(end in 1:nrow(data)) {
        position <- end
        sma <- NA
        n_accumulated <- 0
        period_prices <- data.frame(price=numeric())
        if (data[end, "symbol"] == symbol) {
            while(n_accumulated < period & position >= 1) {
                if (data[position, "symbol"] == symbol) {
                    period_prices <- rbind(
                        period_prices,
                        data.frame(price=data[position, "price_usd"])
                    )
                    n_accumulated <- n_accumulated + 1
                }
                position <- position - 1
            }
            if (n_accumulated == period) {
                sma <- 0
                for (price in period_prices$price) {
                    sma <- sma + price
                }
                sma <- sma / period
            } else {
                sma <- NA
            }
            result <- rbind(result, data.frame(sma=sma))
        }
    }
    return(result)
}
If the implementation seems complicated, it's because it is. As we start improving our code, it will naturally become simpler, which will make it easier to understand.
We will take the first 100 observations, which correspond to 50 minutes of Bitcoin price action, and calculate the SMA(10) for them. Remember that according to our data assumptions, these 100 observations are divided into 50 for Bitcoin and 50 for Litecoin.
data_original <- read.csv("./data.csv")
data <- data_original[1:100, ]
symbol <- "BTC"
period <- 10
s1 <- sma_1(period, symbol, data)
s1
#> sma
#> 1 NA
#> ... NA
#> 9 NA
#> 10 7999.910
#> 11 7999.231
#> 12 8000.703
#> 13 7999.489
#> 14 7999.924
#> 15 7999.360
#> 16 8000.284
#> 17 7999.978
#> 18 7999.306
#> 19 7999.508
#> 20 7998.562
(Truncated output)
I removed results 2 through 8 to save space because, as we know at this point, they are all NA values. Everything seems to be working fine, so now we need to measure how much time it takes to execute this code.
To measure it, we use the microbenchmark() function from the microbenchmark package and specify microseconds as our unit. This will execute our SMA code 100 times and provide statistics about how much time it took, which can be extracted with the summary() function. In this case, we are interested in the median, so that's what we ask for.
library(microbenchmark)
performance <- microbenchmark(s1 <- sma_1(period, symbol, data), unit = "us")
summary(performance)$median
#> [1] 142136
The median time for our sma_1() implementation was 142,136 microseconds, which is our base case for comparison. Let's say our goal is to get it close to 200 microseconds. In the following sections we will see how to achieve this by making a couple of simple changes.
Move checks out of iterative processes
Let's start with the obvious: move checks out of iterative processes. In this case, we are unnecessarily checking, at every iteration, whether we are working with the symbol we're interested in. To make it worse, we do it in two different places.
Believe it or not, I often see code that can be made faster simply by moving checks out of iterative processes, as is the case here. To fix this, we introduce a filter at the beginning of the function that keeps only the observations with the correct symbol. After doing so, we can be sure that all observations have the correct symbol.
This removes two indentation levels from our code, since those checks were nested. Doing so feels great, doesn't it? We now have a much simpler implementation that intuitively should perform much better.
sma_2 <- function(period, symbol, data) {
    data <- data[data$symbol == symbol, ]
    result <- data.frame(sma=numeric())
    for(end in 1:nrow(data)) {
        position <- end
        sma <- NA
        n_accumulated <- 0
        period_prices <- data.frame(price=numeric())
        while(n_accumulated < period & position >= 1) {
            period_prices <- rbind(
                period_prices,
                data.frame(price=data[position, "price_usd"])
            )
            n_accumulated <- n_accumulated + 1
            position <- position - 1
        }
        if (n_accumulated == period) {
            sma <- 0
            for (price in period_prices$price) {
                sma <- sma + price
            }
            sma <- sma / period
        } else {
            sma <- NA
        }
        result <- rbind(result, data.frame(sma=sma))
    }
    return(result)
}
Now, it wouldn't be a correct comparison if we measured implementations that do different things, would it? Checking that our implementations produce the same results is useful for increasing our confidence that every change we make does not introduce unexpected behavior. It's very similar to unit testing.
Also, as you probably know, with real numbers you're better off checking that they are close enough rather than exactly the same. If you check for identical numbers, you may get a FALSE result because one of the numbers differs by 0.000000001, which is insignificant in our case. In this case, we specify 0.001 as our threshold and test that each pair of numbers has an absolute difference lower than or equal to that threshold.
As can be seen in the comparison, all values are the same, so we can proceed with confidence.
performance <- microbenchmark(s2 <- sma_2(period, symbol, data), unit = "us")
all(abs(s1$sma - s2$sma) <= 0.001, na.rm = TRUE)
#> TRUE
summary(performance)$median
#> [1] 97514.58
Our intuition was correct: the median time for sma_2() was 97,514.58 microseconds, which is around 68% of the median execution time for sma_1(). That's a significant improvement for just getting checks out of iterative processes.
Use the simplest data structure possible
Many R users would agree that data frames are basic tools for data analysis. They are an intuitive way to represent a typical dataset with rows and columns representing observations and variables, and they can be very flexible. However, their convenience comes with a performance cost that people often don't mention.
As you can see in sma_3(), the code is practically the same as sma_2(), except that the result and period_prices objects are no longer data frames. Instead, they have become vectors. We are still dynamically growing them by repeatedly calling the c() function, which is something you shouldn't be doing in performant code (there is a short aside on this right after the code below), but we will take it step by step.
sma_3 <- function(period, symbol, data) {
    data <- data[data$symbol == symbol, ]
    result <- NULL
    for(end in 1:nrow(data)) {
        position <- end
        sma <- NA
        n_accumulated <- 0
        period_prices <- NULL
        while(n_accumulated < period & position >= 1) {
            period_prices <- c(period_prices, data[position, "price_usd"])
            n_accumulated <- n_accumulated + 1
            position <- position - 1
        }
        if (n_accumulated == period) {
            sma <- 0
            for (price in period_prices) {
                sma <- sma + price
            }
            sma <- sma / period
        } else {
            sma <- NA
        }
        result <- c(result, sma)
    }
    return(result)
}
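As a quick aside (this is a generic sketch, not one of the post's sma_* versions), here is the difference between growing a vector with c() on every iteration and pre-allocating it once; the growing version has to reallocate and copy the whole vector at every step:
grow <- function(n) {
    out <- NULL
    for (i in 1:n) {
        out <- c(out, i)  # reallocates and copies the vector each time
    }
    out
}
prealloc <- function(n) {
    out <- numeric(n)  # memory reserved once, up front
    for (i in 1:n) {
        out[i] <- i
    }
    out
}
# Benchmarking both, e.g. microbenchmark(grow(10000), prealloc(10000)),
# should show the pre-allocated version being much faster.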
We proceed to check for correctness and benchmark. However, we must make a small change, since sma_1() returns a data frame but sma_3() returns a vector. We need to make sure we're comparing using the correct syntax, as shown below.
performance <- microbenchmark(s3 <- sma_3(period, symbol, data), unit = "us")
all(abs(s1$sma - s3) <= 0.001, na.rm = TRUE)
#> TRUE
summary(performance)$median
#> [1] 4425.707
As we can see, the median time for sma_3() was 4,425.707 microseconds. Removing those data frame structures allowed us to take only around 4% of the previous median time! That's amazing for such an easy change, isn't it? But we can do better.
Vectorize arithmetic operations
Vectorization means removing a manual looping mechanism in favor of operations optimized to do the same thing without the need for explicit loops. Vectorizing is a fundamental tool in R, and you should get used to using it instead of explicit loops whenever possible.
Tip: explicit loops may be efficient in other languages, like Fortran and C++. However, in R you're better off using vectorization most of the time.
There are various ways of vectorizing operations. A common use for vectorization is matrix or vector arithmetic: instead of iterating through the elements of different vectors or matrices and multiplying and adding them accordingly, you can simply use the %*% (matrix multiplication) operator.
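For instance, here is a minimal sketch (unrelated to the SMA code) comparing a dot product computed with an explicit loop against the same computation using %*%:
x <- runif(100000)
y <- runif(100000)
dot_loop <- function(x, y) {
    total <- 0
    for (i in seq_along(x)) {
        total <- total + x[i] * y[i]
    }
    total
}
dot_loop(x, y)       # explicit loop over every element
as.numeric(x %*% y)  # vectorized: a single call that produces the same number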
Another way of vectorizing code is to replace loops with R built-in functions, and that's what we will use here.
The while-loop, in combination with the if-conditional that follows it, is used to keep track of the number of accumulated period_prices using the n_accumulated integer, and applies the necessary arithmetic operations to calculate the required average. This is cumbersome, and we can do much better.
To fix it, we create a start position equal to the end position minus the period, plus one so that the indexes match correctly. Then, instead of iterating through accumulated prices, we simply extract the range of values from start to end and take their average using the vectorized mean() function.
Note that we still need to check whether there are enough observations backwards, which we do by checking that the start position is greater than or equal to one.
sma_4 <- function(period, symbol, data) {
    data <- data[data$symbol == symbol, ]
    result <- NULL
    for(end in 1:nrow(data)) {
        start <- end - period + 1
        if (start >= 1) {
            sma <- mean(data[start:end, "price_usd"])
        } else {
            sma <- NA
        }
        result <- c(result, sma)
    }
    return(result)
}
As you may have noted, our code is becoming more expressive as we continue making it simpler. When programming with R, this happens often when vectorization comes into play, which is another advantage of using it.
Note that this change would not have been possible if we had not filtered the data at the top of the function: otherwise, observations for different symbols would be mixed together, and our start:end interval would pick up observations belonging to the other symbol.
This goes to show that optimizations sometimes depend on one another: one can't be applied without having applied a previous one, and these relations are often found accidentally.
Again, we check for correctness and benchmark.
performance <- microbenchmark(s4 <- sma_4(period, symbol, data), unit = "us")
all(abs(s1$sma - s4) <= 0.001, na.rm = TRUE)
#> TRUE
summary(performance)$median
#> [1] 561.0705
Now, the median time for sma_4() was 561.0705 microseconds, which is only about 12% of the previous best value. As you can see, using vectorized arithmetic not only makes our code easier to read, but also makes it faster. Cool, huh?
Before we move to the next section, I just want to mention that there are many other vectorized functions in R that may help speed up your code. Some examples are which(), any(), all(), cumsum(), and cumprod(). When working with matrices you may use rowSums(), colSums(), lower.tri(), upper.tri(), and many others. When dealing with something that seems like it could be vectorized, chances are there's already a function for it.
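As a small illustration of a few of these (again, separate from the SMA code):
m <- matrix(1:6, nrow = 2)
rowSums(m)               # row totals without an explicit loop
#> [1]  9 12
cumsum(c(1, 2, 3, 4))    # running total without an explicit loop
#> [1]  1  3  6 10
which(c(5, 8, 2, 9) > 4) # positions that satisfy a condition
#> [1] 1 2 4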
Vectorize iterations
Now we will use another approach to vectorization, which involves the family of *apply() functions (e.g. lapply(), sapply(), and so on). These functions allow us to apply another function to each element of the object we pass, regardless of whether that function is built into R, comes from an external package, or was created by us.
This produces simpler code than explicit loops and will also make your implementation faster, because these functions are highly optimized. Whenever you can, you should use them instead of explicit iterations.
NOTE: Technically, the apply() function is not as optimized as the other functions in its family, so the performance gains won't be as large as with the others, but code clarity will indeed increase.
To use this technique, we will abstract the logic inside the for-loop into its own function, named sma_from_position(), and we will use the sapply() function to apply it to each of the subsets in the data.
To do this, we pass the vector of end positions as the first argument to sapply(), followed by the function we want to apply (note that we don't use parentheses, because we want to pass the function object itself, not call it and pass its result). Since period and data are fixed objects, we pass them along after the function name as the remaining two arguments for sma_from_position().
An added benefit of the sapply() function is that it takes care of memory pre-allocation for us, which is a very efficient way to reduce execution time in R.
Also, note that instead of using an explicit if-else conditional, we use the ifelse() function, which takes the condition to be checked as its first argument, the desired result when the condition is met as its second argument, and the desired result when the condition is not met as its third argument.
Finally, note that to reduce the use of data frames even further, we keep only the price_usd column in the first line of the sma_5() function.
sma_5 <- function(period, symbol, data) {
    data <- data[data$symbol == symbol, "price_usd"]
    return(sapply(1:length(data), sma_from_position, period, data))
}

sma_from_position <- function(end, period, data) {
    start <- end - period + 1
    return(ifelse(start >= 1, mean(data[start:end]), NA))
}
Again, we benchmark and check for correctness. This is the last one, I promise.
performance <- microbenchmark(s5 <- sma_5(period, symbol, data), unit = "us")
all(abs(s1$sma - s5) <= 0.001, na.rm = TRUE)
#> TRUE
summary(performance)$median
#> [1] 224.731
The median time for sma_5() is 224.731 microseconds, which is 40% of our previous best time. We will stop here, but note that we have almost achieved our goal of 200 microseconds, and that we have achieved performance improvements of orders of magnitude overall. Did you think that was possible with these kinds of simple changes?
Conclusion
The performance improvement has been drastic. Our best implementation takes around 0.15% of our initial benchmark time. This means that if we were actually using this code with data that required our best implementation to take one hour to finish, the implementation we started with would take around 26 days to achieve the same results.
The fact that we can achieve this level of performance improvement with such simple changes is surprising to me, and that's why, as I mentioned in the introduction, good R code can often be fast enough, but bad R code can be painfully slow.
| Implementation | Median (microseconds) | % of sma_1 |
|---|---|---|
| sma_1 | 142,136.00 | 100.00% |
| sma_2 | 97,514.58 | 68.60% |
| sma_3 | 4,425.707 | 3.11% |
| sma_4 | 561.0705 | 0.39% |
| sma_5 | 224.731 | 0.15% |
To further reduce the execution time of our implementations, we would have to resort to more advanced techniques, such as parallelization and/or delegation to statically typed languages like Fortran or C++.
To learn more about the topics covered in this post, such as how to parallelize your code or delegate it to Fortran or C++, as well as how to find which parts of your code are the performance bottlenecks, you may read my recent book, "R Programming by Example."
In it, I use three real-world examples to teach basic-to-advanced techniques for R programming. As I have said before, if you do read it, make sure to send me any comments or questions you may have! I appreciate them.