Optimizing for Latency vs Throughput
Check out the original version of this article on my Developer Blog.
Every developer is familiar with the terms latency (a.k.a. request time) and throughput (or req/s) in relation to web applications.
We all instinctively know that the two are interrelated: if I manage to reduce the server workload needed to execute a single request, the throughput will go up as well. Of course, if a certain request requires X amount of CPU work (let's call it CPU-seconds), then - ignoring IO and other delays - each request will finish in about X seconds, and the throughput on a C-core machine will be C/X requests per second.
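To put that relationship into numbers, here's a tiny back-of-the-envelope sketch; the values of X and C are made up purely for illustration:

```csharp
using System;

// Back-of-the-envelope estimate; the numbers are made up for illustration.
double cpuSecondsPerRequest = 0.2; // X: CPU work needed by one request
int coreCount = 4;                 // C: number of cores on the machine

double latencySeconds = cpuSecondsPerRequest;             // single-threaded: ~X seconds per request
double maxThroughput  = coreCount / cpuSecondsPerRequest; // ~C/X requests per second

Console.WriteLine($"Latency: ~{latencySeconds}s, max throughput: ~{maxThroughput} req/s");
// Latency: ~0.2s, max throughput: ~20 req/s
```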
We can see this visually here (for simplicity I'm going to ignore async and similar effects):
What most people don't realize is that this win-win/lose-lose relationship only holds as long as every request is handled on a single thread from beginning to end (or hops between threads, but never runs work in parallel). As soon as you start parallelizing a request, the direct relationship breaks and you introduce a trade-off between your end users' much-desired latency and your performance auditor's maximum throughput numbers. Let's modify the example above to process the request in parallel on all 4 cores.
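In code, the change might look roughly like this (a minimal TPL sketch; `ProcessChunk`, the chunk count, and the work inside it are stand-ins for whatever your request actually does):

```csharp
using System;
using System.Threading.Tasks;

var chunks = new[] { 0, 1, 2, 3 }; // one slice of the request's work per core

// Before: the whole request runs on a single thread, taking ~X seconds.
// foreach (var chunk in chunks) ProcessChunk(chunk);

// After: the same work is partitioned across all available cores,
// ideally finishing in ~X/4 seconds on our 4-core machine.
Parallel.ForEach(chunks, ProcessChunk);

static void ProcessChunk(int chunk)
{
    // Stand-in for one slice of the request's CPU-bound work.
    double acc = 0;
    for (int i = 0; i < 10_000_000; i++) acc += Math.Sqrt(i + chunk);
}
```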
In theory, parallelism looks like a holy grail here, allowing us to reduce request times fourfold in our 4-CPU example:
Notice how the directly proportional relationship just broke: our requests got a lot shorter without any change in maximum throughput (because even though every request now takes X/4 seconds, it is using up 4 CPU cores instead of just 1). But this is still perfect, right? We reduced latency a lot and throughput didn't change - sounds great!
Except this was the in-theory kind of parallelism, the kind you can never pull off perfectly on an actual machine. Every parallel execution has some overhead. What used to be a simple loop doing some work over each item now becomes: partition the work, queue it, wait for an available Thread Pool thread (possibly create one), do the work, join the results.
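To make that overhead tangible, here is a rough sketch you could run yourself; `Work` and the iteration counts are arbitrary stand-ins, and the actual numbers will vary wildly between machines:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

var sw = Stopwatch.StartNew();
for (int i = 0; i < 4; i++) Work(i);   // plain loop: no coordination cost at all
Console.WriteLine($"Sequential: {sw.ElapsedMilliseconds} ms");

sw.Restart();
Parallel.For(0, 4, i => Work(i));      // partition, queue, wait for pool threads, join
Console.WriteLine($"Parallel:   {sw.ElapsedMilliseconds} ms");

// Expect the parallel run to be faster, but not a full 4x faster: the
// partitioning, scheduling and joining eat into the gain, and those extra
// CPU cycles are no longer available for serving other requests.

static double Work(int seed)
{
    // Stand-in for one slice of a request's CPU-bound work.
    double acc = 0;
    for (int i = 0; i < 5_000_000; i++) acc += Math.Sqrt(i + seed);
    return acc;
}
```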
This is a more realistic picture of what would happen in practice:
So as you can see, our latency still decreased (by 50%), but because of that overhead our throughput decreased too!
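To put rough numbers on it (purely illustrative - the per-core overhead figure is an assumption chosen to match the picture above): say a request needs 1 CPU-second of real work, and fanning it out across 4 cores costs an extra 0.25 CPU-seconds of coordination per core.

```csharp
using System;

// Illustrative arithmetic only; the overhead figure is an assumption.
double workCpuSeconds  = 1.0;   // X: real CPU work needed per request
double overheadPerCore = 0.25;  // assumed cost of partition/queue/join, per core
int cores = 4;

// Single-threaded: ~1.0s latency, ~cores / X = 4 req/s max throughput.
double seqLatency    = workCpuSeconds;
double seqThroughput = cores / workCpuSeconds;

// Parallelized: latency ~ X/4 + overhead = 0.5s (50% better!),
// but each request now burns 1.0 + 4 * 0.25 = 2.0 CPU-seconds,
// so maximum throughput drops to cores / 2.0 = 2 req/s.
double parLatency    = workCpuSeconds / cores + overheadPerCore;
double parCpuSeconds = workCpuSeconds + cores * overheadPerCore;
double parThroughput = cores / parCpuSeconds;

Console.WriteLine($"Sequential: {seqLatency}s latency, {seqThroughput} req/s");
Console.WriteLine($"Parallel:   {parLatency}s latency, {parThroughput} req/s");
```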
Now, whether this is worth it is a business decision. The actual amount of overhead varies hugely and depends on the software architecture, coding practices, and external factors alike. I am not aiming to go into detail on these in this article, but reading any TPL / concurrency-focused article (including some here on my blog) will help if you need some brushing up. You can of course always fix the throughput by "throwing hardware at the problem" (a.k.a. horizontal scaling), but that may not be worth the cost from a business perspective.
My point for today is that there is a trade-off involved in parallelizing computation in Web APIs, and you should actively decide whether the decreased latency is worth the throughput drop. For some cases it is, for some it isn't.
Feedback
If you enjoyed this article, feel free to check out my Developer Blog where I share more stories like this.
If you have any feedback or wish to learn more about a topic with me in a 1:1 mentoring session, feel free to contact me!