Django Optimization: Or how we avoided memory mishaps
(Cross-posted from my blog on Medium)
Working in a tech startup is akin to fighting a series of fires that crop up every now and then (faster, if you're iterating at breakneck speed). Each time, you douse the fire just enough to buy some calm for the next few months, knowing in the back of your mind that it isn't over. It thus becomes important to pick your battles.
The Problem at Hand
First, some context: we have a Celery task that runs every night and performs some time-consuming but important computations that preserve the "freshness" of a certain table in the database. It is one of the critical parts of our codebase, and it is essential that this task completes successfully every day. It has given us trouble before, owing to the relatively large amount of data it churns through, and has seen more than one optimization applied to it. Recently, the task stopped completing, and I found segfault messages while investigating the production logs for a different issue.
Consider an ultra-simplified version of the task code:
books = Book.objects.filter(some_field=some_condition)
authors = Author.objects.filter(some_other_field=some_condition)

results = {}
for x in authors:
    for y in books:
        # Insert computations here
        results[(x, y)] = True  # Not that simple, but still
return results
books and authors are both Django querysets, although not representative of our actual models. The latter had 2,012 objects and the former about 17k at the time of writing. Since this task's inception, this double for loop has served us well. Then we hit the first memory snags last year, and now the recent segfaults. I decided to strip the code down to the bare essentials, similar to what I've shown above, and run a few tests while watching htop. Here's a gif of htop while I ran the simplified code on our staging machine, which has 4 gigs of RAM, as opposed to 10 gigs on production:
Memory usage rises pretty fast, and before we know it, we hit the 4 gig limit on the machine. Measured in terms of space complexity, this is O(MN), where M and N are the sizes of the authors and books querysets respectively; both grow roughly linearly over time, with M growing much more slowly than N. With about 2,012 authors and 17k books, that is already around 34 million (author, book) pairs accumulating in the results dictionary.
Batching the Queryset
I then re-applied the first optimization, which had originally been added a year ago: batched querysets. There are two main places¹ where memory is consumed in our script: the Python database connector (in our case, the MySQL connector), which fetches the results from the database, and the queryset cache. Queryset batching targets the first.
Here's the code with batching applied to the inner (books) queryset:
from django.db.models.query import QuerySet

def batch_qs(qs, batch_size=1000):
    """
    Returns a (start, end, total, queryset) tuple for each batch in the given
    queryset. Useful when memory is an issue. Picked from djangosnippets.
    """
    if isinstance(qs, QuerySet):
        total = qs.count()
    else:
        total = len(qs)
    for start in range(0, total, batch_size):
        end = min(start + batch_size, total)
        yield (start, end, total, qs[start:end])
books = Book.objects.filter(some_field=some_condition)
authors = Author.objects.filter(some_other_field=some_condition)

results = {}
for x in authors:
    for _, _, _, qs in batch_qs(books, batch_size=1000):
        for y in qs:
            # Insert computations here
            results[(x, y)] = True  # Not that simple, but still
return results
We execute the task once again, and take a look at htop.
Hmm. Memory usage grows more slowly than last time, but it still inevitably hits the limit and segfaults. This makes sense: queryset batching does not reduce the total number of rows fetched from the database. It only bounds how many of them are held in memory at a time, controlled by the batch size. This leads us to our current predicament, and to the second place to optimize.
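To see what batching does (and doesn't) buy us, note that each batch handed out by batch_qs is just a slice of the queryset, which Django translates into a separate LIMIT/OFFSET query. A quick sketch, reusing the hypothetical models from above and printing the generated SQL via the queryset's query attribute:

books = Book.objects.filter(some_field=some_condition)

# Each slice is its own query; only batch_size rows are fetched per round trip.
print(books[0:1000].query)      # SELECT ... LIMIT 1000
print(books[1000:2000].query)   # SELECT ... LIMIT 1000 OFFSET 1000

Over a full pass we still pull every matching row out of the database; batching only caps how many of them the connector and the slice's cache hold at any given moment.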
The Nature of Querysets
Django querysets are said to be lazily loaded and cached¹ ². Lazy loading means that until you perform certain actions on the queryset, such as iterating over it, the corresponding DB query won't be made. Caching means that if you re-use the same queryset, multiple DB queries won't be made.
qs = Book.objects.filter(id=3)
# The DB query has not been executed at this point
x = qs
# Just assigning variables doesn't do anything
for x in qs:
    print(x)
# The query is executed at this point, on iteration
for x in qs:
    print("%d" % x.id)
# The query is not executed this time, due to caching
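The cache itself lives on the queryset, in the private _result_cache attribute. Poking at a private attribute is only suitable as a debugging aid, but it does make the behaviour visible:

qs = Book.objects.filter(some_field=some_condition)

print(qs._result_cache is None)   # True: nothing has been fetched yet
for book in qs:
    pass                          # iterating evaluates the queryset
print(len(qs._result_cache))      # the entire result set now lives on qs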
Now how does caching come into the mix here? It turns out that due to the cache, we aren't able to "throw away" (garbage collect) the batches that have already been used. In our case, we use each batch only once, and caching them is a wasteful use of memory. The caches would not be cleared until the end of the function. To combat this, we use the iterator()³ function on the queryset.
...
for x in authors:
    for _, _, _, qs in batch_qs(books, batch_size=1000):
        for y in qs.iterator():
            # Insert computations here
            results[(x, y)] = True  # Not that simple, but still
...
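If you want to convince yourself that iterator() really does bypass the cache, the same (purely illustrative) _result_cache check works here too:

qs = Book.objects.filter(some_field=some_condition)

for book in qs.iterator():
    pass
print(qs._result_cache is None)   # True: iterator() streamed the rows without caching them

for book in qs:
    pass
print(qs._result_cache is None)   # False: plain iteration cached the whole result set

The flip side is that, since nothing is cached, every call to iterator() runs the query again.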
Let's give this a spin. Memory usage growth should slow down to a crawl.
But it doesn't!
Somehow, despite the use of iterator, we're storing stuff in memory somewhere. Let's take a good look at the code. Are there any references to objects that could be preventing the garbage collector from reclaiming memory? Turns out there is one place - the dictionary used to store tuples of objects as keys.
results[(x, y)] = True
A simple test I ran was to comment out this line and see how much memory the task takes. Memory usage was no longer climbing frantically, and instead, was frozen at a certain value.
For each author in the outer loop, we were iterating over different "copies" of the books in the inner loop. For two authors A1 and A2, the tuples (A1, B) and (A2, B) pointed to two different copies of B in memory, because iterator() builds fresh instances on every pass (see the sketch below). Thus, we would have to reintroduce caching, but on our terms.
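Here's a tiny illustration of those duplicate copies, assuming the queryset is non-empty: every call to iterator() re-runs the query and constructs brand-new model instances.

qs = Book.objects.filter(some_field=some_condition)

a = next(iter(qs.iterator()))   # first pass: fresh query, fresh instance
b = next(iter(qs.iterator()))   # second pass: another query, another instance

print(a.pk == b.pk)   # True  - same database row
print(a is b)         # False - two distinct objects taking up memory

Every tuple stored in results was keeping one such throwaway copy alive, which is exactly what the hand-rolled cache below avoids.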
books = Book.objects.filter(some_field=some_condition)
authors = Author.objects.filter(some_other_field=some_condition)

results = {}
book_cache = {}
for x in authors:
    for _, _, _, qs in batch_qs(books, batch_size=1000):
        for y in qs.iterator():
            # Insert computations here
            if y.id not in book_cache:
                book_cache[y.id] = y
            cached_y = book_cache[y.id]
            results[(x, cached_y)] = True  # Not that simple, but still
return results
And there we go. Memory usage grows slowly, hovers around certain values, and after a while, drops significantly (not shown). The purpose of book_cache is to store every book object seen during one full pass over all the books, for re-use in subsequent passes. The fresh objects bound to y in those subsequent passes get garbage collected, and the cached versions are used instead. Now, for (A1, B) and (A2, B), both Bs point to the same object in memory. In the end, iterator() allows us to control how we spend our free memory. In terms of model instances held in memory, the space complexity of this final code is O(M + N), as we keep only a single copy of every book object.
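As a small aside, the cache lookup can be collapsed with dict.setdefault, which inserts the value only when the key is missing and returns whichever copy is stored; this is just an equivalent spelling of the same inner loop:

for x in authors:
    for _, _, _, qs in batch_qs(books, batch_size=1000):
        for y in qs.iterator():
            # Insert computations here
            cached_y = book_cache.setdefault(y.id, y)  # keeps the first copy seen, lets later copies be collected
            results[(x, cached_y)] = True  # Not that simple, but still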
Note that this solves our problem for now. As the database grows in size, there will be more problems down the line. This means that the code isn't perfect, but then, the target is never "perfection", it's "optimize for the now".
PS: To address some of the Medium comments: I could indeed store pairs of object IDs as keys, and that would have been the ideal solution. It doesn't work for me in this situation, though, because the code above is a small section of a larger workflow in which the object instances themselves are required.
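For completeness, that ID-keyed variant would look roughly like this; the dictionary then holds only small tuples of integers, at the cost of having to re-fetch instances wherever the results are consumed:

results = {}
for x in authors:
    for _, _, _, qs in batch_qs(books, batch_size=1000):
        for y in qs.iterator():
            # Insert computations here
            results[(x.id, y.id)] = True  # plain integer keys; no model instances retained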
[1] Memory Efficient Django Queries (www.poeschko.com/2012/02/memory-efficient-django-queries/)
[2] Django Querysets (https://docs.djangoproject.com/en/2.0/topics/db/optimization/#understand-queryset-evaluation)
[3] iterator() (https://docs.djangoproject.com/en/2.0/ref/models/querysets/#iterator)