Lean monitoring of production code
In this tutorial I will discuss a nimble solution to monitor code that is running in an agile production environment.
At the end of it, you will have:
- Learned how to use Python logging and
atexit
to integrate a easy but quite robust monitoring pattern in your code. - Got a template script via GitHub's gists.
Use-case overview
I have a very realistic use case in mind: monitoring a server that runs background jobs, in form of scripts.
What could they be, more concretely?
- Scripts to pre-calculate the results of expensive database queries, to avoid computing it in real-time (like in a API request handler).
- Charge users subscriptions!
- Send email to users for updates, notifications and more.
These are just few examples of a very common case indeed: a machine running in a cloud infrastructure that executes periodic jobs, that are scheduled, for example, with Cron.
Production applications need to be resilient, performant and, above all, measurable. This is a very broad topic that goes from Devevelopment Operations (DevOps), to Site Reliability Engineering (SRE), to Systems Control and Automation.
There's a ton of great knowledge out there on this subject. Google, for instance, has several recommended books about how they do SRE. AWS has extensive documentation about how to use their amazing infrastructure for applications performance's monitoring.
For sure, if you have a very, very large infrastructure with thousands of servers then you need experts on the matter and use all cutting-edge techologies.
But, in my opinion and work experience, this massive load of information is just overwhelming in the first stages of a new product.
What usually happens in the initial life-span of a new product (or application) is a lot of iteration between the development and the business teams and, in many occasions, the early customers that adopt the product. The figure below, taken from The Lean Startup, illustrate the concepts.
In my experience, dedicating time to design a proper production infrastructure to monitor how well the application code does is, initially, very frustrating for the developers.
The reason is that the application itself changes so quickly that the developers wills have to rebuild the "measurement infrastructures" plenty of times. This will drive down their motivations as well, which is in my opinion the worst that can happen to an agile organization. If the team's motivation goes down, the organization goes down.
In my team we set on a very simple solution to monitor how well our code does. It's so easy that we can reproduce it in less than 5 minutes, and you can do it too. I will show next how.
Fist, log everything
You should really stop using print
, or printf
or any standard printing functionality, unless that's the logging system of the programming language you chose.
Every good programming language comes with logging facilities (usually in form of a module in the standard library) and that's the one you shold be using.
There exist very good systems for log managements too; sentry, logstash and CloudwatchLogs are three names that you might want to learn more.
But like everything else, they require time to set-up and maintain the infrastructure. Here, I want to show a much more agile solution.
We use Python a lot, and we tipically initialize the logging in a module like this:
import logging
logging.basicConfig(
level=logging.DEBUG,
format="%(asctime)s [%(threadName)s] [%(levelname)s] %(message)s",
datefmt="%Y-%m-%d %H:%M:%S"
handlers=[...] # <-------- will get back to this in 1 minute!
)
log = logging.getLogger(__name__)
The application entrypoint
Quite simply, we use a __main__
call to a Python script. The trick though is that the entire call is wrapped around a try / except
, so that even when something goes wrong we don't loose all the logging messages and can still analyze them.
if __name__ == '__main__':
app_name = 'CODE_SAMPLE'
log.debug('Starting %s', app_name)
try:
entrypoint(sys.argv)
except:
log.exception('Error on %s', app_name)
else:
log.debug('Done.')
The actual entrypoint
As you saw in the __main__
above, the application logic is contained in a function that's called by the main. This function can really be anything; in our systems, it usually does many of these things:
- Calls to different databases.
- Calls to different external APIs.
- Calls to some internal APIs (for emails, geolocation, etc.).
For the sake of this tutorial, let's plug in some simpler logic, with a easy switch to generate an error so to see how the logs help.
def entrypoint(*args, **kwargs):
y = 5 + 3
log.debug('Computed y')
z = y + 2
#z = x + 2 # <---- This line has a typo that we'll use to test.
log.debug('Computed z')
The log handler
It's time to circle back to the log
object. We want it to be relient to crashes and bugs, so that even if the code fails somehow we can still retrieve the logs and store them somewhere.
To do this, we need to things.
- Instead of printing the logs to screen or to a file in the local system, let's just have a
string
in memory that keeps all logging messages for the current program run. This way, we don't have to handle files that can be corrupted or, even worse,stdout
messages that are just very difficult to retrieve. Because all logs are in a string, we can save and manipulate that string. - There must be a way to save the string object that contains the logs, regardless of whether the application exited successfully or not.
The first point can be achieved with the io
module in Python. Here's how to change the log definition.
import logging
import io # <----- New line
stringio_stream = io.StringIO() # <--- New line
logging.basicConfig(
level=logging.DEBUG,
format="%(asctime)s [%(threadName)s] [%(levelname)s] %(message)s",
datefmt="%Y-%m-%d %H:%M:%S"
handlers=[logging.StreamHandler(stringio_stream)] # <--- Changed line
)
log = logging.getLogger(__name__)
Let's talk about the second point. We could modify the __main__
so that it does what we need:
import sys
if __name__ == '__main__':
app_name = 'CODE_SAMPLE'
log.debug('Starting %s', app_name)
try:
entrypoint(sys.argv)
except:
log.exception('Error on %s', app_name)
else:
log.debug('Done.')
save_logs(log) # <----- New line
This will work OK. There's a nicer solution though, which works by using the atexit
package, that is part of the standard Python library.
atexit
has a function named register
that allows to register the execution of a function for when the program ends.
Basiaclly, no matter what happens within your code (crashes, bugs, or just a fine execution), the piece of code registered with atexit
will be executed just before the program exits.
With that being said, here's how we use to save the logs.
import atexit
atexit.register(
save_logs,
body=stringio_stream.getvalue() # <-- This is the trick!
)
What to do with the monitoring logs?
The last piece to the puzzle is to implement the function save_logs
. What to do in it really depends on what's easier for you and your team.
We save all the logs as files in a S3 bucket. It's pretty lean for us to go in the bucket, download the file related to a specific run of the application and look at the logs to see if something went wrong.
But you could do many different things. You could even send the logs to your team via email.
To show case this tutorial and have a template code that actually works, I will use the service cl1p.net. This is a free service online that allows everybody to send text and display it on a webpage for a brief time.
Here's how it would work in the save_logs
function.
import time
import requests
def save_logs(body):
# This function should implement a robust solution
# to store the log string.
# E.g., it could upload a file to S3.
# For the sake of this tutorial let's upload the string
# to cl1p.net
# In a real application you will want to save a file
# with the name `app_name + timestamp`.
#clip_name = app_name + '_' + str(int(time.time()))
# To use it with cl1p.net we must use a fixed name
clip_name = 'test-clip-name'
headers = {'Content-Type': 'text/html; charset=UTF-8'}
endpoint = 'https://api.cl1p.net/' + clip_name
requests.post(endpoint, headers=headers, data=body)
Connecting the dots
At this point, you can connect all code snippets and you will have a working template that you can extend further.
For your applications, you should change two parts of the template:
- Write the custom logic of your application in the
entrypoint
function. - Replace the call to
cl1p.net
with a custom, more resilient approach, such as a call to theboto3
functions to upload the file to S3.
Here's the entire template, and you can also download it from this GitHub gist.
import time
import logging
import io
import atexit
import sys
import requests
stringio_stream = io.StringIO()
logging.basicConfig(
level=logging.DEBUG,
format="%(asctime)s [%(threadName)s] [%(levelname)s] %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
handlers=[logging.StreamHandler(stringio_stream)]
)
log = logging.getLogger(__name__)
def entrypoint(*args, **kwargs):
# This function should contain the custom logic
# of your application.
# Make it as complicated as you need!
y = 5 + 3
log.debug('Computed y')
z = y + 2
#z = x + 2 # <---- This line has a typo that we'll use to test.
log.debug('Computed z')
def save_logs(body):
# This function should implement a robust solution
# to store the log string.
# E.g., it could upload a file to S3.
# For the sake of this tutorial let's upload the string
# to cl1p.net
# In a real application you will want to save a file
# with the name `app_name + timestamp`.
#clip_name = app_name + '_' + str(int(time.time()))
# To use it with cl1p.net we must use a fixed name
clip_name = 'test-clip-name'
headers = {'Content-Type': 'text/html; charset=UTF-8'}
endpoint = 'https://api.cl1p.net/' + clip_name
requests.post(endpoint, headers=headers, data=body)
if __name__ == '__main__':
app_name = 'CODE_SAMPLE'
log.debug('Starting %s', app_name)
try:
entrypoint(sys.argv)
except:
log.exception('Error on %s', app_name)
else:
log.debug('Done.')
atexit.register(
save_logs,
body=stringio_stream.getvalue() # <-- This is the trick!
)
Testing the template out
There are two tests I would want to do at your place.
First, let's just run the template as-is. It will take just one second. When it's finished, go to https://cl1p.net/test-clip-name
Surprised? Yes, that's exactly the full log trace of your code!
You could do many interesting things with it:
- First of all, check that there is no error.
- Analyze how much time passes between each message, to figure out if any part of the code is slow.
There is a second test I think it's worth checking out. Comment out line 27 and uncomment line 28. Like you can see, this will create a bug in the code, because the variable x
is not defined.
Run the code again, then go again on https://cl1p.net/test-clip-name
What do you see now? Yes, that's again the full log error but with very useful information with the entire stacktrace!
Even if the application crashed, we still saved useful information to fix it.
But the greatest benefit is that we did this basically with no effort! Take a look again at the code: it's really, really simple!
Hopefully this can be useful to bootstrap some interesting projects so you can save time and only move to complex monitoring and logging when you really need it!
Let me know if you run into any issue reproducing this tutorial.
Bonus: upload the logs to S3
I generally upload the logs as a file to S3. Here's the function I use, (you need to have the AWS credentials loaded as environment variables).
import boto3
from datetime import datetime
def save_logs(body, app_name):
now = datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%S')
key = app_name + '_' + now
client = boto3.client(
's3',
aws_access_key_id=environ['AWS_ACCESS_KEY'],
aws_secret_access_key=environ['AWS_SECRET_KEY']
)
bucket = environ['S3_LOG_BUCKET']
client.put_object(Body=body, Bucket=bucket, Key=key)
Hi Pietro,
I appreciate your sharing with this logging strategy, just wanted to add that with this approach extensive logging might be memory expensive too due to in memory buffer. Thank you for the knowledge share !
Hi Oren,
Great insight, thank you! For something more robust, but less agile, I use a NonStopQueueHandler with individual log messages stored in a database.
What is your approach?