
Java Production Debuggers to Save Time and Resources

Published Aug 03, 2021. Last updated Aug 06, 2021.

Production errors are one of the worst things that can happen to developers. They are almost always unexpected, sudden, and difficult to reproduce. Also, depending on your monitoring system, you might not be notified until a significant amount of damage has occurred. When a production error happens, the resulting workflow is typically as follows:

  1. Check the severity: is it a total service outage (most critical), or an uncommon bug affecting only a single user?

  2. Go through the logs to pinpoint which services are throwing this error.

  3. Attempt to remedy this error, typically by reproducing it locally. This is sometimes quite difficult to do due to the difference in systems and environments.
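Step 2 is often the slowest part, so it helps to automate the first pass over the logs. The following is a minimal sketch that counts ERROR lines per service. The log line format (`<timestamp> <LEVEL> [<service>] <message>`), the class name, and the sample service names are all assumptions made for illustration; real logs will need their own pattern.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ErrorsByService {
    // Assumed (hypothetical) line format: "<timestamp> <LEVEL> [<service>] <message>"
    private static final Pattern LINE =
            Pattern.compile("^\\S+ (\\w+) \\[([^\\]]+)\\] (.*)$");

    /** Counts ERROR lines per service; lines that do not match the format are skipped. */
    static Map<String, Integer> countErrors(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (m.matches() && "ERROR".equals(m.group(1))) {
                counts.merge(m.group(2), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines, only to show the shape of the input.
        List<String> sample = List.of(
                "2021-08-03T10:00:01Z INFO [checkout] order accepted",
                "2021-08-03T10:00:02Z ERROR [payments] card gateway timeout",
                "2021-08-03T10:00:03Z ERROR [payments] card gateway timeout",
                "2021-08-03T10:00:04Z ERROR [inventory] stock lookup failed");
        System.out.println(countErrors(sample));
    }
}
```

A per-service count like this will not find the bug for you, but it usually narrows down which service to look at first.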

Although this flow might seem simple in principle, it's quite tricky to carry out in actual enterprise production systems. Low-quality systems typically have these attributes:

  1. The system isn't fully monitored, and thus you aren't even aware of the production bug until a customer complains.

  2. The logs are either not comprehensive enough to pinpoint which part of your system is throwing the error, or so large and difficult to go through that finding the bug takes a significant amount of time.

A high-quality production debugger system typically allows you to:

  1. Spend less time searching for the bug and more time on actually fixing it

  2. Reduce downtime and increase customer satisfaction

  3. Add logs, snapshots & metrics to the live app in real-time without having to redeploy these new additions each time on a CI/CD pipeline (so it doesn't interrupt the app)

  4. Provide useful metrics & analytics about the application

  5. Use it without fear of the application's performance being affected in a significant way

Methods of Testing and Debugging in Production

There are many ways to test and debug in production. You generally want to pick a few of them and execute them effectively.

The first one is shadowing (aka mirroring). This usually entails sending a portion of the production traffic to a newly deployed service and seeing how it handles this load. This can be an attempt to reproduce a production bug/error or a method to check for potential errors before they even happen.
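To make the idea concrete, here is a minimal in-process sketch of shadowing. Every request is answered by the primary handler, and every Nth request is also replayed against a shadow handler whose result is only compared, never returned. The class name, the `Function<String, String>` handler shape, and the every-Nth sampling rule are assumptions for illustration; in practice mirroring is usually done asynchronously and often at the proxy or load balancer level rather than in application code.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

class ShadowingHandler implements Function<String, String> {
    private final Function<String, String> primary;   // serves real users
    private final Function<String, String> shadow;    // newly deployed version under test
    private final int mirrorEveryNth;                  // 1 = mirror all traffic
    private final AtomicLong seen = new AtomicLong();
    private final AtomicLong mismatches = new AtomicLong();

    ShadowingHandler(Function<String, String> primary,
                     Function<String, String> shadow,
                     int mirrorEveryNth) {
        if (mirrorEveryNth < 1) {
            throw new IllegalArgumentException("mirrorEveryNth must be >= 1");
        }
        this.primary = primary;
        this.shadow = shadow;
        this.mirrorEveryNth = mirrorEveryNth;
    }

    @Override
    public String apply(String request) {
        String response = primary.apply(request); // the caller only ever sees this
        if (seen.incrementAndGet() % mirrorEveryNth == 0) {
            try {
                // Synchronous here for simplicity; this adds latency to mirrored requests.
                String shadowResponse = shadow.apply(request);
                if (!Objects.equals(response, shadowResponse)) {
                    mismatches.incrementAndGet();
                }
            } catch (RuntimeException e) {
                // A failing shadow must never affect the real response.
                mismatches.incrementAndGet();
            }
        }
        return response;
    }

    long mismatches() {
        return mismatches.get();
    }

    public static void main(String[] args) {
        ShadowingHandler handler = new ShadowingHandler(
                String::toUpperCase,
                s -> s.isEmpty() ? "<empty>" : s.toUpperCase(), // differs on empty input
                1);
        handler.apply("hello");
        handler.apply("");
        System.out.println("mismatches: " + handler.mismatches());
    }
}
```

The mismatch counter is the interesting output: it shows where the new version behaves differently from the current one on real traffic, without any user seeing the difference.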

The second one is monitoring and logging. Logging means printing the values of certain variables to a console or storing them for later debugging. Monitoring, on the other hand, means displaying, saving, and analyzing the performance of the running application. Both are especially useful for remote debugging, when team members are far apart and can't work together in the same place.
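The difference between the two can be sketched in a few lines of Java using only the standard library. In the example below, the `Logger` calls are the logging part (recording values and failures), while the counters and the latency average are a very small stand-in for monitoring. The class and method names are hypothetical, and a real setup would export these numbers to a monitoring backend rather than keep them in memory.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

class MonitoredCall {
    private static final Logger LOG = Logger.getLogger(MonitoredCall.class.getName());

    private final AtomicLong calls = new AtomicLong();
    private final AtomicLong failures = new AtomicLong();
    private final AtomicLong totalNanos = new AtomicLong();

    /** Runs the work, logs its result or failure, and records call count and latency. */
    <T> T run(String operation, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            T result = work.get();
            // Logging: record a value for later inspection (FINE is hidden by default).
            LOG.log(Level.FINE, "{0} returned {1}", new Object[] {operation, result});
            return result;
        } catch (RuntimeException e) {
            failures.incrementAndGet();
            LOG.log(Level.SEVERE, operation + " failed", e);
            throw e;
        } finally {
            // Monitoring: aggregate numbers about how the application is behaving.
            calls.incrementAndGet();
            totalNanos.addAndGet(System.nanoTime() - start);
        }
    }

    long calls() { return calls.get(); }

    long failures() { return failures.get(); }

    double averageMillis() {
        long n = calls.get();
        return n == 0 ? 0.0 : totalNanos.get() / 1_000_000.0 / n;
    }

    public static void main(String[] args) {
        MonitoredCall monitor = new MonitoredCall();
        monitor.run("priceLookup", () -> 42);
        System.out.println("calls=" + monitor.calls()
                + " failures=" + monitor.failures()
                + " avgMs=" + monitor.averageMillis());
    }
}
```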

The third one is distributed tracing. This is aimed at modern large-scale systems that use a microservice architecture instead of a monolithic one. Debugging microservices can be difficult due to their distributed nature. This is where distributed tracing comes in.

"Distributed tracing, sometimes called distributed request tracing, is a method to monitor applications built on a microservices architecture. ... This allows them to pinpoint bottlenecks, bugs, and other issues that impact the application's performance." (Splunk)
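The core mechanism behind this is small: a trace ID is created at the edge, passed along with every downstream call, and attached to a timing record (a span) in each service. The sketch below simulates three services inside one process to show that flow. The header name `X-Trace-Id`, the service names, and the in-memory span list are assumptions for illustration only; real systems would rely on a tracing library and a standard propagation format rather than hand-rolled code like this.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

class TracingSketch {
    // Hypothetical header name used to carry the trace ID between services.
    static final String TRACE_HEADER = "X-Trace-Id";

    /** One timed unit of work in one service, tied to a trace by its ID. */
    static final class Span {
        final String traceId;
        final String service;
        final long nanos;

        Span(String traceId, String service, long nanos) {
            this.traceId = traceId;
            this.service = service;
            this.nanos = nanos;
        }

        @Override
        public String toString() {
            return traceId + " " + service + " " + nanos + "ns";
        }
    }

    // Stand-in for a tracing backend; not thread-safe, single-threaded sketch only.
    static final List<Span> COLLECTED = new ArrayList<>();

    /** Reuses the incoming trace ID (or starts a new trace) and records a span. */
    static String handle(String service, Map<String, String> headers, Runnable work) {
        String traceId = headers.computeIfAbsent(
                TRACE_HEADER, k -> UUID.randomUUID().toString());
        long start = System.nanoTime();
        try {
            work.run(); // may call further services, passing the same headers along
        } finally {
            COLLECTED.add(new Span(traceId, service, System.nanoTime() - start));
        }
        return traceId;
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        handle("gateway", headers,
                () -> handle("orders", headers,
                        () -> handle("payments", headers, () -> { })));
        COLLECTED.forEach(System.out::println);
    }
}
```

Because all three spans share one trace ID, a slow or failing request can be followed across services, and the per-span durations show where the time went.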

Debugging and Monitoring Production Tools

1. Lightrun Cloud

Lightrun Cloud is a free production debugger. You identify where you want to add a log, metric, or snapshot through the IDE or CLI, and then add any of those at runtime without interrupting the application. After that, you can inspect the output and fix the bug. This provides an effective way to debug your running application in production without affecting the customer experience. It also gives developers the flexibility to choose how to add the logs and provides easy-to-use plugins for adding them.

Lightrun mainly provides 3 different features:

  1. Logs: After adding logs, developers can inspect and analyze them in a tool of their choosing, such as Datadog or Elastic, or they can simply log them locally.

  2. Snapshots: These are more comprehensive than logs; they capture the full stack trace and data.

  3. Metrics: It provides support for conditional metrics, timers, the execution time of a specific section of code, latency measurements, and much more.

2. Logz.io

Logz provides a solution for centralized log management. It handles scaling, sharding, and index management, and it moves old logs to storage using a smart system so that you don't have to go through tons of logs. It also correlates logs and metrics so that you can understand how those metrics were generated, and it provides extensive visualization and monitoring dashboards.

Logz provides more than 50 integrations to hook into data sources. You don't have to create your own data dashboard; you can just deploy the existing ones provided by them. Moreover, you can analyze all of your data in Kibana and Prometheus side by side. Furthermore, Logz uses AI to identify issues before they actually cause any downtime. This essentially brings the knowledge and experience of other engineers into your troubleshooting process.


3. Datadog

Datadog provides both tracing and monitoring capabilities. It has a comprehensive set of features, such as log management, security monitoring, synthetic monitoring, and real user monitoring. Moreover, it provides an APM (application performance monitoring) tool, which can easily be used in Java applications.

Datadog's APM traces requests from end to end, which is especially helpful on distributed systems, where following a request gets tricky. It also monitors the end-to-end user experience using a web recorder so that developers can easily see where things went wrong on the user's side. Also, it correlates frontend performance with business impact by visualizing load times and splitting the data by custom attributes. This way, even if there are a lot of errors to go through, developers can start by fixing the most pressing ones.

Conclusion

One final point: a common mistake developers make is using too many debugging and monitoring tools. This can actually increase debugging time and resource usage instead of decreasing them. Therefore, you want to be careful and strategic with your choice of tools. Some tools save logs and results in a more concise and precise way than others. You should also think hard about the number of tools you want to integrate into your application and the footprint they are likely to leave behind. For each tool, weigh factors such as cost, the value it provides, how easy it is to use, and the amount of storage it will need.

Cover Photo by Nubelson Fernandes on Unsplash

Discover and read more posts from Josh Robins