How we trace our way out using Event-Tracing
Event tracing is one of those topics whose value we all recognize but always push down the priority list. Adding tracing to your flows and architecture (it pays off most in a microservices ecosystem) is crucial for finding your feet when you need to debug or manage a crisis.
Scenario: We have multiple flows in our microservices system. Each flow has a starting point and one or more end points, depending on the flow.
Unfortunately, one of our services fails and the flow doesn't end as expected -> Boom -> Alerts go off and exceptions flood the logs. Where shall we start? What caused the issue, and which request triggered the trouble? So what do we do first?
You are probably using a log aggregation framework (ELK stack, etc.), so we sort the logs by service and time and start trying to figure out what happened. At this point we still struggle to get any clarity. When you have multiple complex flows and paths, you can't really tell which chain of events led to the issue. Yes, we can start using timestamps, and yes, we can start drawing architecture scenarios (I remember those discussions: "Hey, do you remember we had flow A triggered when flow B ends, but alternatively it could trigger flow C in case those conditions are met??") to find our way. Before you notice it, you start acting like Hansel and Gretel, trying to follow their breadcrumbs back out of the woods.
Tracing could make your life easier, especially in a crisis, when you are working under pressure. So... "Keep calm and look at your trace".
If you could draw the path of your flow (just like an airplane's GPS does), you could easily track and address issues.
Adding a tracing system to your components isn't complicated, but it does need consistent, methodical work. The earlier you apply the standard, the easier it is to embed.
Tracing is getting more and more popular as teams running complex systems discover its value. The problem today is that there are several tracing framework vendors and no real standard. However, the W3C has a candidate tracing standard in progress:
https://www.w3.org/TR/trace-context/
I will keep it simple for you.
Tracing combines 3 key values:
CorrelationId:
A parameter that identifies the complete traced flow. It is created once, when the flow starts, and passed from one service to the next.
ParentId:
A parameter that indicates where the request came from (who called me?). It is passed in by the previous process.
EventId:
A parameter that is used inside the service/process scope. It is created for each sub-request.
These parameters should be read and/or created at every "entry point" of your service/process.
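To make the three values concrete, here is a minimal Python sketch of what an entry point could do with them. The header names, the TraceContext class and its methods are all assumptions made up for this illustration; they do not come from any particular tracing vendor or from the W3C draft.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Mapping, Optional

# Hypothetical header names, chosen only for this sketch.
CORRELATION_HEADER = "X-Correlation-Id"
PARENT_HEADER = "X-Parent-Id"


def new_id() -> str:
    """Generate a random identifier."""
    return uuid.uuid4().hex


@dataclass(frozen=True)
class TraceContext:
    correlation_id: str        # same value for the whole flow
    parent_id: Optional[str]   # event id of the caller; None when the flow starts here
    event_id: str              # fresh id for the work done inside this service

    @classmethod
    def from_headers(cls, headers: Mapping[str, str]) -> "TraceContext":
        """Read or create the values at a service entry point."""
        return cls(
            # Reuse the incoming correlation id, or start a new flow.
            correlation_id=headers.get(CORRELATION_HEADER) or new_id(),
            parent_id=headers.get(PARENT_HEADER),
            event_id=new_id(),
        )

    def outgoing_headers(self) -> Dict[str, str]:
        """Headers to attach when this service calls the next one."""
        return {
            CORRELATION_HEADER: self.correlation_id,
            # Our event id becomes the callee's parent id.
            PARENT_HEADER: self.event_id,
        }


# Service A starts the flow, then calls service B.
first = TraceContext.from_headers({})
second = TraceContext.from_headers(first.outgoing_headers())
```

In this sketch, second shares first's correlation id and its parent id equals first's event id, which is what lets you rebuild the call chain later.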
Rollback Benefit:
Tracing enables your architecture to roll back cross-service operations.
I mentioned the correlationId parameter. Since the correlationId is passed from one process to the next, you can use it as a unique aggregate ID for a "transaction". If you need to roll back the state/transaction across all your services, just pass them the correlationId, and your services should be smart enough to roll back the operation. (In my next post I will demonstrate possible ways to do this in detail.)
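Just to give a feeling for the idea, here is one possible shape for it inside a single service, sketched in Python. This is an assumption of mine for illustration, not a complete design: CompensationLog, debit and the in-memory balances dict are hypothetical, and a real service would persist its undo information.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List


class CompensationLog:
    """Undo actions recorded by one service, grouped by correlationId (hypothetical helper)."""

    def __init__(self) -> None:
        self._undo: DefaultDict[str, List[Callable[[], None]]] = defaultdict(list)

    def record(self, correlation_id: str, undo: Callable[[], None]) -> None:
        """Remember how to undo one operation of the given flow."""
        self._undo[correlation_id].append(undo)

    def rollback(self, correlation_id: str) -> int:
        """Undo everything recorded for this flow, newest first. Returns the number of undone operations."""
        actions = self._undo.pop(correlation_id, [])
        for undo in reversed(actions):
            undo()
        return len(actions)


# In-memory stand-in for the service's own data store.
balances: Dict[str, int] = {"alice": 100}
log = CompensationLog()


def debit(correlation_id: str, account: str, amount: int) -> None:
    """Apply a change and record how to reverse it."""
    balances[account] -= amount

    def undo() -> None:
        balances[account] += amount

    log.record(correlation_id, undo)


debit("corr-1", "alice", 30)
debit("corr-1", "alice", 20)
debit("corr-2", "alice", 5)   # a different flow, must not be touched

undone = log.rollback("corr-1")
```

After the rollback, only the two operations that belong to "corr-1" are reversed; the debit made under "corr-2" stays in place.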
Retry Benefit:
Tracing gives you the ability to retry specific events. Let's assume you have a transaction going across all your services. But, bad luck, one of the services has terminated (for whatever reason). Now your system is in a state I call the "Half Baked State", where some of your services applied the transaction and others didn't. Since tracing gives you the eventId (mentioned above), you can tell the specific service to retry the failed transaction once it has recovered, and bring everything back into alignment.
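Here is a small Python sketch of how a service could use the eventId for this. It assumes the service remembers which event ids it has already applied, so that retrying is safe; EventProcessor and the fake "database" flag are hypothetical names for this illustration only.

```python
from typing import Callable, Dict, List, Set


class EventProcessor:
    """Applies each event at most once, keyed by eventId (hypothetical helper)."""

    def __init__(self) -> None:
        self.applied: Set[str] = set()
        self.failed: Dict[str, Callable[[], None]] = {}

    def handle(self, event_id: str, action: Callable[[], None]) -> bool:
        """Run the action unless this event was already applied. Returns True once it is applied."""
        if event_id in self.applied:
            return True  # already done, so a retry is a no-op
        try:
            action()
        except Exception:
            self.failed[event_id] = action  # keep it so it can be retried later
            return False
        self.applied.add(event_id)
        self.failed.pop(event_id, None)
        return True

    def retry(self, event_id: str) -> bool:
        """Re-run a previously failed event by its id."""
        action = self.failed.get(event_id)
        if action is None:
            return event_id in self.applied
        return self.handle(event_id, action)


# Simulated dependency that is down at first.
state = {"database_up": False}
orders: List[str] = []


def save_order() -> None:
    if not state["database_up"]:
        raise ConnectionError("database unavailable")
    orders.append("order-42")


processor = EventProcessor()
first_try = processor.handle("evt-7", save_order)   # fails, the service is "half baked"
state["database_up"] = True                          # the service recovers
second_try = processor.retry("evt-7")                # applies the event
third_try = processor.retry("evt-7")                 # already applied, nothing happens
```

The important design point is the applied set: because the service knows which event ids it has already handled, asking it to retry twice does not apply the transaction twice.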
Other tracing benefits:
Distinguish between the different paths in your flow (without relying on timestamps or architecture guesswork).
Datasource tracing: If you use a unique ID for each datasource transaction record, you can use the eventId for it. This extends your overall path into your datasource records (later on, the correlationId will expose all related eventIds, which in turn will expose all affected records in the different data sources).
Monitor components as a whole. Since the correlationId has the same value in all participating components, you can monitor them together and tackle unwanted actions.
Customer oriented. You can always return the correlationId to your customer and use it later if any business/non-business issues arise. It will be easier to track the flow by looking back for the assigned correlationId.
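As an example of the "monitor components as a whole" point, here is a minimal sketch that stamps every log line with the current correlationId using Python's standard logging module, so a log aggregator can filter a whole flow by one value. The logger name, the log format and the use of a context variable are assumptions for this illustration; the output goes to an in-memory buffer only to keep the example self-contained.

```python
import contextvars
import io
import logging

# Holds the correlation id of the request currently being handled.
correlation_id_var = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Attach the current correlation id to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id_var.get()
        return True  # never drop the record, only enrich it


stream = io.StringIO()  # stand-in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(correlation_id)s %(name)s %(message)s"))
handler.addFilter(CorrelationIdFilter())

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# Two flows interleaved in the same service.
correlation_id_var.set("corr-1")
logger.info("order received")
correlation_id_var.set("corr-2")
logger.info("order received")
correlation_id_var.set("corr-1")
logger.info("payment failed")

# What "search the logs by correlationId" boils down to.
lines_for_corr_1 = [
    line for line in stream.getvalue().splitlines() if line.startswith("corr-1 ")
]
```

Once every service writes the same correlationId into its logs, the customer-facing case gets the same treatment: the id you returned to the customer is the search key.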
As you can see, the benefits are tremendous.
My tip: Apply tracing as early as possible. It can be a real pain to add once your system gets bigger and more complex.
In my next post I will review a couple of tracing vendors, explain how we apply tracing in our components, and show which frameworks we use to visualize the traces.
taken from my blog at: http://idanfridman.com/2019/08/15/this-is-how-event-tracing-saved-us/
Cheers,
Idan