I have spent days on multiple occasions debugging my way to root causes
Originally published @ https://stackup.hashnode.dev/curious-case-of-intermittent-issue
The ability to debug and troubleshoot issues not only saves time but also ensures the quality and stability of the final product, which is why coding skills alone do not define an efficient developer or engineer.
I have been working as a .NET developer for over 8 years now, and I won't lie: application issues have troubled me even in my dreams. I have spent days on multiple occasions looking for the root causes of, and fixes for, issues that landed in my bucket. Today I will describe one such dreaded issue as part of the ongoing #DebuggingFeb on Hashnode.
It was an enterprise application in the insurance domain, built with ASP.NET MVC and SQL Server. The app was already in production with tons of users; I had joined the team less than six months earlier and was working on a new module at the time.
One fine day, certain users complained that the address entity was not getting saved to the database during bulk import, while all other entities went through properly. The support team tried to replicate the issue but never hit it even once. They assumed it was a code issue triggered by some specific use case, and I was pulled in to have a look.
I tried replicating it too, on my local, QA, and staging environments, but the thing just worked as expected. It was also only a random subset of users who faced the issue; the majority weren't affected, and it was not the same set of users every time.
I then got hold of a few of the CSV files that actual users had uploaded in prod and tried those as well, and the import worked again, on all three environments.
Once I realised it was not a data issue, the only thing I could think of was to go through the entire codebase and understand the nitty-gritty. I remember spending more than 10 hours looking into every aspect of the code and importing dozens of files, trying to find a scenario in which the import could fail, but it kept working fine.
When I couldn't think of anything else, I wondered whether the users were doing something wrong. Since they were based in the USA, I asked the support team to get on a call with a few of them and record their screens as they performed the task.
The very next day, I went through the recordings and noticed that the users who were intermittently facing the issue had to log out, refresh, and log in to the app multiple times to reproduce it on the call.
At that moment I realised that since they were logging in again and again to recreate it, maybe it had something to do with the servers. I dug into the server documentation and found that we had three app servers behind a load balancer. The issue could have been on just one of them, which would explain why it was so random and not reproducible at all in any of the lower environments.
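In hindsight, one lightweight way to tie a symptom like this to a specific node behind a load balancer is to stamp every response with the name of the machine that served it. Here is a minimal sketch for an ASP.NET MVC app; the filter and header names are my own, not something the project actually had:

```csharp
using System;
using System.Web.Mvc;

// Hypothetical global action filter: adds a header naming the node that
// served each response, so intermittent issues can be traced to one server.
public class ServedByHeaderAttribute : ActionFilterAttribute
{
    public override void OnResultExecuting(ResultExecutingContext filterContext)
    {
        filterContext.HttpContext.Response.AddHeader("X-Served-By", Environment.MachineName);
        base.OnResultExecuting(filterContext);
    }
}

// Registered once, e.g. in FilterConfig / Global.asax:
// GlobalFilters.Filters.Add(new ServedByHeaderAttribute());
```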
I now had a probable cause and needed to know whether something had happened recently on the servers. I got the Infra team's details and mailed them to find out if anything had changed on the servers in question.
They were quick to reply that a security patch had been installed on two of them in the past week and that both servers had been restarted as well.
I was now quite sure I was on the right track, but I still did not have the root cause or a fix. The next step was to figure out which server was causing the issue.
I googled how to reach a server while bypassing the load balancer and found that I could do so either by hitting its IP directly or by RDP-ing onto the server and accessing the application locally there. I followed the second approach, and within minutes I was able to recreate the issue on one of the servers.
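For the first approach, the idea is simply to send requests to each app server's IP instead of the public hostname, so the load balancer never gets involved. A rough sketch of such a probe is below; the IPs, hostname, and /health path are made up (the real app may not expose such an endpoint, and hitting an IP over HTTPS will usually trip certificate-name checks):

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class NodeProbe
{
    static async Task Main()
    {
        // Hypothetical node IPs -- the real ones would come from the infra documentation.
        string[] nodeIps = { "10.0.0.11", "10.0.0.12", "10.0.0.13" };

        using var client = new HttpClient();
        foreach (var ip in nodeIps)
        {
            // Address the node directly, but keep the original Host header so
            // host-based bindings on the server still resolve the right site.
            var request = new HttpRequestMessage(HttpMethod.Get, $"http://{ip}/health");
            request.Headers.Host = "insurance-app.example.com";

            var response = await client.SendAsync(request);
            Console.WriteLine($"{ip} -> {(int)response.StatusCode}");
        }
    }
}
```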
I went through the Windows services running on that server and found that the service responsible for validating addresses and making Google API calls to verify them was stopped. It probably just had not started again after the reboot.
Simply starting the service did the trick, and the import was working again. All I had to do was right-click the service and click Start. That was the fix for an issue I had been chasing for more than two days.
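For anyone facing something similar, the same check-and-start can be scripted so it doesn't depend on someone remembering to open services.msc after a reboot. A small sketch using the .NET ServiceController class (the service name is a placeholder, and the code needs to run with admin rights):

```csharp
using System;
using System.ServiceProcess;   // requires a reference to System.ServiceProcess

class EnsureServiceRunning
{
    static void Main()
    {
        // "AddressValidationService" is a placeholder for the real service name.
        using (var sc = new ServiceController("AddressValidationService"))
        {
            Console.WriteLine($"Current status: {sc.Status}");

            if (sc.Status == ServiceControllerStatus.Stopped)
            {
                sc.Start();
                // Wait up to 30 seconds for the service to report Running.
                sc.WaitForStatus(ServiceControllerStatus.Running, TimeSpan.FromSeconds(30));
                Console.WriteLine("Service started.");
            }
        }
    }
}
```

A longer-term safeguard is to set the service's startup type to Automatic (Delayed Start) and configure recovery actions, so a reboot cannot silently leave it stopped.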
I had to share an email with the RCA for the issue, and it was then that I learned the Infra team would inform the QA team of such activities on the servers, and QA would perform smoke testing once the restart was done.
When they analysed how they had missed the issue, it turned out they had run all the approved test cases and everything had passed. But they had never tested a scenario in which one of the servers might have a problem: they ran the smoke tests via the public URL, which went through the load balancer, so they probably landed on the two healthy servers when they tested.
They were asked to update their test cases to cover each server individually after such updates.
I realised that before digging into the code and spending hours looking for a fix, I should first consider the obvious explanations for something so intermittent and random.
In most cases it is either a data issue or an infra issue, and that is the approach I have started taking whenever I encounter such not-so-straightforward problems, rather than looking anywhere and everywhere without a plan.
Maybe I was not well enough equipped at that stage of my career to tackle the issue quickly and effectively, but it was quite a learning experience.