It’s a common story: a team builds software, its user base expands, and a few years pass as developers add new features and fix bugs. Slowly but surely, more alarms fire and more errors are thrown. When the code was written years ago, it can be tricky to trace an alarm or exception back to its source. Logging and monitoring start to feel like a game of Whack-a-Mole as problems pop up and disappear, seemingly at random.
This is the situation our customer found itself in with one of its popular security software products. After years of market success, the company discovered that customers were using the software in ways it had never predicted, on bigger workloads than it had ever imagined. While it was terrific to see the software used in new and creative ways, the growth created challenges that many different developers helped to address. With so many developers contributing to the code, however, log levels became inconsistent between classes. Over time, false alarms crept in, and the logs grew so verbose that it became increasingly difficult to tell what was actually happening. As a result, engineers were spending far more time than they should have wading through logs to troubleshoot.
The development team decided things needed to change and turned to NTT DATA Services to help wrangle their logs into a manageable herd.
Our Site Reliability Engineering (SRE) team started by reviewing logs and talking with the team members responsible for log monitoring and root cause analysis. The team quickly found several inconsistencies in how developers added logging to the application, especially when it came to log levels. Another major problem was that the logger’s threshold was set very low, so large amounts of routine, low-value information were written to the log files. Sifting through this noise added significant time to the log management process.
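The article doesn’t name the product’s language or logging framework, so as a minimal sketch, the snippet below uses Python’s standard logging module (an assumption, purely for illustration) to show how a low threshold floods the log file and how raising it keeps only actionable entries:

```python
import io
import logging

def emit_sample_logs(logger):
    # The same four events most applications produce, at different severities.
    logger.debug("cache refreshed")        # routine internal detail
    logger.info("user session started")    # routine business event
    logger.warning("retrying connection")  # worth attention
    logger.error("scan failed")            # actionable problem

def capture_at_level(level):
    """Return the log lines produced when the logger threshold is `level`."""
    stream = io.StringIO()
    logger = logging.getLogger(f"demo.{level}")
    logger.setLevel(level)
    logger.propagate = False  # keep the demo output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    emit_sample_logs(logger)
    logger.removeHandler(handler)
    return stream.getvalue().splitlines()

# A DEBUG threshold records everything, including routine noise;
# a WARNING threshold keeps only the entries an engineer must act on.
verbose = capture_at_level(logging.DEBUG)    # all 4 lines
focused = capture_at_level(logging.WARNING)  # only the last 2 lines
```

The same principle applies in any logging framework: the threshold, not the individual log statements, decides how much an engineer has to wade through later.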