A few weeks ago, I was asked to review a pull requests and amongst some of the suggestions I made was that I would really recommend adding more logging. It was only later when a member of the team asked me in slack what exactly I meant by that when I realised, I should have perhaps been more explicit in my suggestion. I think the question that was raised was –
“How much logging should I add?”
The now cliched “Well, it depends” comes to mind but for me the two core questions that I ask myself while I’m developing something are –
- If something unexpected happened while the program was running, do I have enough logging in place for me to diagnose the issue efficiently and effectively?
- During the execution of the program, do I have enough logging in place such that by examining the logs it will satisfy me that the program is executing in the manner that I expect?
I think it’s also important to also differentiate between the two concepts of logs and metrics.
I like to think of logs as a series of events that are written as execution takes place. These events will also contain one of more pieces of information about the event. For example –
2020-10-10 07:43:17,581 26280 [INFO ] - [NuGet] Attempting to resolve dependency 'visualstudio-installer (= 2.0.0)'.
The example above shows a fair amount of information, a timestamp, the type of log (
Info in this case), the process logging the information (NuGet), a brief generic description of the event and some specific information upon which the event is occurring.
Metrics on the other hand are things you can count e.g. how many times has a specific event occurred? how long did it take? how many of these <messages / files / records etc> are you about to process?
Two of the major benefits about metrics is that you can both track them and visualise them. So, if you release a new version of your software and feedback is that it feels slower (and you have metrics around performance) you now have data to compare this and maybe even begin to isolate the issue.
So, both logs and metrics are important, you can of course derive metrics from logs but it’s perhaps not as efficient as recording (and aggregating) the metric itself.
In terms of methodology, I wouldn’t say that I’ve progressed to full Log-Driven Development but tend to rather iterate over the feature I’m developing adding and adjusting logging as I go. What I’ve found is after deployment the log level usually needs to be adjusted. For example, running the system locally with a log level of ‘
Info‘ is usually fine, put a production load through it and it can become a bit too noisy. It’s all a bit of a balancing act between keeping log volumes under control and having the correct log levels configured so that if something does go wrong you have as much useful information as you possibly can.
One thing I read about a while back was a technique where logs are buffered in memory and on a critical failure these are flushed to the disk, giving you all the necessary log context while not suffering huge amounts of logs being written to disk. I haven’t had a requirement to do this but I believe it’s called a ring buffer. The option to switch the log level dynamically is also an option here but it would really be ‘after the fact’ and the same issue would need to occur to capture the log events (perhaps there are additional ‘
Debug‘ level events that can be written). I’ve described one technique to do this in one of my earlier posts.
Also, I think for me, the care taken to implement high-quality logging is an indication of a high-quality application. It shows that time and thought have gone into the implementation and given consideration to the maintainers and engineers that support the application in production. I think it also leads onto another more general question that I tend to ask myself before handing this over to someone else or perhaps another team –
If I was given this application to run on a day to day basis in Production, how confident am I that I could resolve an issue in the early hours of the morning when no one else is about?
Logging is of course only one part of what makes up a robust and supportable application but for me, it’s probably right up there as one of the most important.
So how much logging do I add? As much as I can.