There are at least three reasons why we want to log messages: program correctness, performance, and application event monitoring. Each category of log message has a different audience: software engineers care about program correctness; dev-ops engineers and administrators care about performance; and product managers and support engineers care about application events. This document discusses how to address each concern. It also makes the case for making logs easy to read by programs, not just humans.
Why do we log things in online applications? It seems like the question should have an obvious answer: to know what happened, obviously. But I assert our intent falls into at least three different categories: tracing (for program correctness), telemetry (for performance), and application domain events (for things that are meaningful to the customer).
In each of these cases, there is benefit to having an automated system monitoring logs. But the specific type of system monitoring the logs may be different for each category.
In the tracing case, maybe it's an IDE (Integrated Development Environment) reading the logs, attempting to reconstruct what happened in the program.
Much has been written already about making telemetry observable [1] [2]. Telemetry data is typically fed into a system which can (at least) warn human operators something is going wrong and (hopefully) take some basic actions to avert disaster.
As for the "application domain" category, these are log messages which assert something meaningful to the customer has occurred. I sometimes call these types of logging messages entropic because of their irreversible nature. You've either processed a request or you haven't; and once a request is processed, it's hard to unprocess. It's a one-way operation.
There's some overlap between telemetry and app domain log messages. I like to think of "telemetry" as being focused on the performance of the systems which implement the application, while "app domain" log messages are focused on events that are meaningful to the customer. Things like: "we received your request", "we updated the state of the application in response to your request" or "we think we're finished with this request."
App domain messages might be useful to produce Service Level Agreement (SLA) or Operational Level Agreement (OLA) metrics. But if you look at the list of Common Metrics in Wikipedia's "Service Level Agreement" article [3], you'll see metrics that could fall into either the "telemetry" or "application domain" categories. When I try to decide whether a log message is a telemetry or app domain message, I think about its meaningfulness to the customer.
Another way to think about it is to ask if a particular log message tells you about the performance of an underlying resource (telemetry) or whether the recorded event moves a customer's business process into or towards a subsequent state. If the "business side" of the organization cares about it, it's probably an application domain metric. If the "tech side" of the organization cares about it, it's probably telemetry.
Larry Wall famously quipped "computer languages differ not so much in what they make possible, but in what they make easy." [4] We can borrow that quote to produce this bit of wisdom: "logging systems differ not so much in what they log, but in how easy it is to read the logs." If a log message can't readily be consumed, why emit it at all?
Consider this log excerpt:
2021-07-06T09:25:17.813Z DEBUG Entering foo() to process request d44dcf8c-df41-11eb-9186-6723ffa51662 with new state 16
2021-07-06T09:25:18.501Z DEBUG Current state is 14.
2021-07-06T09:25:18.556Z DEBUG Illegal state detected
This tells you (the human) everything you need to know. But it's somewhat hard to parse. If you wanted to make a tool that searched through the logs to find requests with illegal states, you would have to know about the specific text in these messages. If someone ever changed the text of the log message, you would have to update your log search tool.
But if we had logs that looked like this, we could more easily process the log with a program:
2021-07-06T09:25:17.813Z DEBUG foo_bar.js:56 ENTER {"function":"foo","id":"d44dcf8c-df41-11eb-9186-6723ffa51662","new_state":16}
2021-07-06T09:25:18.501Z DEBUG foo_bar.js:68 CURSTATE {"current":14}
2021-07-06T09:25:18.556Z DEBUG foo_bar.js:70 ILLSTATE
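For instance, here is a minimal Node.JS sketch of a tool that finds illegal states by keying on the ILLSTATE tag rather than on the message text. The file name app.log and the exact line layout are assumptions made for illustration:

    // find_illstate.js -- a minimal sketch, not a real tool. It assumes each line
    // follows the "<timestamp> <level> <file:line> <TAG> <json>" layout shown
    // above and that the log has been written to a local file named app.log.
    const fs = require('fs');

    const LINE_RE = /^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)(?:\s+(\{.*\}))?$/;

    function parseLine(line) {
      const m = LINE_RE.exec(line.trim());
      if (!m) return null; // skip lines that don't match the expected layout
      const [, timestamp, level, location, tag, json] = m;
      return { timestamp, level, location, tag, data: json ? JSON.parse(json) : {} };
    }

    const entries = fs.readFileSync('app.log', 'utf8')
      .split('\n')
      .filter(Boolean)
      .map(parseLine)
      .filter(Boolean);

    // Keying on the tag, not the message text, means rewording the
    // human-readable part of a message won't break this tool.
    for (const entry of entries) {
      if (entry.tag === 'ILLSTATE') {
        console.log(`${entry.timestamp} illegal state at ${entry.location}`);
      }
    }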
Human-readable logs are important. The structured excerpt is reasonably easy to read as it is, but maybe this is slightly better:
2021-07-06T09:25:17.813Z DEBUG foo_bar.js:56 [ ENTER, "Entering function to process request", {"function":"foo","request_id":"d44dcf8c-df41-11eb-9186-6723ffa51662","new_state":16} ]
2021-07-06T09:25:18.501Z DEBUG foo_bar.js:68 [ CURSTATE, "Current state", {"state":14} ]
2021-07-06T09:25:18.556Z DEBUG foo_bar.js:70 [ ILLSTATE, "Illegal state transition detected" ]
The precise format is not as important as choosing a format which can be parsed easily by both humans and machines. For inspiration, consider Antlog [5], a Node.JS compatible package that (among other things) makes it easier to produce easily parsable log messages.
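I can't speak for Antlog's exact API here, but a hypothetical helper along these lines shows how little code it takes to emit that bracketed style. The function name logDebug and the explicit location argument are inventions for the sketch; a real package would derive the source location automatically:

    // A hypothetical sketch -- this is not Antlog's API. A real logging package
    // would derive the source location itself; it's passed in explicitly here
    // to keep the sketch short.
    function logDebug(location, tag, message, data) {
      const payload = data === undefined ? [tag, message] : [tag, message, data];
      // JSON.stringify quotes the tag as well, which keeps the payload valid JSON.
      console.log(`${new Date().toISOString()} DEBUG ${location} ${JSON.stringify(payload)}`);
    }

    // Usage, mirroring the excerpt above:
    logDebug('foo_bar.js:56', 'ENTER', 'Entering function to process request', {
      function: 'foo',
      request_id: 'd44dcf8c-df41-11eb-9186-6723ffa51662',
      new_state: 16,
    });
    logDebug('foo_bar.js:70', 'ILLSTATE', 'Illegal state transition detected');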
Personally, I much prefer gdb to a list of printf's. Yes, it takes time to learn how to use a "real debugger" -- but it's worth it. The flexibility you get from a "real" debugger justifies the effort in learning how to use it. "Real" debuggers are available for most (if not all) common platforms, even Node.JS Lambdas on AWS. (See the "AWS Toolkit for JetBrains" [6])
But debugging isn't the whole "program correctness" story.
"Tracing" a program is a technique used extensively in embedded programming. Time-sensitive embedded software often can't wait for a programmer to single-step through a program. Specific aspects of the embedded program's state are stored in a buffer as the program executes. The buffer is then transferred to a development system and examined after-the-fact, looking for incorrect behaviour.
GDB, for instance, allows developers to record the state of an application and play it back later, even allowing the programmer to step backwards in time from within the debugger. [7] Tools like Percepio's Tracealyzer give you an interactive graphical environment for finding bugs and spotting good or bad execution patterns. [8]
Tools for tracing AWS Lambdas are comparatively primitive, however. AWS X-Ray [9] and various OpenTracing tools [10] focus on performance, not correctness.
But to make use of tracing tools in online applications, you have to instrument your code. It should be straightforward to instrument your code to emit DEBUG log messages, then use a tool to collect, display, and analyze the resulting data.
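As a rough illustration (the names trace, flushTrace, and MAX_RECORDS are mine, not from any particular library), an embedded-style trace buffer adapted to Node.JS might look like this:

    // A minimal sketch of an embedded-style trace buffer, adapted to Node.JS.
    const MAX_RECORDS = 1000;
    const buffer = [];

    // Record a small, structured snapshot of program state. Plain objects mean
    // a later tool can analyze the trace without parsing free-form text.
    function trace(tag, data) {
      if (buffer.length >= MAX_RECORDS) buffer.shift(); // keep only the most recent records
      buffer.push({ t: Date.now(), tag, data });
    }

    // Dump the buffer as DEBUG log lines (say, at the end of a Lambda
    // invocation) so a collector can pick them up for after-the-fact analysis.
    function flushTrace() {
      for (const rec of buffer) {
        console.log(`${new Date(rec.t).toISOString()} DEBUG TRACE ${JSON.stringify({ tag: rec.tag, data: rec.data })}`);
      }
      buffer.length = 0;
    }

    // Instrumented code just drops breadcrumbs as it runs:
    trace('ENTER', { function: 'foo', new_state: 16 });
    trace('CURSTATE', { current: 14 });
    flushTrace();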
Building a general-purpose tool to ensure program correctness exclusively from DEBUG logs is not simple. How would the program checker know whether the program was behaving correctly unless you wrote a description of the correct behaviour? Languages like TLA+ [11] or Kami [12] are used to describe the state of software and hardware and to verify that programs and circuits move from known states to other, correct known states. But to use them properly requires a fair amount of study.
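To make the point concrete, here is a toy checker. The transition table and state numbers are made up for illustration, and this is in no way a substitute for TLA+ or Kami; the point is that even this trivial tool needs an explicit description of which transitions are legal before it can flag the 14-to-16 jump from the earlier excerpt:

    // A toy sketch: even a trivial correctness checker needs an explicit
    // description of correct behaviour. The transition table below is invented
    // for illustration.
    const ALLOWED_TRANSITIONS = {
      14: [15],
      15: [16],
      16: [17],
    };

    // Given the sequence of states observed for one request (extracted from
    // structured DEBUG logs), report every transition the table doesn't allow.
    function checkTransitions(states) {
      const violations = [];
      for (let i = 1; i < states.length; i++) {
        const from = states[i - 1];
        const to = states[i];
        if (!(ALLOWED_TRANSITIONS[from] || []).includes(to)) {
          violations.push({ from, to });
        }
      }
      return violations;
    }

    console.log(checkTransitions([14, 15, 16])); // [] -- every step is legal
    console.log(checkTransitions([14, 16]));     // [ { from: 14, to: 16 } ] -- the excerpt's illegal jump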