Thursday, 2 July 2015

Logging compendium

Here follow some thoughts on how best to do application logging; they are very Java-centric and primarily based on my experience and various source of information googled along the way.

Implementation

Use log4j2/logback/name-your-modern-logging-framework

As soon as you can, move to a newer logging framework. Modern API/frameworks have higher performance, have a better API and more useful features.
I have settled for log4j2 although I use it via slf4j - for historic reasons. But others may do well too (namely logback!).

Use logOnce technique

When a LOG statement is in a loop and (a) that loop is infinite or (b) has the potential to be running many times, there's a risk to spam the log messages with the same entry on and on.

This could make the file grow indefinitely making it hard to read/analyse; it may also roll over causing loss of important logged information.

The logOnce technique prevents these nasty side effect and it works as following:

  1. define a special logger L that logs a log entry.
  2. log a message within a loop using the logger L, remember the last logged message
  3. next time you need to log on the same logger, the new message with the last logged message.
  4. if matches skip the statement log; if not go back to step 2.
Variations include
  • match new log statements using regex to allow for differences in some dynamic data
  • log every N entry: even if the new line matches the last logged line, log every N occurrences
  • record some metrics - like number of occurrences logged/skipped etc
Examples where you want to use this pattern are:
  • scheduled threads like Heartbeats
  • Log statements occurring in a for/while loop or recursive functions. especially when logs are on high log levels like ERROR or WARN

Other implementation details

  • Logging statements must be side effect free. Besides appending the statement to the configured appender, a LOG statement MUSTN'T change the internal application state.
  • Make sure that a log actually happens: either log BEFORE the action Or try { action } finally { LOG after } to make sure the LOG after always happens.
In a distributed application, make sure that
  • all nodes share the same wall clock - use NTP to sync
  • all applications share the same configuration; or at least the same message pattern layout.

Log levels

Logs *are* a user interface. And, as such, they deserve the right attention to details... (Yeah, right!)...

At a basic level I personally group INFO and above as forming Administration/Support user interfaces to the system runtime logs.

  • ERROR (aliases severe/fatal): for messages that need to be looked at immediately because a transaction or part of the system is in an un-recoverable state (albeit temporary).
  • WARN : for messages that need to be looked at immediately because a transaction or part of the system is went wrong but managed to recover.
  • INFO: descriptive messages about a meaningful system transaction happened. Or for long transactions start/stop.
On the other end, below INFO is for Developers.
  • DEBUG (alias fine): follow system transactions' execution with the intention of reporting the runtime flow.
  • TRACE (alias finer/finest): uber-details - to report application state or transmitted data, etc.
In essence, a support person looking at the logs should be able to determine whether an application transaction has occurred and what the final execution state is just by having INFO on.

Developers should be able to diagnose went on/wrong by looking at the DEBUG trace. That's why it is important to be able to simply turn DEBUG on/off possibly selectively per Logger or group of loggers.

Traces should only be used during development and never in production.

Markers

Experiment with Markers to provide a better partition of logs depending on application specific contexts.

Design

For transactions spawning across several threads in the same process OR different processes in the same host OR different apps in different hosts it's important that
  • Each log entry, in each file reports some ID to trace back the transaction
  • For example an operation starting from a browser client and going via a web services layer, consuming various microservices and/or middleware components should ideally report a single transaction id in all systems.
Provide a diagnostic context where appropriate to propagate internal state of a thread/process across other systems.
  • Speeds up diagnostics because it won't require trawling multiple files, dispersed in various locations
Log a single transaction contextual message in a single LOG line

Separate concerns by delegation and composition.

  • If you're logging latency of code blocks or entry/exit of code blocks, don't clutter the block, rather create a wrapper and use the wrapper (or use AOP)
Log different concerns in different files. For example
  • traffic data should go on a file,
  • application logs on another,
  • performance and metrics on another.
Link log entries with a common transaction ID.

NEVER log the same log entry in two different parts of the code.

NEVER log an entry in multiple lines (caveat informational banners etc)

You should never log at INF/ERR/WRN inside utility function or classes: let the caller log within the context of the application.

Antipattens

Antipattern 1 - Have to log caller class and line number
If you find the need to log the caller's file name and line number, something's wrong with your log OR with your pattern layout. Logging caller file and line number is very inefficient.

Antipattern 2 - Need to write a custom log file parser
If you find the need to write a parser to parse the logs, somethings wrong with your logs or with the way you use logs. Possible causes include:

  • You're logging too much and - for a given transaction - you cant work out what's going on
  • Your logs are too complex to read (Log statement don't have a constant and predictable format or are split in multiple lines - which will make writing the parser hard anyway)
  • You are using log files where you should instead use a proper "inter process communication" mechanism, or a proper auditing mechanism, etc
You should always get away with grep/awk/sed/tail - which are common tools in every platform.

Antipattern 3 - DEBUG (or even TRACE) has to be turned ON in production.
Your logs suck. The system performances will be degraded and you'll spend your days staring at your screen trying to figure out what's going on... the classic needle in the haystack.