It’s one of those days again. You are tasked with finding the root cause of an intermittent production problem in a web application. To be able to scale out on demand, QueueMetrics Live, our hosted call-center analytics platform for Asterisk PBXs, is built as a cluster of different services that talk to each other over HTTP - so, as chance would have it, finding what went wrong and how will begin (and often end) with looking at some web logs.

Web logs are produced by a plethora of different systems - not just web servers, but reverse proxies, user-facing web applications, API servers, databases - you name ‘em. And there is one thing they all have in common: they are all slightly different.

All web logs are loosely derived from the old Apache log format, but every application customizes them somehow. And as you don’t really know in advance what you are looking for, you need a tool that maximises flexibility and quick interaction over brute parsing efficiency. A Clojure REPL is very well suited for the job.

Enter the REPL

The Clojure REPL is a very handy tool for exploratory, ad-hoc analysis. By writing a minimal set of common functions, you can easily import data and then use the full power of Clojure to filter, aggregate and correlate.

You can run these examples in a “pure” REPL, the kind you get when running

lein repl

but in practice we find it more useful to either use a GUI that saves your intermediate results, or to use Gorilla REPL for a nice browser-based experience.

Some building blocks

We start by defining where our logs are stored, to avoid repeating ourselves:

(def logfile "./p018_access_log.2017-09-12.txt")

The first Lego block we build is a function to read a log file into a vector of lines. We have a single-arity call that reads all of the file, and a 3-arity call that lets us read a ‘page’ - a fixed number of lines starting from a given line offset:

(defn log-slurp
  "Reads a log-file into a vector of lines for parsing."
  ([fname]
   (log-slurp fname 0 999999999))

  ([fname from nlines]
   ;; with-open closes the reader once the page of lines is realized
   (with-open [rdr (clojure.java.io/reader fname)]
     (into []
           (drop from
                 (take (+ from nlines)
                       (line-seq rdr)))))))
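
As a quick smoke test - assuming the file is in place and has at least a few lines - we can grab a small page before writing any parser:

> (count (log-slurp logfile 0 3))
3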

Our second Lego block will be a record to store our data. You may - or, more likely, may not - need all the fields in the record, but we found it handy to have a single structure with consistent names that we can recycle across analyses.

While we are at it, we will also create a simple function that parses strings as integers and handles the common case of “-” standing in for zero-length responses:

(defrecord LogRecord
  [ip date time method uri referrer status useragent bytes])

(defn ->int
  "Parses the leading integer in s; returns 0 when there is none,
   e.g. for the \"-\" placeholder."
  [s]
  (try
    (Integer/parseInt (re-find #"\A-?\d+" s))
    (catch Exception e 0)))
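
A quick check at the REPL shows the behaviour we want, both for regular sizes and for the “-” placeholder:

> (->int "136")
136
> (->int "-")
0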

Now for the parsers themselves. The basis of a parser is a regular expression that captures the fields we care about, plus a function that converts a line into a LogRecord, or returns nil on failure. For example, meet parser A:

(def log-pattern-A #"^(\S+) \S+ \S+ \[(../.../....):(..:..:..).+?\] \"(\S+?) (\S+?) .+\" (\d+) (\S+)")

(defn log-parser-A [line]
  (when-let [[_ i d t m u s b] (re-find log-pattern-A line)]
    (->LogRecord i d t (keyword m) u "" (keyword s) "" (->int b))))
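
To see it in action, we can feed it a line shaped like the ones in our log (this sample line is made up for illustration):

> (log-parser-A "10.0.25.232 - - [12/Sep/2017:00:00:06 +0200] \"POST /av/jsonQLoaderApi.do HTTP/1.1\" 200 136")
{:ip "10.0.25.232", :date "12/Sep/2017", :time "00:00:06", :method :POST, :uri "/av/jsonQLoaderApi.do", :referrer "", :status :200, :useragent "", :bytes 136}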

As the pattern is so important, we find it easier to keep it in a separate var. You’ll notice that we don’t really fill all fields in the record - which ones we fill depends on the log data we are processing.

By the way, tracking parse failures is doubly important: it means we can chain multiple parsers and, at the same time, keep a list of the lines that were not parsed successfully - as there is one output for each input line, the lines that sit in the same position as a nil are parse errors.
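
We have only shown parser A; any further parsers follow the same shape. As a sketch - assuming a second log source in Apache “combined” style, which appends a quoted referrer and user agent - a hypothetical parser B could look like this:

(def log-pattern-B #"^(\S+) \S+ \S+ \[(../.../....):(..:..:..).+?\] \"(\S+?) (\S+?) .+\" (\d+) (\S+) \"([^\"]*)\" \"([^\"]*)\"")

(defn log-parser-B [line]
  (when-let [[_ i d t m u s b rf ua] (re-find log-pattern-B line)]
    (->LogRecord i d t (keyword m) u rf (keyword s) ua (->int b))))

Note that when one pattern is a prefix of another, as here, the more specific parser should be tried first in the composition that follows.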

So we can compose all calls into a single parser that tries them all in sequence and returns a LogRecord:

(defn log-parser [line]
  (or
    (log-parser-A line)
    (log-parser-B line) ))

Using more than one parser is very handy, especially when a logger goes nuts and you end up with malformed lines.

Now for the meat of our process: given a log file, we parse it into a sequence of “correct” records and a sequence of unparsable lines.

(defn logparse
  ([file]
   (logparse file 0 999999999))

  ([file from nrows]
   (let [rows     (log-slurp file from nrows)
         records  (map log-parser rows)
         errlines (map #(when (nil? %1) %2) records rows)]

     {:RECS (filterv some? records)
      :ERRS (filterv some? errlines)})))
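
The positional trick is easy to see in miniature: records and rows stay aligned, so every nil record marks its source row as an error:

> (map #(when (nil? %1) %2) [:rec nil :rec] ["line a" "line b" "line c"])
(nil "line b" nil)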

So far, we can basically copy and paste all functions above into the REPL as they are.

Exploring

As this is an exploratory analysis, it is very important to see which lines are unparsable and to adapt the regexps in our parser - or create new parsers to process different line formats.

We start by parsing only the first few lines of our file:

> (logparse logfile 0 3)

{:RECS [
	{:ip "10.0.25.232", :date "12/Sep/2017", :time "00:00:06", :method :POST, :uri "/av/jsonQLoaderApi.do", :referrer "", :status :200, :useragent "", :bytes 136} 
	{:ip "10.0.25.232", :date "12/Sep/2017", :time "00:00:11", :method :POST, :uri "/av/jsonQLoaderApi.do", :referrer "", :status :200, :useragent "", :bytes 136} 
	{:ip "10.0.25.231", :date "12/Sep/2017", :time "00:00:16", :method :POST, :uri "/av/jsonQLoaderApi.do", :referrer "", :status :200, :useragent "", :bytes 136}], 
 :ERRS [] }

In this case, all went well. If you’re not that lucky, try tweaking the regexps in your parsers until all lines parse. Extend the number of lines as you see fit, until you can parse the whole file and :ERRS stays empty.

The fact that you can just edit the regexp and run the analysis again, without an explicit compile step, is a major time saver: make a change, reload, and see the results immediately.

Analyzing

We can now store the result of our parse in a var, so we don’t have to re-read the file every time:

(def RECORDS (logparse logfile))

Make sure you have no parse errors:

> (count (:ERRS RECORDS))
0

Now, a typical query is made by grouping a set of records by some attribute into a map, and then printing how many items are present in each group. The function below does just that:

(defn count-occurrences
  [[k v]]
  [k (count v)])

For example, if we want to see how many entries got a specific HTTP status:

> (map count-occurrences (group-by :status (:RECS RECORDS)))
([:200 5204] 
 [:302 317])

So we are lucky - no errors above.

Another common task is filtering by some attribute (e.g. finding only the calls that have status :200, or some specific path) and then grouping the results:

> (let [all200s (filter #(= :200 (:status %)) (:RECS RECORDS))]
    (map count-occurrences (group-by :uri all200s)))
(["/av/jsonQLoaderApi.do" 4451] 
 ["/av/sysup.jsp" 749] 
 ["/av/login.jsp" 4])

As you can see, once the data is loaded, grouping and filtering are trivial if you know just a bit of Clojure. And clojure.core is full of powerful functions that will just work for any filtering or transformation task you can conceive.
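
For instance, ranking the per-URI counts we just computed by frequency is one more call to sort-by:

> (let [all200s (filter #(= :200 (:status %)) (:RECS RECORDS))]
    (sort-by second > (map count-occurrences (group-by :uri all200s))))
(["/av/jsonQLoaderApi.do" 4451]
 ["/av/sysup.jsp" 749]
 ["/av/login.jsp" 4])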

Lessons learned

Clojure is a language that is especially well suited for interactive, ad-hoc analysis. While you’ll have to write some minimal plumbing, you have a full programming language at your fingertips, with sophisticated functions for filtering, merging, joining and whatnot all included in its standard library.

The second thing that you’ll notice, compared to writing a small program in another language, is that in the REPL you do not really have to worry about errors. If you make a mistake, or something bad happens - well, you fix it and move on. Very little defensive programming is needed.

The third noticeable thing is how much mental effort you save because you do not have to switch between an editor and an execution environment - everything happens in the same place.

The last thing we learned is that, by leveraging a common dictionary of a few loading functions and data formats, we can be very effective at finding out what happened, while still catering to unplanned occurrences.

Support-line blues?

QueueMetrics Live is a premier hosted call-center platform for your Asterisk-based PBX. Ever wondered why people complain about your support line? Now you can know for sure, in 5 minutes, with no hassles. Try it!
