Saturday, February 27, 2016

Picking a simpler approach to aggregate data in Clojure

These days I use Clojure for only one thing: talking to our Jira to extract statistics for analysis. Even this still brings plenty of opportunities for discoveries and revelations. My most recent task was to get data on bug-fixing activity over some period of time and store it as a table for further analysis in Excel.

I wanted to transform a list of Jira items into a table that shows how many items of each severity each team member fixed on a given day. I already had a way to talk to Jira, so the key part of the task was to aggregate the list into a table with three attributes and one numeric value: the count of items holding that combination of attributes. Aggregation is easily done with reduce, so I only needed to choose the form of the result. My first natural response to a problem of this kind is to assemble a structure of nested dictionaries with attribute values as keys and summed item counts as leaf values, something like this:

{ "2016-02-10"
    { "Ivan Petrov"  { "Major" 1 "Normal" 2 }
      "John Stone" { "Critical" 1 "Normal" 1 }}}

It turns out that Clojure 1.7.0 offers the update-in function, which works very well with nested maps. It takes a map, a sequence of keys and a function. It retrieves the value currently stored in the nested map under that sequence of keys, applies the function to that value and stores the result back under the same keys. Thus the transformation of the list of items into nested maps holding aggregated values looks like this:

(defn resolved-bugs-stats [issues]
    (reduce
        (fn [report {date :resolutiondate assignee :assignee severity :severity}]
            (update-in report [date assignee severity] (fnil inc 0)))
        {}
        issues))

This piece of code yields the figures that I want. The only step left was to transform the result into a sequence of rows, and that took me much more thought. Although I did find a solution, I also realized that I did not need the nested structure at all. (That is, the availability of update-in turned out to be a misfortune.)
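
For illustration, here is a small sketch of what the nested result looks like on made-up data and of one way the flattening step could be written, using a three-level for comprehension. It is not necessarily the solution mentioned above; the names sample-issues, nested-report and nested->rows are hypothetical, and the sample data is invented.

```clojure
;; Hypothetical sample data; real Jira items carry many more fields.
(def sample-issues
  [{:resolutiondate "2016-02-10" :assignee "Ivan Petrov" :severity "Major"}
   {:resolutiondate "2016-02-10" :assignee "Ivan Petrov" :severity "Normal"}
   {:resolutiondate "2016-02-10" :assignee "Ivan Petrov" :severity "Normal"}
   {:resolutiondate "2016-02-10" :assignee "John Stone"  :severity "Critical"}
   {:resolutiondate "2016-02-10" :assignee "John Stone"  :severity "Normal"}])

;; The same reduction as above, inlined so the snippet runs on its own.
;; update-in creates missing intermediate maps, and (fnil inc 0) treats a
;; missing count (nil) as 0 before incrementing.
(def nested-report
  (reduce
    (fn [report {date :resolutiondate assignee :assignee severity :severity}]
      (update-in report [date assignee severity] (fnil inc 0)))
    {}
    sample-issues))

;; Hypothetical helper: walks all three levels of the nested map and
;; emits one [date assignee severity count] row per leaf value.
(defn nested->rows [report]
  (for [[date by-assignee]     report
        [assignee by-severity] by-assignee
        [severity n]           by-severity]
    [date assignee severity n]))
```

Calling (nested->rows nested-report) yields rows such as ["2016-02-10" "Ivan Petrov" "Normal" 2]. The order of the rows follows map iteration order and should not be relied on.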

The essence of my revelation was very simple: why would I build a map of maps of maps if I need a list in the end? Since Clojure works perfectly well with vectors as map keys, I could just use a composite key and go with a one-level hashmap. This single level is still required if I want an easy way to sum up the count of items with certain attribute values, but unlike a nested map it transforms very easily into a simple table. Here is the new function. It looks almost the same but produces a simpler result and, more importantly, makes the code that uses it much cleaner:

(defn resolved-bugs-stats [issues]
    (reduce
        (fn [report {date :resolutiondate assignee :assignee severity :severity}]
            (update report [date assignee severity] (fnil inc 0)))
        {}
        issues))

; generates a result in the form:
{
    ["2016-02-10" "Ivan Petrov"  "Major"]   1
    ["2016-02-10" "Ivan Petrov"  "Normal"]  2
    ["2016-02-10" "John Stone" "Critical"]  1
    ["2016-02-10" "John Stone" "Normal"]    1 }
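
To show why the flat form is easier to consume, here is a minimal sketch of turning it into table rows and then into CSV text that Excel can open. The helper names report->rows and rows->csv are hypothetical, and the CSV rendering is deliberately naive: it assumes that no field contains a comma, a quote or a line break.

```clojure
(require '[clojure.string :as str])

;; The flat report in the form shown above.
(def flat-report
  {["2016-02-10" "Ivan Petrov" "Major"]   1
   ["2016-02-10" "Ivan Petrov" "Normal"]  2
   ["2016-02-10" "John Stone" "Critical"] 1
   ["2016-02-10" "John Stone" "Normal"]   1})

;; Hypothetical helper: every map entry is a [key count] pair and the key
;; is already a vector, so appending the count gives a complete row.
(defn report->rows [report]
  (map (fn [[k n]] (conj k n)) report))

;; Hypothetical helper: naive CSV rendering, one line per row.
;; Assumes fields need no quoting or escaping.
(defn rows->csv [rows]
  (str/join "\n" (map #(str/join "," %) rows)))
```

With these in place, (spit "report.csv" (rows->csv (report->rows flat-report))) would write the table to a file; the file name is only an example.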

While modern programming languages offer powerful tools that make complicated solutions feasible and cheap, there are few cases where the form of intermediate data structures needs to be significantly more complex than the form of the desired result. Simply keeping this idea in mind and evaluating our solutions against it may help us avoid some of the excessive complexity that we introduce when building systems.