Saturday, 6 May 2017

Histograms and data driven aggregation

Unavoidably, once you start taking your work seriously as 'data science' you have to do hypothesis testing. And to do hypothesis testing you need to know the distribution of your data. And the most intuitive way to see the distribution of your data is to plot a histogram.

So in that context, we have a go at plotting a histogram of a field in our data. The advice of our 'big data' provider is - you guessed it - pull data from Hive into a Spark data frame, do some operations, convert to RDD, do some more operations. I'm too lazy for all that so digging around I found that Hive has a histogram function. You might not like the idea as it returns an array of structures that contain the bin centres and the respective frequencies, and it uses some funky binary delimiters, different for the struct fields, the array elements and of course the fields returned by the query. This is complicated enough to merit its own post which I promise to do in the future, but in my book still preferred: No need to do 20 (or 50?) lines of configuration and functional programming where a SQL one-liner would do.

Anyway, having done that I was looking at another field for which we also needed a histogram, and realised that it is really a discete measurement, it was a number somewhere between 40 and 80 that only came with .00, .25, .50 and .75 in the decimal places. Maybe an unsigned 8 bit quantity at the point of measurement/analog to digital conversion? Anyway, that means that to do a histogram you can avoid the binning all together, the data is pretty much 'pre-binned'. Instead it becomes more like the first classic example of any Map Reduce related tutorial: a word count.  How many times does e.g. 50.25 appear in the data, and likewise for all values.

Knowing your data can always save time and effort when trying to analyse it. A key reason to like Tableau is the way it allows you to learn what your dataset looks like and explore it from all possible sides. I have to confess though, the final dataset was simple enough for the histogram to be done as a pivot-chart in Excel!

No comments:

Post a Comment