Tuesday, 28 March 2017

Statistics of statistics

In looking at the pie charts of CPD hours we found that the sum of averages was the average of the sums, whereas the sum of medians was much smaller than the median of the sums. This is another way of saying that the average is a linear function, i.e. it is true that

f(ax+by)=af(x)+bf(y)
whereas the median is non linear. This is quite important from a visualisation point of view as pie charts, stacked bar charts, area graphs etc. imply that the sum of the parts is a meaningful quantity, and in the case of non linear aggregations (median, countd) often it isn’t.

In tables Tableau addresses this with ‘grand total’, the aggregation for which doesn’t have to be a sum but could be e.g. an overall median. If you’ve been careful to never imply the parts can be summed but still find your users exporting data and doing sums in excel, adding a table view with a grand total configured to the suitable aggregation can save you from hours of arguing!

Another case of statistics of statistics in Tableau can arise when using Level of Detail Calculations. I used to do this manually by doing counts at ‘exact date’ level of detail, exporting the counts to excel, re-importing the counts to tableau and then finding the weekly maximum of the counts, effectively using Tableau as my data warehouse/ETL tool as well as the analysis and visualisation tool. The emergence of Level of Detail calculations saved me from all this bother, as now I could plot a
max({fixed [date]:count([x])})
against the date to the nearest week.

Of course there are also cases of using data from a proper data warehouse, whether the traditional RDBMS one or Hive. In that case again it is worth being careful to match any aggregation done in Tableau to the aggregation done in the data warehouse. e.g Min([min_x])can’t go wrong, but the averages can be a bit tricky. Say the original SQL was
SELECT avg(x) AS avg_x GROUP BY d
where d1 has 1 record and d2 has 100! Coming then in Tableau to do an avg([avg_x]) is just asking for trouble. Instead modify the SQL to
SELECT avg(x) AS avg_x, count(1) AS n GROUP BY d
and then we can get the proper average in Tableau with
SUM([avg_x]*[n])/SUM([n])