Tuesday 28 March 2017

Statistics of statistics



In looking at the pie charts of CPD hours we found that the sum of averages was the average of the sums, whereas the sum of medians was much smaller than the median of the sums. This is another way of saying that the average is a linear function, i.e. it is true that


f(ax+by)=af(x)+bf(y)
whereas the median is non-linear. This matters from a visualisation point of view, as pie charts, stacked bar charts, area graphs etc. imply that the sum of the parts is a meaningful quantity, and in the case of non-linear aggregations (median, COUNTD) it often isn't.
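
To make this concrete, here is a minimal Python sketch with made-up numbers (not the actual CPD data): the per-category means sum to exactly the mean of the member totals, while the per-category medians can sum to far less than the median total.

import statistics

# one row per member, hours split across three hypothetical categories
rows = [(0, 0, 30), (40, 0, 0), (0, 25, 5)]
totals = [sum(r) for r in rows]

# the mean is linear: the sum of the per-category means equals the mean of the totals
print(sum(statistics.mean(col) for col in zip(*rows)), statistics.mean(totals))

# the median is not: here the per-category medians sum to 5, while the median total is 30
print(sum(statistics.median(col) for col in zip(*rows)), statistics.median(totals))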

In tables, Tableau addresses this with the 'grand total', whose aggregation doesn't have to be a sum but can be, say, an overall median. If you've been careful never to imply that the parts can be summed but still find your users exporting data and doing sums in Excel, adding a table view with a grand total configured to use the suitable aggregation can save you hours of arguing!

Another case of statistics of statistics in Tableau arises when using Level of Detail calculations. I used to do this manually: do counts at 'exact date' level of detail, export the counts to Excel, re-import them into Tableau and then find the weekly maximum of the counts, effectively using Tableau as my data warehouse/ETL tool as well as the analysis and visualisation tool. The arrival of Level of Detail calculations saved me all this bother, as now I can plot a
max({fixed [date]:count([x])})
against the date to the nearest week.
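
For anyone without Tableau to hand, a rough pandas equivalent of the same pattern (assuming a hypothetical dataframe with a date column and one row per x) would be:

import pandas as pd

# hypothetical event-level data: one row per x, with a timestamp
df = pd.DataFrame({
    "date": pd.to_datetime(["2017-03-01", "2017-03-01", "2017-03-02", "2017-03-08"]),
    "x": ["a", "b", "c", "d"],
})

# {fixed [date] : count([x])} -- a count per exact date...
daily_counts = df.groupby("date")["x"].count()

# ...then the maximum of those daily counts per week, as in the view
print(daily_counts.resample("W").max())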

Of course there are also cases of using data from a proper data warehouse, whether a traditional RDBMS or Hive. In that case it is again worth being careful to match any aggregation done in Tableau to the aggregation done in the data warehouse. MIN([min_x]), for example, can't go wrong, but averages can be a bit tricky. Say the original SQL was
SELECT d, avg(x) AS avg_x FROM t GROUP BY d
where one value of d has 1 record and another has 100! Coming to Tableau and taking avg([avg_x]) is then just asking for trouble. Instead modify the SQL to
SELECT d, avg(x) AS avg_x, count(1) AS n FROM t GROUP BY d
and then we can get the proper average in Tableau with
SUM([avg_x]*[n])/SUM([n])
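
The difference is easy to demonstrate in a few lines of Python with made-up numbers, say one group of 1 record and one of 100:

import statistics

x_d1 = [10]        # group with 1 record, avg 10
x_d2 = [20] * 100  # group with 100 records, avg 20

avg_x = [statistics.mean(x_d1), statistics.mean(x_d2)]
n = [len(x_d1), len(x_d2)]

print(statistics.mean(avg_x))                         # 15.0 -- the naive average of averages
print(sum(a * w for a, w in zip(avg_x, n)) / sum(n))  # ~19.9 -- the weighted average
print(statistics.mean(x_d1 + x_d2))                   # ~19.9 -- the true average over all records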

Monday 27 March 2017

Politicians for analytics and analytics for politicians



MP Daniel Zeichner (Labour) has set up a data analytics all-party parliamentary group. The membership of the group includes Labour and Conservative MPs, a token Lib Dem and an ex-hereditary crossbench lord. The SNP, a far more important player in this parliament than the Lib Dems, is not represented.
On the other side of the political mainstream, David Willetts, as a Universities and Science minister in the coalition a few years ago, identified Big Data as one of the eight great technologies that could guarantee the future growth of the British economy, along with space, robotics and autonomous systems, synthetic biology, regenerative medicine, agri-science, advanced materials and energy. Space was probably closer to his heart, and he retired to a position on the board of satellite maker SSTL.
Of course data analytics has been at the heart of British political debate for a while now: the fixation with quantifying performance and ranking schools and hospitals has been central to any discussion on education and the NHS, the defining issues for the Blair and Cameron governments respectively. The Blair government was pivotal with its ideas about evidence-based policy for our supposedly post-ideological times.
We've already seen the increasing importance of data in local government, tied to the emergence of 'Smart Cities', while data mined from mobile telecoms is used for things like transport planning. For that most political of issues, data protection, the growing contradictions between the incoming EU GDPR and the UK's snoopers' charter might be partially resolved by Brexit, though the UK, in its typical way, might be quite relaxed about following EU regulations while still obliged to do so, and the media will play its part in making snooping palatable to the voters.

Monday 20 March 2017

data @peterborough.gov.uk

Following on from the talk by a Peterborough council employee at Big Data World, I had a look for their open data. They have an impressive portal at http://data.peterborough.gov.uk/ ; the platform in use (Data Share) is developed by the London Borough of Redbridge.

I chanced upon a topical dataset on the split of social and affordable housing allocations between British nationals, Europeans and other foreigners. This is of special interest not only because the issue of EU citizens' benefits has been at the centre of the Brexit debate, but also because Peterborough has been singled out in a couple of newspaper articles as a place that helps one understand the Brexit vote (already from the last general election, following the UKIP referendum campaign, and in the aftermath).

Data Share has a funny start page where you follow one link to view data and another to download it, each leading to categories of datasets, and each category's link to the list of datasets in the category. That's quite a lot of clicking through, especially if you first view and then decide to download. Thankfully there is actually a download button in the viewing area.

Viewing the data shows a table but offers some other options; the interface reminds me a bit of the built-in visualisations in the Zeppelin notebook.
It doesn't live up to that promise though: try any view other than the table and what you get is a visualisation of the number of records in the table per value of the dimension selected for the category axis, not even fitting on one page and not in chronological order either!
It does look like the Peterborough data and the Data Share platform haven't been tuned to work with each other when it comes to visualisation. Changing the dimension for the category axis is even more revealing:
So this is a bit like the modal value of my CPD hours :) . I have a visualisation telling me that there are two records/quarters in the dataset for which 197 houses were allocated to British nationals, as opposed to only one quarter for which 200 houses were allocated. This might interest a conspiracy theorist with numerology fixations, but it is far from insightful. Downloading the data and playing with it even in Excel can get you a bit further, though the 'Apr-Jun 10' format is not great. Instead we load it into Tableau, split the period and expand the year to four digits:
'20'+TRIM( SPLIT( [Period], " ", 2 ) )
We also pivot the various nationalities to give a more Tableau-friendly format.
The remaining issue is the occasional 0-3 value, which is presumably there to 'anonymise' the dataset rather than single out the hypothetical one Czech family that got a council house in a particular quarter of a particular year. Change the data type to Number (whole), which is fine with all the other values and gives null for the 0-3 entries, and then add another calculation:
ifnull([Pivot Field Values],3)
I could have used ZN and turned them into zeros, but if the hypothesis we are investigating is 'the bad immigrants take all the houses' we need to take the 'worst' case scenario. Now the total number of houses allocated fluctuates, so looking at percent of total allocations with the table calculation computed at the cell level (i.e. percent of the total for the quarter) we can see how the relative percentages of nationalities fluctuate. As the nationalities are already partially grouped, I group the Eastern Europeans in with the rest of the EEA nationals. I also stick in one crucial number from the UKIP campaign Grauniad article: 79.4% of Peterborough's population were born in the UK. Of course there are plenty of British nationals not born in the UK, but let's use the numbers we have at hand.
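
For what it's worth, the same preparation can be sketched in a few lines of pandas; the column names and values below are hypothetical stand-ins for whatever the download actually uses:

import pandas as pd

# made-up rows loosely mimicking the download: one row per quarter, one column per nationality group
raw = pd.DataFrame({
    "Period": ["Apr-Jun 10", "Jul-Sep 10"],
    "British": [197, 200],
    "EEA": [30, "0-3"],
    "Other": [12, 15],
})

# split the period and expand the two-digit year, as in the Tableau calculation
raw[["Quarter", "Year"]] = raw["Period"].str.split(" ", expand=True)
raw["Year"] = "20" + raw["Year"]

# pivot the nationality columns into a long, Tableau-friendly shape
long = raw.melt(id_vars=["Period", "Quarter", "Year"], var_name="Nationality", value_name="Allocations")

# worst-case handling of the suppressed 0-3 values: coerce to numbers, then fill the nulls with 3
long["Allocations"] = pd.to_numeric(long["Allocations"], errors="coerce").fillna(3)

# percent of the total for each quarter, i.e. the 'cell level' table calculation
long["PctOfQuarter"] = 100 * long["Allocations"] / long.groupby("Period")["Allocations"].transform("sum")
print(long)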

The end result shows that the percentage of British nationals getting social and affordable housing hovers just under the percentage of British-born people in Peterborough. I would guess that if you are rich, or at least well off, in Peterborough you are more likely to be British (an assumption that wouldn't necessarily hold in London), so what we really need is the nationality breakdown of people who can't afford market rent in Peterborough. But in any case it shows that things are not as bad as the benefit tourist story would have you think. After all, many of the EEA nationals in council housing could be working in the warehouses mentioned in the more recent Grauniad article. Which boils the question down to why the British government does not enforce a living wage more strictly, if benefit seekers are such a drain on the national finances.

Sunday 19 March 2017

EU data protection compliance

As promised, here's a look at where the data-protection-compliant countries lie on the map. The geospatial insight? They could be anywhere in the world, unless they are a tiny island state/dependency, in which case they have to be within about 10 degrees east or west of the Greenwich meridian!



Saturday 18 March 2017

Pie chart epic fail

I've just submitted my continuous professional development records to the professional body I'm a member of. They use a rather clunky data entry web form, and once everything goes through you get some dashboards (of course!). The code behind the dashboard generation is quite inflexible: it really couldn't believe I would give it zero for the red, blue and green metrics, so it still zooms in on them in a second pie chart, where the three zeroes take a third of the pie each. Is there an official name for this shaded pie chart? Soup bowl perhaps?
My recorded CPD hours can be compared to those of the whole set of members in the same membership category as me, whose results come in pie charts.

set mean
set median
Now something interesting is happening here, as the median of blue and green is 0 while the mean is non-zero. This means at least half the members in the set have done zero blue and zero green hours, much like myself. Also, hilariously, the sum of the median hours is less than the obligatory 30. This goes to show why it is such a bad idea to make a pie chart of medians, as a pie chart implies that the total for the whole pie has some real meaning.

Now on to the bar chart showing all categories, with a choice of metric between Mode, Mean and Median. Mode is a really odd choice for what is essentially a continuous metric, and sure enough, by selecting it I get an error:
One or more CPD Hours sets has no single modal value
The means and the medians are shown below:
Mean
Median
The meagre 30.3 is me. The dark green has the same mean and median (124), so I would hazard a guess that it is a set of only one member, especially given that the mean is a round number. It's interesting that for the other sets the median is just over half the mean, but at this point we know they might also have very few members with values all over the place. Or an uber-keen minority in each set is skewing the mean upwards.

The blue is my set. The mean of 107.2 matches the total of the set average pie chart. That is not the case with the median, but of course the sum of medians is not the median of the sums, whereas the sum of averages is the average of the sums. The fact that the sum of medians is actually much smaller than the median of sums shows us that most members of the set don't divide their time equally. There's scope for a whole other post on statistics of statistics here; I'll do that some other time.

Wednesday 15 March 2017

Big Data World day 1 review

This event was held today at ExCeL in London. It is handily co-located with the Cloud and Smart IoT events, which was a major draw for me, as I am not strictly speaking a data/IT guy, more of an analyst and wannabe subject matter expert with IoT and increasingly cloud within my professional interests. As the write-up is quite long, the following list of links lets you jump to the summary of the relevant presentation:


I did start from the Cloud Expo end of the hall, as there was a whole 'theatre' devoted to talks on software-defined networks. Judging from the first presentation, which looked at SD-WANs, these guarantee the exact opposite of net neutrality to business users, prioritising cloud business applications at the expense of e.g. YouTube traffic. The technology involves 'aggregating and orchestrating' a variety of links of varying performance and availability, from MPLS and 3G to old-school internet, and intelligent path control for different applications' traffic, e.g. routing VoIP traffic through MPLS, thus being able to guarantee QoS above 'best effort'. Interesting ideas on SLAs for application uptime and even application performance could extend the current SLAs on network availability. To my simplistic brain it would all be simpler and safer if we kept most applications offline and standalone instead of putting everything on the cloud, but there's a cloud-enabled industrial revolution in progress, which involves cost cutting and getting rid of half the IT department. Clearly that is the main driver, with the downside being increasing demands on the network.

In complete contrast, I moved on to a presentation in the Big Data World area on the General Data Protection Regulation. This made me feel suddenly quite nostalgic for the good old EU, its much maligned courts and the human right to a private life. The list of non-EU countries that conform to European data protection standards is interesting to say the least: Andorra, Argentina, Canada, Faeroe Islands, Guernsey, Israel, Isle of Man, Jersey, New Zealand, Switzerland and Uruguay. Maybe there's material there for a Tableau map and some spatial insights!

Next was SelectStar, a database monitoring solution. Their main selling point is being a single monitoring tool for all database technologies, as companies will increasingly have a mixture of open source and proprietary relational and NoSQL solutions, rather than being, say, an 'all Oracle' house. I was hoping to get some tips on actual monitoring of the health of a Hadoop cluster, but they don't do much beyond what the built-in Hadoop monitoring tools offer; in fact Hadoop is quite a recent addition to their offerings.

I try to balance attending presentations relevant to my job with things that are just plain interesting. In the spirit of the latter, I went to the 'deep learning' keynote, which, as one member of the audience noted, used mostly image recognition examples. The presenter explained that these make good presentation material, and that the technique has wider applications. The key thing I took from this was that the feature extraction is now also 'trainable', whereas in old-school machine learning only the classification stage was. I'm not fully convinced machines are better than humans at deciding what the best feature vector is, and I should read up on any speech/music processing applications of this, as I have a better theoretical grounding in that field. Would a machine ever come up with something as brilliant but crazy as the cepstrum?

I next attended a talk on deriving 'actionable insight' from data. This is now such a cliché that it is a bit meaningless, especially when a whole brigade of viz whizz kids are into dashboard visualisations with little accompanying text, carrying little of the insight a well-illustrated, text-heavy report will give you. Refreshingly, and perhaps tellingly about the differing priorities of the private and public sector data worlds, the speaker used no visual material whatsoever. He just talked about his work as a data analyst for Peterborough city council, and projects such as identifying unregistered rental properties by fusing a variety of datasets including council tax and library user details, or analysing crime patterns! I should look into what open data offerings they have.

The Weather Company, finally a (nearly) household name, was next, with a keynote on weather for business. Interestingly, they crowdsource some of their data to achieve a 'hyperlocal' resolution of 500 m, 96 times a day globally, while a major sector of their corporate customers, airlines, are also data providers. They have a unique data processing platform, and the speaker put a certain emphasis on their acquisition by IBM and the 'insights and cognition' synergies that will result from it.

I then ventured into the Smart IoT area for a talk from O2/Telefónica. The data protection theme was present here again: they make sure their data is secure and anonymous via aggregation and extrapolation, plus a certain amount of weeding out of any data that is not aggregated enough and could therefore be used to identify individual users; users consenting to the use of their data also came up during question time. They derive location via both 'active' (calls) and 'passive' (cell handovers) means, as well as behavioural data from apps and websites visited, and 'in venue' information from their wifi hotspots and small cells (another hint at the seamless integration of multiple networks mentioned earlier). This builds up to 6 billion events on a typical day, and they keep an archive of 4 years. These events are classified into 'settles' and 'journeys' and analysed to identify origins and destinations, with uses ranging from transport planning and audience profiling to retail decision making.

Back to the other end of the hall to hear self-professed ubergeek Andi Mann of Splunk on DevOps. Follow the link, as he can summarise his ideas much better than I can. He gave me another interesting fact about Yahoo's past as a cool tech parent company, despite today's headlines on security breaches: the presentation on dev and ops cooperation at Flickr that was key to starting the DevOps conversation back in 2009. I think the idea of DevOps has applications outside software; in fact a lot of the operational intelligence work I do sits somewhere between operations and product/service development.

I did skip the QlikView presentation for the sake of Splunk, but came back to the Big Data keynotes for Esri. Their presentation focused on the benefits of spatial analysis, giving the grid aggregation I wrote about in the past as an example, along with navigation, fraudulent transaction detection and even insurance for ships going into pirate-infested waters!

Finally, a joint act between Wejo and Talend on their connected car project. This was interesting for many reasons, from the technologies involved on the car, such as eCall, to the technology used for data processing, as Talend offers an open source ETL tool that can sit on top of Spark. On the latter front there was a certain emphasis on the benefits of having a unified infrastructure for batch processing and streaming, and a mention of the Apache Beam project as something Talend will support in the future to that end.

In the time left between all this I got to talk to some of the exhibitors. The most intriguing was a Belgian company which makes Anatella, a tool that would catch the eye of anyone working in telecoms for its binary CDR processing capability. The demos included some fascinating social network graph analysis derived from phone calls: e.g. Flemish-speaking Belgians tend to call other Flemish-speaking Belgians and French-speaking Belgians tend to call other French-speaking Belgians, with the exception of Brussels, which proves to be bilingual territory not only in constitutional theory but also in telephony practice! There was a news story a few years ago on similar telephone 'connections' between UK regions; I remember Wales being split in three from north to south, with each part having more contact with the adjacent English region to the east than with the rest of Wales to the north and/or south. Fascinating work that can escape the narrow confines of marketing into proper Geography research, not unlike the work of the Flowminder Foundation that I link to on the right.

Friday 10 March 2017

The chimney map

As this blog is, among other things, about maps, here's a fascinating documentary on a very old map found in Scotland and restored at the National Library of Scotland.

Open Data Camp

The 2017 Open Data Camp took place at the end of last month in Cardiff. I was told about it in a semi-private exchange with Chris Love. I couldn't make it, but a lot of information is online; the session grid gives you a good overview and links to any info published online. See 'Open data for newbies' for a quick introduction to the subject; among other things it makes the vital distinction between public and open data. Data in PDF seems to be the running joke, though last I heard, Tableau is promising a PDF connector!

There are some fascinating sessions, such as 'how to get 1 million people speaking Welsh', which of course raises the question of how you can define, measure and model the growth of Welsh speakers. There's also a Minecraft session, which is unsurprising: I saw a demonstration of LiDAR data in Minecraft at last year's Cambridge Dorkbot.