Wednesday 29 November 2017

Tableau repositioning itself with regards to data preparation

For those of us who have used Tableau for years, the changes in every version seem to remove another need for external tools or code for data preparation. Think of the introduction of filled maps, the union feature, the Excel data interpreter, the pivot and split, the spatial file connector (don't mention the pdf connector!). While certain Tableau partners/consultants are still keen on the Tableau-Alteryx stack, I'm not convinced of its long-term market viability, and neither was Gartner last time I checked. The latest announcement on Project Maestro is a rather aggressive move from Tableau's side into traditional Alteryx territory.

If you want my advice, learn some basic scripting, some coding, regular expressions, some Unix, or even just good editor skills. You can only go so far with 'friendly tools', and you still have to spend a lot of effort learning them, so you might as well learn an open source, transferable skill instead.

Sunday 26 November 2017

Colouring by secondary source dimension in Tableau, avoiding the asterisk

The data below is so small that you might wonder why I bother blending rather than joining, or even just creating a group. Let's say this is a demonstration of a technique that proved useful with much bigger and more complicated datasets, where the non-blending dimension of the primary source didn't have an obvious hierarchical relationship with the dimension in the secondary source that yielded the asterisk.

So we have two sources: the primary one lists European election constituency regions per UK nation, and the secondary one lists all the MEPs with their region and party.
So, how do we blend those two and do a bar chart of the MEPs of each region with the appropriate party colour coding?

As you can see, once we put nation as a dimension from the primary source, the secondary source field 'Party' cannot be used as a dimension, and we get the dreaded asterisk. Fear not, not all is lost.

There is a workaround, but it only works for cases like this where there is a handful of parties. We create a separate calculated field for each party's MEPs, put Measure Names on Colour, and throw all these party MEP calculated fields on Measure Values (see screenshot above, calculations below):
if [Party]='CON' or [Party]='UUP' then [MEP] end

if [Party]='LAB' then [MEP] end

if [Party]='UKIP' then [MEP] end

if [Party]='SNP' then [MEP] end

if [Party]!='SNP' and [Party]!='CON' and [Party]!='LAB' and [Party]!='UUP' and [Party]!='UKIP' then [MEP] end
I've covered a different scenario of avoiding the asterisk with calculated fields in a blend in a previous post here.

Using human vision's edge detection skills as a colour comparator


When visualising data we have to keep thinking about how human vision works and how we can work in synergy with our viewer's eyes and brain.
Human vision is very good at reconstructing the 3D world from the rather limited information the eye sees, and it can also do that when presented with a 2D image of the 3D world. Artists and scientists have studied and exploited these human abilities.

Basically, our brain can do edge detection. It can see that the side of the tablecloth is a bit darker than the top of the table, therefore the edge of the table is where the two meet. The luminance component of an image is so much more important to us than the chrominance that we spent nearly a century being amazed at black and white photography and cinema. Even our high-tech digital codecs use higher resolution for the luminance than for the chrominance components.

In my map visualisations I have tried to exploit this the other way around: use the eye's edge detection ability not so much to find the edge, but to distinguish the light from the dark side. See for example the two maps above. Once we put in a hard border line it saturates our vision, and we can no longer see whether the East of England or Yorkshire is the darkest blue. This is much like walking in a dark park at night and having a cyclist with strong LED lights coming towards you. You can definitely see the cyclist's light but you can no longer see the path, or indeed much else.

The map on the left, on the other hand, suddenly allows us to make the most of the limited dynamic range. We can now see that the East of England is the darkest blue. It is harder to see the border between areas that are the same colour, but this is a Tableau map: that's what interaction is for. After all, quite often these maps are for people who know where the borders are; they are trying to see the borders suggested by the data, not the borders known a priori. Adding a hard border is much like a naive painter's approach: putting more effort into picturing what we know rather than what we see.

Saturday 25 November 2017

Converting hexadecimal values in Tableau

While Tableau has a lot of basic maths and string functions, coping with hexadecimal numbers is not something it can do natively. Let's see how we can do this with calculated fields.

To make the solution easier we break the problem into two: interpret each hexadecimal digit, then put the results together to convert the whole number to decimal. We create a calculated field for the rightmost hex digit (1s):
ifnull(int(right([Hex No],1)), 
case right([Hex No],1) 
when 'a' then 10 
when 'b' then 11 
when 'c' then 12 
when 'd' then 13 
when 'e' then 14 
when 'f' then 15 
end)
We do the same for the next digit (16s) where we can use mid([Hex No],5,1) assuming our numbers are in the format 0x023c. Likewise for the next two digits, 16^2s and 16^3s. Then we bring everything together:
[16^3s]*16*16*16+[16^2s]*16*16+[16s]*16+[1s]
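As a quick sanity check on that example format, 0x023c should come out as 0*4096 + 2*256 + 3*16 + 12 = 572.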

Sunday 12 November 2017

Working with Eurostat unemployment data in Tableau

Eurostat publishes, among other things, unemployment data for the EU. The data is in a weird format, a hybrid between tab and comma separated values.
Having turned it into a proper comma separated format, we remove the colons (Eurostat's marker for missing values) and pivot the years to get a more Tableau-friendly format:
While the year and quarter can be split, a dateparse calculation is more helpful:
dateparse("yyyy'Q'q",[Pivot Field Names])
This finally allows us to do some analysis, looking at the unemployment trend for the ten countries of the 2004 accession. Cyprus, which I've chosen to highlight, is a definite outlier.
It had relatively low unemployment before entering the EU and a fairly mild recession until the Greek PSI, when unemployment rises and rises until the final bail-in and associated bank collapse in 2013. This was a double-dip, or in the case of unemployment double-peak, situation, with things improving only to get worse again around the end of 2014 and beginning of 2015. Croatia was the only country of the group in a worse situation, until Q3 2016 when the two cross over.

Sunday 22 October 2017

Work around for problems with Split function when connecting Tableau to PostgreSQL

I'm a big fan of the split function when writing calculated fields, partly because far too often the fields in the sources I work with are concatenations of other fields, and partly because I've done a union of several CSV sources and some crucial bit of information is hidden in the table name/path. But recently I tried to use it against PostgreSQL and got the following error:
  • ERROR: function split_part(text[], unknown, integer) does not exist; Error while executing the query
Now I could investigate this further on the PostgreSQL side of things but I just want to get my thing to work and move on. So instead here's the workaround I came up with:
mid([concat_str],start_of_split, find([concat_str],'/',start_of_split)-start_of_split)
where start_of_split is a fixed character position (maybe you want to do another find here) and '/' is the delimiter to split on.

Sunday 20 August 2017

Tableau and negative zero

Another saga from the big data frontier: at work we have a source of GPS data, and I've been working with a colleague to aggregate it to a degree grid, not unlike the example linked. The data is in Hive, and bizarrely it has two columns for each dimension: latitude magnitude as a positive float and latitude sense as a string (N or S), and likewise for longitude. To make our life simple we round the magnitude in the custom SQL that connects to Hive, and we make the data signed again with a simple calculation in the Tableau source:
[round Latitude magnitude]*
(case [latitude sense]
when 'S' then -1
when 'N' then 1
end)
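For reference, the custom SQL doing the rounding is roughly along these lines (a sketch only; the table, column names and the aggregate are made up for illustration):
SELECT
  ROUND(latitude_magnitude)  AS round_latitude_magnitude,
  latitude_sense,
  ROUND(longitude_magnitude) AS round_longitude_magnitude,
  longitude_sense,
  COUNT(*) AS positions  -- stand-in for whatever aggregate is actually needed
FROM gps_positions
GROUP BY ROUND(latitude_magnitude), latitude_sense,
         ROUND(longitude_magnitude), longitude_sense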
The rounding of the magnitude is fine as we also keep the sense in the group by dimensions in the custom SQL. The only special case is when the rounded magnitude is zero: had we done the sign assignment before the rounding we'd have a single bucket for zero, instead of one for 0 to 0.5 and one for -0.5 to 0. But surely that shouldn't be an issue once we do the calculation above in Tableau?

It turns out that it is an issue. I'm not sure exactly what's happening in the data engine (IEEE floating point's signed zero rearing its head, presumably), but the two zeros are treated differently; here it is recreated with the data of the post linked above.
Sure enough, looking at the two zeros there's one for -0 and one for +0. So we refine our calculation to avoid multiplying zero by -1:

if [round Longitude magnitude]=0 then 0 else
[round Longitude magnitude]*
(case [longitude sense]
when 'W' then -1
when 'E' then 1
end)
end 

Sunday 11 June 2017

Pollster pollster, what do you know?

It's been quite a while since I put something on Tableau Public. Once I saw this I couldn't resist. I can't quite get the trend line right, but let that be an exercise for the reader :)


Saturday 27 May 2017

A green electoral trend

The bar chart below, from Green party leaflets, is very interesting. The argument here is that their share of the vote is growing and that a vote for them is not a lost vote. I have a lot of sympathy for the argument for proportional representation. The 2015 British national election was a joke, with UKIP being the third party in votes but not in seats, and the SNP benefiting massively from first past the post to get a near-monopoly of Scottish seats that is not quite reflected in the vote counts.
However, this chart poses a few questions for me. Sure, the Green vote grew from 2010 to 2015, but was it a one-off that cannot be projected into further growth for 2017?
After all, 2015 was the fall of the LibDems, and those voters who felt betrayed by Clegg but hadn't forgiven Labour either could vote Green. The situation of course is very different now, with Labour breaking away entirely from the Blairite past while the LibDems are trying to capitalise on the 48% Remain vote (by repeating the Guardian quote 'Corbyn betrayed us', among other things).
An equally interesting feature of this bar chart is the plateau from 2005 to 2010, which supports the argument that 2015 was an outlier and not the manifestation of an underlying trend. The last decade had very different priorities to the current one: the two big parties had largely taken environmental concerns on board, the Tories even had a green tree as a logo (the same one that was recently repainted with the Union Jack), and even though the economic crisis was obvious at the end of the decade, the political priorities took a while to change. So maybe 2010 is the outlier, and the trend is there.
We'll only know once the 2017 results are out!

Sunday 14 May 2017

The Tory deficit

So, another month, another election campaign. Cambridge is a two-horse race again; this time there is no dispute over which parties are the two horses, as in the little snippet I scanned from the latest LibDem leaflet. But what about that black arrow over the blue bar? That's what I call 'the Tory deficit': not the financial deficit but the one of votes. Or is it?
Let's think about this. The black arrow would bring the Cons up to about 34%. Then the total would be 34+35+36=105%, and that's without counting the small parties. I'm also not convinced the bars start from zero, but it's quite hard to compare the length of the arrow against the length of the bars.
In terms of electoral rhetoric, 'Jeremy Corybn's (sic) candidate' is an interesting choice. Julian Huppert very much lost in 2015 because he was 'Nick Clegg's candidate'. Of course the Liberals are pushing the argument a bit too far by saying (in a previous leaflet that went to recycling without stopping by the scanner) 'Don't vote Tory, you might get Corbyn!'
  

Monday 8 May 2017

It's the economy, stupid!

So I was listening to my local BBC station yesterday. Have you noticed how rather mediocre radio stations make a bit of an extra effort in their weekend programming? Normally this involves some specialist music shows, but BBC Cambridgeshire also has the Naked Scientists. One of the themes of the evening was language, and one of the featured scientists (hopefully not naked) was the economist Keith Chen. The fact that he is not a linguistics professor is a crucial thing to note, as is the fact that he teaches in the school of management and not in the economics department. But I digress.

Keith Chen's main point was that people speaking a language that has an explicit future tense (such as English or Greek) don't save as much money, don't take as much care of their health etc. compared to speakers of languages that don't have a future tense (apparently German is such a language). For the nuanced argument you can read the relevant paper, which I have only skimmed through, but hey, this is a blog, we don't take ourselves too seriously.

One of his main sources of data is the world values survey. The first thing I notice on visiting their site is the beautiful magic quadrant visualisation known as the Inglehart–Welzel Cultural Map, or occasionally the Welzel-Inglehart Cultural Map. This immediately screams 'Samuel Huntington Clash of civilisations' to me, but I haven't read that book either so I won't get carried away. Just notice how countries are bundled together in mysterious ways: Azerbaijan occasionally becomes orthodox, Israel and Greece catholic, the English speakers are of course exceptionally neither protestant nor catholic even though they could be either or neither, and the colouring does or doesn't always follow the religion, or the cluster, which contorts around accordingly.


So this wonderful source of data proves that future-tense-equipped languages like the ones mentioned above have speakers that don't plan for the future, and vice versa. The examples quoted included of course the UK and Greece as the worst savers in Europe. This tempted me to use the website's facilities to get the table embedded below:

Country Code                               TOTAL       Cyprus      Germany
Save money                                 42.8%       13.9%       57.0%
Just get by                                38.4%       66.3%       24.7%
Spent some savings and borrowed money       9.1%       12.2%        7.6%
Spent savings and borrowed money            6.6%        5.8%        7.1%
DE,SE: Inapplicable; RU: Inappropriate
  response; BH: Missing; HT: Dropped
  out survey                                0.2%        -           0.3%
No answer                                   1.7%        -           2.6%
Don't know                                  1.1%        1.9%        0.8%
(N)                                        (3,046)     (1,000)     (2,046)
To me this data says one thing: people in Germany were well off at the time of the survey, and people in Cyprus were much less well off. When you have money to spare you save; when you don't, you just get by. That has little to do with your language and the way it expresses future events, and a lot more to do with employment going up or down, banks doing well or being about to collapse, and the euro being too strong or too weak in relation to the country's economic health. In fact Chen went as far as citing Belgium as an example where, everything else being the same, language is the only factor differentiating people. Perhaps he should check out some call record analysis proving that Belgium is really two parallel societies that meet in Brussels!

I was planning to finish on a note about the sad state of linguistic research, but that would be wrong; actually, the fact that he is in the management school explains the unique blend of positivist prejudice on display here.

Saturday 6 May 2017

Histograms and data driven aggregation

Unavoidably, once you start taking your work seriously as 'data science' you have to do hypothesis testing. And to do hypothesis testing you need to know the distribution of your data. And the most intuitive way to see the distribution of your data is to plot a histogram.

So in that context, we have a go at plotting a histogram of a field in our data. The advice of our 'big data' provider is - you guessed it - pull the data from Hive into a Spark data frame, do some operations, convert to an RDD, do some more operations. I'm too lazy for all that, so digging around I found that Hive has a histogram function. You might not like it, as it returns an array of structures containing the bin centres and the respective frequencies, and it uses some funky binary delimiters, different for the struct fields, the array elements and of course the fields returned by the query. This is complicated enough to merit its own post, which I promise to do in the future, but in my book it is still preferable: no need for 20 (or 50?) lines of configuration and functional programming where a SQL one-liner would do.
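To give an idea of the one-liner (leaving the delimiter wrangling for that future post), Hive's histogram_numeric UDAF takes the column and a number of bins; the column and table names below are placeholders, and 20 bins is an arbitrary choice:
SELECT histogram_numeric(my_measure, 20) AS bins FROM my_table;
The resulting array of {bin centre, frequency} structs can also be flattened into rows with a lateral view, which makes it easier to eyeball:
SELECT b.x AS bin_centre, b.y AS frequency
FROM (SELECT histogram_numeric(my_measure, 20) AS bins FROM my_table) h
LATERAL VIEW explode(bins) e AS b;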

Anyway, having done that I was looking at another field for which we also needed a histogram, and realised that it is really a discrete measurement: a number somewhere between 40 and 80 that only ever came with .00, .25, .50 or .75 in the decimal places. Maybe an unsigned 8-bit quantity at the point of measurement/analogue to digital conversion? Anyway, that means that for a histogram you can avoid the binning altogether; the data is pretty much 'pre-binned'. Instead it becomes more like the first classic example of any MapReduce tutorial: a word count. How many times does e.g. 50.25 appear in the data, and likewise for every other value.
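In SQL terms that 'word count' is just a group-by, something like the following (again with placeholder names):
SELECT my_measure AS bin_value, COUNT(*) AS frequency
FROM my_table
GROUP BY my_measure
ORDER BY bin_value;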

Knowing your data can always save time and effort when trying to analyse it. A key reason to like Tableau is the way it allows you to learn what your dataset looks like and explore it from all possible sides. I have to confess though, the final dataset was simple enough for the histogram to be done as a pivot-chart in Excel!

Monday 1 May 2017

Why I'm learning Pig

I made fun of the Apache Pig project the first time I came across it, but I take it back. I now fully see its value and I'm learning how to use it. As there is a lot of ignorant discussion online and offline claiming that Pig and Hive are equivalent tools and that the difference is merely one of syntax, between SQL-like (declarative) HiveQL and scripting-style, procedural Pig Latin, let me explain how I became convinced of the need to use both.

I came to Hadoop by gaining access to a system set up for us by a cloud provider, with a lot (but not all) of the data I'm interested in sitting in HDFS and Hive tables. In that situation, it's taking me a while to figure out what every part of the Hadoop ecosystem does and how it could be useful to me. Hive was the one thing that seemed immediately useful and worth learning: it had a lot of data I was interested in, it more or less follows an accessible standard language (SQL), and it offers quite powerful statistics. An initial presentation on it from the provider claimed it could be used for hypothesis testing, predictive analytics etc., and while that seems a bit misleading in retrospect, Hive can provide all the statistics needed by any specialist tool that does the testing or the prediction. So far so good. I did play with Spark a few times to figure out what it is and how it works, but the barrier to entry there seemed definitely higher: you have to worry about cluster configuration, memory etc. when you launch jobs, and you have to write a lot of low-level code (RDDs, closures etc.).

One of the knowledge exchange sessions with the provider was on Hadoop being used for ad hoc analysis. Their suggested process was: copy the data to HDFS, load it into a newly created Hive table, load it from the Hive table into a Spark dataframe, do certain operations, convert to an RDD, do more operations. It seemed awfully complicated. When there was a need to do such analysis, I realised I would have to define a 44-column table schema when I only wanted to average one column grouped by the contents of another, and gave up on using Hadoop at all for the task. It bothered me that I didn't know how to do something this simple on Hadoop though, so I kept reading through books and searching online until Pig emerged as the obvious solution. The syntax for what I wanted to do was ridiculously easy:
file_data = LOAD 'hdfs://cluster/user/username/file.csv' USING PigStorage(',');
raw_data = FILTER file_data by $0!='field0_name';
data_fields = FOREACH raw_data GENERATE $11 AS file_dimension,  (int)$43 AS file_measure;
data_group = GROUP data_fields by file_dimension;
avg_file_measure = FOREACH data_group GENERATE group,AVG(data_fields.file_measure) AS file_measure_avg;
This example embodies certain aspects of Pig's philosophy: Pigs eat everything, without necessarily requiring a full schema or being particularly difficult about the field delimiter, or the presence or absence of the CSV header (which I filter out in the second line of the example). Pig can go even further, working with semi-structured and unstructured, non-normalised data that would be entirely unsuitable for Hive without serious processing. Pigs are domestic animals and rather friendly to the user. One of the early presentations on Pig stated that it "fits the sweet spot between the declarative style of SQL, and the low-level, procedural style of MapReduce". I would dare say that this statement could be updated for the Hadoop 2 world with Spark in place of MapReduce, so it is unsurprising that Pig is still heavily used for ETL and other work on Hadoop, and Pig on Spark is in the works (hopefully delivering on the Pigs fly promise). A final point, that Pigs live anywhere, should comfort anyone worried about learning such a niche language: it is also supported e.g. on Amazon EMR.

So in retrospect: an organisation can adopt Hadoop and throw all its data into a 'data lake' in HDFS. Any competent programmer in that organisation can then use an array of programming approaches (Pig, raw MapReduce, Spark) to analyse this data, some faster to program, others more powerful but requiring more programming effort. This is the fabled 'end of the data warehouse', but it is only possible if the users of the data can do their own programming. If on the other hand the organisation wants to give non-programmer analysts access to the data, connect standard BI tools to it etc., then it adopts Hive, but has to do a lot of the same work that a traditional data warehouse requires: ETL, normalisation etc. The main advantage of Hive compared to a traditional DWH is being able to cope with Big Data that would grind an RDBMS to a halt. In most cases a happy medium is probably reached where key data is in Hive tables, but a lot of other 'niche' data stays in non-structured or non-normalised formats in the data lake. I have not addressed where NoSQL databases fit into this picture; I promise to come back to the subject when I have a similar NoSQL epiphany.

Saturday 29 April 2017

Cambridgeshire and Peterborough is a two horse race

In the wider context of the cult of the leader/CTO, Cambridgeshire and the Peterborough unitary authority are bundled together for a devolved mayor election, with a budget for the mayor to tackle housing and transport. To her credit, the Green candidate at least proposes forming an assembly to keep this leader in check, but there is little chance of a Green mayor. Local election literature told us time and again that it's a two-horse race. But which two horses?
Labour have been using the only previous result that covers the same geographical area with a similar electoral process, which helpfully shows them as the only ones who can beat the Tory. They also foolishly put it on their website in JPEG format, discrete cosine transform artifacts and all. Next time please use PNG, guys!
But the mayor would have make-or-break powers on a number of important issues. Houses in Cambridge are about as overpriced as in London. A new train station has led to redevelopment and price hikes in formerly affordable Chesterton as developers prepare to house even more London commuters. Peterborough of course has been discussed in the national press in a number of 'this is why Brexit happened' articles. This election is likely to be taken far more seriously by voters than the police and crime commissioner one in 2016.
The Lib Dems, on the other hand, are doing something even worse, using the Cambridgeshire county council results, which wouldn't include any Peterborough votes at all. After all, they still have residual support in Cambridge from back in the day when they were the anti-war, anti-fees party to the left of New Labour, whereas in Peterborough the Tories are much stronger and they are a distant third party.
For closer monitoring of election visualisations, see Phil Rodgers' blog.

Tuesday 28 March 2017

Statistics of statistics



In looking at the pie charts of CPD hours we found that the sum of averages was the average of the sums, whereas the sum of medians was much smaller than the median of the sums. This is another way of saying that the average is a linear function, i.e. it is true that


f(ax+by)=af(x)+bf(y)
whereas the median is non-linear. This is quite important from a visualisation point of view, as pie charts, stacked bar charts, area graphs etc. imply that the sum of the parts is a meaningful quantity, and in the case of non-linear aggregations (median, COUNTD) it often isn't.
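A toy example shows the difference: take three members whose (red, green) hours are (0, 10), (10, 0) and (0, 0). The median of red is 0 and the median of green is 0, so the sum of the medians is 0, yet the median of the member totals (10, 10, 0) is 10. The means behave: mean(red) + mean(green) = 10/3 + 10/3 = 20/3, which is exactly the mean of the totals.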

In tables Tableau addresses this with the 'grand total', whose aggregation doesn't have to be a sum but could be e.g. an overall median. If you've been careful never to imply the parts can be summed, but still find your users exporting data and doing sums in Excel, adding a table view with a grand total configured to the suitable aggregation can save you hours of arguing!

Another case of statistics of statistics in Tableau can arise when using Level of Detail calculations. I used to do this manually by doing counts at 'exact date' level of detail, exporting the counts to Excel, re-importing them into Tableau and then finding the weekly maximum of the counts, effectively using Tableau as my data warehouse/ETL tool as well as the analysis and visualisation tool. The arrival of Level of Detail calculations saved me from all this bother, as now I can plot a
max({fixed [date]:count([x])})
against the date to the nearest week.

Of course there are also cases of using data from a proper data warehouse, whether a traditional RDBMS one or Hive. In that case again it is worth being careful to match any aggregation done in Tableau to the aggregation done in the data warehouse. E.g. MIN([min_x]) can't go wrong, but averages can be a bit tricky. Say the original SQL was something like (with a placeholder table name)
SELECT d, avg(x) AS avg_x FROM my_table GROUP BY d
where d1 has 1 record and d2 has 100! Coming to Tableau and doing an avg([avg_x]) is then just asking for trouble. Instead modify the SQL to
SELECT d, avg(x) AS avg_x, count(1) AS n FROM my_table GROUP BY d
and then we can get the proper average in Tableau with
SUM([avg_x]*[n])/SUM([n])
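To see why: if d1 has avg_x = 10 from its single record and d2 has avg_x = 0 across its hundred records, avg([avg_x]) reports 5, while the weighted formula gives (10*1 + 0*100)/101 ≈ 0.1, the true average over all 101 records.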

Monday 27 March 2017

Politicians for analytics and analytics for politicians



MP Daniel Zeichner (Labour) has set up a data analytics all-party parliamentary group. The membership of the group includes Labour and Conservatives, a token Lib Dem and an ex-hereditary crossbench lord. The SNP, a far more important player in this parliament than the Lib Dems, is not represented.
On the other side of the political mainstream, David Willetts, as a Universities and Science coalition minister a few years ago, identified Big Data as one of the 8 great technologies that could guarantee future growth of the British economy, along with space, robotics and autonomous systems, synthetic biology, regenerative medicine, agri-science, advanced materials and energy. Space was probably closer to his heart, and he retired to a position on the board of satellite maker SSTL.
Of course data analytics has been at the heart of British political debate for a while now: the fixation with quantifying performance and ranking schools and hospitals has been central to any discussion on education and the NHS, the central issues for the Blair and Cameron governments respectively. The Blair government was pivotal with its ideas about evidence-based policy for our supposedly post-ideological times.
We've already seen the increasing importance of data in local government, which is tied to the emergence of 'Smart Cities', while data mined from mobile telecoms is used for things like transport planning. For that most political of issues, data protection, the increasing contradictions between the incoming EU GDPR and the UK's snoopers' charter might be partially resolved by Brexit, though the UK in its typical way might be quite relaxed about following EU regulations while still obliged to do so, and the media will play its part in making snooping palatable to the voters.

Monday 20 March 2017

data @peterborough.gov.uk

Following on from the talk by a Peterborough council employee at Big Data World, I had a look for their open data. They have an impressive portal at http://data.peterborough.gov.uk/; the platform in use (Data Share) is developed by the London borough of Redbridge.

I chanced upon the topical dataset of the split of social and affordable housing allocations between British nationals, Europeans and other foreigners. This is of special interest not only because the issue of EU citizens' benefits has been at the centre of the Brexit debate, but also because Peterborough has been singled out in a couple of newspaper articles as a place that helps explain the Brexit vote (already from the last general election, following the UKIP referendum campaign, and in the aftermath).

Data Share has a funny start page where you follow one link to view data and another to download it, each leading to categories of datasets, and each category's link to the list of datasets in the category. Quite a lot of clicking through, especially if you first view and then decide to download. Thankfully there is actually a download button in the viewing area.

Viewing the data shows a table but offers some other options; the interface reminds me a bit of the built-in visualisations in a Zeppelin notebook.
This doesn't live up to the promise though: try any view other than the table and what you get is a visualisation of the number of records in the table per value of the dimension selected for the category axis, not even fitting in one page, and not in chronological order either!
It does look like the Peterborough data and the Data share platform haven't been tuned to work with each other when it comes to visualisation. Changing the dimension for the category axis becomes even more revealing:
So this is a bit like the modal value of my CPD hours :). I have a visualisation telling me that there are two records/quarters in the dataset for which 197 houses were allocated to British nationals, as opposed to only one quarter for which 200 houses were allocated. This might interest a conspiracy theorist with numerology fixations, but it is far from insightful. Downloading the data and playing with it even in Excel can get you a bit further, though the 'Apr-Jun 10' format is not great. Instead we load it into Tableau, split the period and modify the year to four digits:
'20'+TRIM( SPLIT( [Period], " ", 2 ) )
We also pivot the various nationalities to give a more Tableau-friendly format.
The remaining issue is the occasional 0-3 value, which is probably there to 'anonymise' the dataset rather than single out the hypothetical one Czech family that got a council house in a particular quarter of a particular year. Change the data type to Number (whole), which is fine for all other values and gives null for the 0-3, and then add another calculation:
ifnull([Pivot Field Values],3)
I could have used ZN() and turned them into zeros, but if the hypothesis we are investigating is 'the bad immigrants take all the houses' we need to take the 'worst' case scenario. Now, the total number of houses allocated fluctuates, so looking at percent of total allocations with the table calculation computed at the cell level (i.e. percent of the total for the quarter) we can see how the relative percentages of nationalities fluctuate. As the nationalities are already partially grouped, I group the eastern Europeans in with the rest of the EEA nationals. I also stick in one crucial number from the UKIP campaign Grauniad article: 79.4% of Peterborough's population were born in the UK. Of course there are plenty of British nationals not born in the UK, but let's use what numbers we have at hand.

The end result shows that the percentage of British nationals getting social and affordable housing hovers just under the percentage of British-born people in Peterborough. I would guess that if you are rich, or at least well off, in Peterborough you are more likely to be British (an assumption that wouldn't necessarily hold in London), so what we really need is the percentage breakdown by nationality of people who can't afford market rent in Peterborough. But in any case it shows that things are not as bad as the benefit tourist story wants you to think. After all, many of the EEA nationals in council housing could be working in the warehouses mentioned in the more recent Grauniad article. Which boils the question down to why the British government does not enforce a living wage more strictly, if the benefit seekers are a drain on the national finances.

Sunday 19 March 2017

EU data protection compliance

As promised, here's a look at where the data-protection-compliant countries lie on the map. The geospatial insight? They could be anywhere in the world, unless they are a tiny island state/dependency, in which case they have to be within about 10 degrees east or west of the Greenwich meridian!



Saturday 18 March 2017

Pie chart epic fail

I've just submitted my continuing professional development records to the professional body I'm a member of. They use a rather clunky data entry web form, and once everything goes through you get some dashboards (of course!). The code behind the dashboard generation is quite inflexible: it really couldn't believe I would give it zero for the red, blue and green metrics, so it still zooms in on them in a second pie chart, where three zeroes take a third of the pie each. Is there an official name for this shaded pie chart? Soup bowl perhaps?
My recorded CPD hours can be compared to those of the whole set of members in the same membership category as me, whose results come in pie charts.

set mean
set median
Now, something interesting is happening here, as the median of both blue and green is 0 while the mean is non-zero. This means at least half the members in the set have done zero blue and zero green hours, much like myself. Also, hilariously, the sum of the median hours is less than the obligatory 30. This goes to show why it is such a bad idea to do a pie chart of medians, as a pie chart implies a total value for the whole pie with some real meaning.

Now on to the bar chart showing all categories, with a choice of metric between Mode, Mean and Median. Mode is a really odd choice for what is essentially a continuous metric, and sure enough, by selecting it I get an error:
One or more CPD Hours sets has no single modal value
The means and the medians are shown below.
Mean
Median
The meagre 30.3 is me. The dark green has the same mean and median (124), so I would hazard a guess that it is a set of only one member, especially given the mean is a round number. It's interesting that for the other sets the median is just over half the mean, but we know by this point that they might also have very few members with values all over the place. Or an uber-keen minority in each set is skewing the mean upwards.

The blue is my set. The mean of 107.2 matches the total of the set average pie chart. That is not the case with the median, but of course the sum of medians is not the median of the sums, whereas the sum of averages is the average of the sums. The fact that the sum of medians is actually much smaller than the median of sums shows us that most members of the set don't divide their time equally. There's scope for a whole other post on statistics of statistics here; I'll do that some other time.

Wednesday 15 March 2017

Big Data World day 1 review

This event was today at ExCeL in London. It is handily co-located with the Cloud and Smart IoT events, which was a major draw for me, as I am not strictly speaking a data/IT guy, more of an analyst and wannabe subject matter expert with IoT and increasingly cloud within my professional interests. As the write-up is quite long, the following list of links lets you jump to the summary of the relevant presentation:


I did start from the Cloud Expo end of the room, as there was a whole 'theatre' devoted to talks on software-defined networks. Judging from the first presentation, which looked at SD-WANs, these guarantee the exact opposite of net neutrality to business users, prioritising cloud business applications at the expense of e.g. YouTube traffic. The technology involves 'aggregating and orchestrating' a variety of links of varying performance and availability, from MPLS and 3G to old-school internet, and intelligent path control for different applications' traffic, e.g. routing VoIP traffic through MPLS, thus being able to guarantee QoS above 'best effort'. Interesting ideas on SLAs for application uptime and even application performance could extend the current SLAs of network availability. To my simplistic brain, it would all be simpler and safer if we kept most applications offline and standalone, instead of putting everything on the cloud, but there's a cloud-enabled industrial revolution in progress, which involves cost cutting and getting rid of half the IT department. Clearly that is the main driver, with the downside being increasing demands on the network.

In complete contrast, I moved on to a presentation in the Big Data World area on the General Data Protection Regulation. This made me feel suddenly quite nostalgic for the good old EU, its much maligned courts and the human right to a private life. The list of non-EU countries that conform to European data protection standards is interesting to say the least: Andorra, Argentina, Canada, Faeroe Islands, Guernsey, Israel, Isle of Man, Jersey, New Zealand, Switzerland and Uruguay. Maybe there's material there for a Tableau map and some spatial insights!

Next was SelectStar, a database monitoring solution. Their main selling point is being a single monitoring tool for all database technologies, as increasingly companies will have a mixture of open source and proprietary relational and NoSQL solutions, rather than being, say, an 'all Oracle' house. I was hoping to get some tips on actually monitoring the health of a Hadoop cluster, but they don't do much beyond what the built-in Hadoop monitoring tools offer; in fact Hadoop is quite a recent addition to their offerings.

I try to balance attending presentations relevant to my job with things that are just plain interesting. In the spirit of the latter, I went to the 'deep learning' keynote, which, as one member of the audience noted, used mostly image recognition examples. The presenter explained that they make good presentation material, and that the technique has wider applications. The key thing I took from this was that the feature extraction is now also 'trainable', whereas in old-school machine learning only the classification stage was. I'm not fully convinced machines are better than humans at deciding what the best feature vector is, and I should read up on any speech/music processing applications of this as I have a better theoretical grounding in that field. Would a machine ever come up with something as brilliant but crazy as the cepstrum?

I next attended a talk on deriving 'actionable insight' from data. This is now such a cliché that it is a bit meaningless, especially when the whole brigade of viz whizz kids are into dashboard visualisations with little accompanying text, carrying little of the insight a well-illustrated, text-heavy report will give you. Refreshingly, and perhaps tellingly about the differing priorities of the private and public sector data worlds, the speaker used no visual material whatsoever. He just talked about his work as a data analyst for Peterborough city council, and projects such as identifying unregistered rental properties by fusing a variety of datasets including council tax and library user details, or analysing crime patterns! I should look into what open data offerings they have.

The Weather Company, finally a (nearly) household name, was next, in a keynote on weather for business. Interestingly, they crowdsource some of their data to achieve a 'hyperlocal' resolution of 500m, 96 times a day globally, while a major sector of their corporate customers, airlines, are also data providers. They have a unique data processing platform, and the speaker did put a certain emphasis on their acquisition by IBM and the 'insights and cognition' synergies that will result from it.

I then ventured into the Smart IoT area for a talk from O2/Telefónica. The data protection theme was present here again, as they make sure their data is secure and anonymous via aggregation and extrapolation, and a certain amount of weeding out of any data that is not aggregated enough and could therefore be used to identify individual users; users' consent to the use of their data also came up during question time. They derive location via both 'active' (calls) and 'passive' (cell handovers) means, as well as behavioural data from apps and websites visited, and 'in venue' information from their wifi hotspots and small cells (another hint at the seamless integration of multiple networks mentioned earlier). This builds up to 6 billion events on a typical day, and they keep an archive of 4 years. These events are classified into 'settles' and 'journeys' and analysed to identify origins and destinations, with uses ranging from transport planning and audience profiling to retail decision making.

Back to the other end of the hall to hear self-professed übergeek Andi Mann of Splunk on DevOps. Follow the link, as he can summarise his ideas much better than I can. He gave me another interesting fact about Yahoo's past as a cool tech parent company, despite today's headlines about security breaches: the presentation on dev and ops cooperation at Flickr that was key to starting the DevOps conversation back in 2009. I think the idea of DevOps has applications outside software; in fact a lot of the operational intelligence work I do sits somewhere between operations and product/service development.

I did skip the QlikView presentation for the sake of Splunk, but came back to the Big Data keynotes for Esri. Their presentation focused on the benefits of spatial analysis, giving the grid aggregation I wrote about in the past as an example, along with navigation, fraudulent transaction detection and even insurance for ships going into pirate-infested waters!

Finally, a joint act between Wejo and Talend on their connected car project. This was interesting for many reasons, from the technologies involved on the car, such as eCall, to the technology used for data processing, as Talend offers an open source ETL tool that can sit on top of Spark. On that latter front there was a certain emphasis on the benefits of having a unified infrastructure for batch processing and streaming, and a mention of the Apache Beam project as something Talend will support in the future to that end.

In the time left between all this I got to talk to some of the exhibitors. The most intriguing one was a Belgian company who make Anatella, which would catch the eye of anyone working in telecoms for its binary CDR processing capability. The demos included some fascinating social network graph analysis derived from phone calls: e.g. Flemish-speaking Belgians tend to call other Flemish-speaking Belgians and French-speaking Belgians tend to call other French-speaking Belgians, with the exception of Brussels, which proves to be bilingual territory not only in constitutional theory but also in telephony practice! There was a news story a few years ago on similar telephone 'connections' between UK regions; I remember Wales being split in three from north to south, with each part having more contact with the adjacent English region to the east than with the rest of Wales to the north and/or south. Fascinating work that can escape the narrow confines of marketing into proper geography research, not unlike the work of the Flowminder Foundation that I link to on the right.

Friday 10 March 2017

The chimney map

As this blog is, among other things, about maps, here's a fascinating documentary on a very old map found in Scotland and restored at the National Library of Scotland.

Open Data Camp

The 2017 Open Data Camp was at the end of last month in Cardiff. I was told about it in a semi-private exchange with Chris Love. I couldn't make it, but a lot of information is online; the session grid gives you a good overview and links to any info published online. See 'Open data for newbies' for a quick introduction to the subject; among other things it makes the vital distinction between public and open data. Data in PDF seems to be the running joke, though last I heard, Tableau is promising a PDF connector!

There are some fascinating sessions, such as 'how to get 1 million people speaking Welsh', which of course raises the question of how you can define, measure and model the growth of Welsh speakers. There's also a Minecraft session, unsurprisingly; I've seen a demonstration of LiDAR data in Minecraft at last year's Cambridge Dorkbot.


Sunday 26 February 2017

Layering marks and polygons on map

This is a trick that has become much simpler to perform since the introduction of union in Tableau 9.3.

We start with two data files, one with the vertices of our polygons and another with the locations where we want the marks. We create a union of those two when we create our Tableau data source.
The wildcard union is particularly handy for multiple files so keep it in mind, though in this case we don't really need it. What we then need to do is select the Latitude and lat columns, right-click and select 'Merge mismatched fields', and likewise for the longitudes.

Then we can create our map with these merged latitudes and longitudes, but we really want to create two maps, one for each layer. Here's how to create the marks map:
Beware of the averaged coordinates: if you don't put all the dimensions in the level of detail you might not get a mark for each row in your dataset! And here's how to create the polygon map:

Now we need to select dual axis, then right-click and hide the 'Null' location. This will give us the desired two-layer map.

As it happens, my marks are the centroids of postcodes. So we can tell Tableau that through the geographic role of the location field, and select filled maps as the mark type to get the postcode polygon instead of the dot at the centroid. Note that the (generated) Latitude and Longitude are no good for this, as they are not visible when editing the source and cannot be merged with the mismatched latitude and longitude of the polygon source after the union; they can't even be used in calculated fields, which could be another way around this (the pre-9.3 way of doing things). So an original text-only source might have to be imported into Tableau and the generated coordinates copied to a new source to use for a union.


Friday 24 February 2017

Alternative maps

Don't get me wrong, I don't read defence magazines. The extracts below are from Radio User magazine, which I used to get both from a professional interest in maritime and aero communications and a hobby interest in shortwave and medium wave listening (DX-ing). It is an excellent source of information on ADS-B and AIS as well as general search and rescue and safety comms, and of course long distance broadcasting and propagation conditions.

It does however also cater for plane spotters and scanner enthusiasts following military exercises, hence the 'alternative' map of Scandinavia below. The description of the made-up fringe country between Norway and Iceland is hilarious; any similarities to Brexit island are purely coincidental.


The challenge, as always, is how to remake the map in Tableau. To make my life easier I'm ignoring the made-up internal borders inside Sweden and Finland, and focusing on relocating New Zealand to the North Atlantic.

Let's start with a dataset listing the necessary countries. Tableau is very good at mapping them in their usual locations.


But how do we move New Zealand? Let's have a go at extracting the generated coordinates and fiddling with the numbers. We select the countries on the map and view the data, and it's a rare case where the summary has more info (the generated coordinates) than the full data. Select everything and copy into the CSV creator of your choice; any spreadsheet application will do.

Now move New Zealand to its 'alternative' position, let's say 60N 5W, remove the '(generated)' from the header, save the CSV and re-import into Tableau. If you put the new coordinates on X and Y you get dots in the right places. But change to a filled map and New Zealand stubbornly refuses to move from the Pacific to the Atlantic.

OK, so clearly Tableau is doing things behind the scenes. Of course it has to, as we haven't specified the relative size of the countries. Let's do that by adding the area of each country in another column. So can we escape by setting the geographic role to 'None'?


It has flattened Norway, so it is sort of re-projecting the shape from the Mercator projection. But everything is in the wrong place, and we've pushed the size as far up as possible. Bear in mind that we have another weapon at our disposal, zooming. Tableau doesn't zoom into the marks at the same magnification levels that it zooms into a map. You can see this by zooming into my Makeover Sunday II map. In this case, we want to zoom out to see if it gels the countries together.


So it's not that simple. At this point we give up and leave it for another day.