Saturday 27 May 2017

A green electoral trend

The bar chart below, from Green party leaflets, is very interesting. The argument is that their share of the vote is growing and that a vote for them is not a wasted vote. I have a lot of sympathy for the argument for proportional representation. The 2015 British general election was a joke, with UKIP being the third party in votes but not in seats, and the SNP benefiting massively from first past the post to get a monopoly of Scottish seats that is not quite reflected in the vote counts.
However, this chart poses a few questions for me. Sure, the Green vote grew from 2010 to 2015, but was that a one-off that cannot be projected into further growth for 2017?
After all, 2015 was the fall of the LibDems, and those voters who felt betrayed by Clegg but hadn't forgiven Labour either could vote Green. The situation is of course very different now, with Labour breaking away entirely from its Blairite past while the LibDems try to capitalise on the 48% Remain vote (by repeating the Guardian quote 'Corbyn betrayed us', among other things).
An equally interesting feature of this bar chart is the plateau from 2005 to 2010, which supports the argument that 2015 was an outlier and not the manifestation of an underlying trend. The last decade had very different priorities from the current one; the two big parties had largely taken environmental concerns on board, the Tories even had a green tree as a logo (the same one that was recently repainted with the Union Jack), and even though the economic crisis was obvious at the end of the decade, political priorities took a while to change. So maybe 2010 is the outlier, and the trend is there.
We'll only know once the 2017 results are out!

Sunday 14 May 2017

The Tory deficit

So, another month, another election campaign. Cambridge is a two-horse race again; this time there is no dispute about which parties are the two horses, as shown in the little snippet I scanned from the latest LibDem leaflet. But what about that black arrow over the blue bar? That's what I call 'the Tory deficit'. Not the financial deficit but the one of votes. Or is it?
Let's think about this. The black arrow would bring the Cons up to about 34%. Then the total would be 34+35+36=105%, and that's without counting the small parties. I'm also not convinced the bars start from zero, but it's quite hard to compare the length of the arrow against the length of the bars.
In terms of electoral rhetoric, 'Jeremy Corybn's (sic) candidate' is an interesting choice. Julian Huppert very much lost in 2015 because he was 'Nick Clegg's candidate'. Of course the Liberals are pushing the argument a bit too far by saying (in a previous leaflet that went to recycling without stopping by the scanner) 'Don't vote Tory, you might get Corbyn!'
  

Monday 8 May 2017

It's the economy, stupid!

So I was listening to my local BBC station yesterday. Have you noticed how rather mediocre radio stations make a bit of an extra effort in their weekend programming? Normally this involves some specialist music shows, but BBC Cambridgeshire also has the Naked Scientists. One of the themes of the evening was language, and one of the featured scientists (hopefully not naked) was the economist Keith Chen. The fact that he is not a linguistics professor is a crucial thing to note, as is the fact that he teaches in the school of management and not in the economics department. But I digress.

Keith Chen's main point was that people speaking a language that has an explicit future tense (such as English or Greek) don't save as much money, don't take as much care of their health, etc. compared to speakers of languages that don't have one (apparently German is such a language). For the nuanced argument you can read the relevant paper, which I have only skimmed, but hey, this is a blog; we don't take ourselves too seriously.

One of his main sources of data is the World Values Survey. The first thing I notice on visiting their site is the beautiful magic-quadrant visualisation known as the Inglehart–Welzel Cultural Map, or occasionally the Welzel–Inglehart Cultural Map. This immediately screams 'Samuel Huntington, Clash of Civilisations' to me, but I haven't read that book either so I won't get carried away. Just notice how countries are bundled together in mysterious ways: Azerbaijan occasionally becomes Orthodox, Israel and Greece Catholic, the English speakers are of course exceptionally neither Protestant nor Catholic even though they could be either or neither, and the colouring does or doesn't always follow the religion, or the cluster, which contorts around accordingly.


So this wonderful source of data proves that future-tense-equipped languages like the ones mentioned above have speakers who don't plan for the future, and vice versa. The examples quoted included, of course, the UK and Greece as the worst savers in Europe. This tempted me to use the website's facilities to get the table embedded below:

                                       TOTAL    Cyprus   Germany
Save money                             42.8%    13.9%    57.0%
Just get by                            38.4%    66.3%    24.7%
Spent some savings and borrowed money   9.1%    12.2%     7.6%
Spent savings and borrowed money        6.6%     5.8%     7.1%
Inapplicable / missing / dropped out    0.2%      -       0.3%
No answer                               1.7%      -       2.6%
Don't know                              1.1%     1.9%     0.8%
(N)                                   (3,046)  (1,000)  (2,046)
To me this data says one thing: people in Germany were well off at the time of the survey, and people in Cyprus were much less well off. When you have money to spare, you save; when you don't, you get by, and that has little to do with your language and the way it expresses future events. It has a lot more to do with employment going up or down, banks doing well or being about to collapse, and the euro being too strong or too weak in relation to the country's economic health. In fact Chen went as far as citing Belgium as an example where, everything else being the same, language is the only factor differentiating people. Perhaps he should check out some call record analysis proving that Belgium is really two parallel societies that meet in Brussels!

I was planning to finish on a note about the sad state of linguistic research, but that would be wrong; actually, the fact that he is in the management school explains the unique blend of positivist prejudice displayed here.

Saturday 6 May 2017

Histograms and data driven aggregation

Unavoidably, once you start taking your work seriously as 'data science' you have to do hypothesis testing. And to do hypothesis testing you need to know the distribution of your data. And the most intuitive way to see the distribution of your data is to plot a histogram.

So in that context, we have a go at plotting a histogram of a field in our data. The advice of our 'big data' provider is, you guessed it, to pull the data from Hive into a Spark data frame, do some operations, convert to an RDD, do some more operations. I'm too lazy for all that, so digging around I found that Hive has a histogram function. You might not like the idea, as it returns an array of structures containing the bin centres and the respective frequencies, and it uses some funky binary delimiters, different for the struct fields, the array elements and of course the fields returned by the query. This is complicated enough to merit its own post, which I promise to write in the future, but in my book it is still preferable: no need for 20 (or 50?) lines of configuration and functional programming where a SQL one-liner would do.
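In the meantime, here is a minimal sketch of the kind of query I mean, assuming a hypothetical Hive table called measurements with a numeric column called reading. The built-in function is histogram_numeric, and the LATERAL VIEW explode is one way to turn its array-of-structs result into plain rows instead of wrestling with the binary delimiters:
-- Hypothetical table and column names; 20 is the requested number of bins
SELECT bin.x AS bin_centre,
       bin.y AS frequency
FROM (
  SELECT histogram_numeric(reading, 20) AS hist
  FROM measurements
) h
LATERAL VIEW explode(h.hist) t AS bin;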

Anyway, having done that, I was looking at another field for which we also needed a histogram, and realised that it is really a discrete measurement: a number somewhere between 40 and 80 that only ever has .00, .25, .50 or .75 in the decimal places. Maybe an unsigned 8-bit quantity at the point of measurement/analogue-to-digital conversion? Anyway, that means that to do a histogram you can avoid the binning altogether; the data is pretty much 'pre-binned'. Instead it becomes more like the first classic example of any MapReduce tutorial: a word count. How many times does, say, 50.25 appear in the data, and likewise for all the other values.
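Since the data arrives pre-binned, the 'word count' is literally a GROUP BY and a COUNT. A minimal sketch, again with hypothetical table and column names:
-- Hypothetical table 'measurements' with the pre-binned column 'reading'
SELECT reading,
       COUNT(*) AS frequency
FROM measurements
GROUP BY reading
ORDER BY reading;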

Knowing your data can always save time and effort when trying to analyse it. A key reason to like Tableau is the way it allows you to learn what your dataset looks like and explore it from all possible sides. I have to confess, though, that the final dataset was simple enough for the histogram to be done as a pivot chart in Excel!

Monday 1 May 2017

Why I'm learning Pig

I made fun of the Apache Pig project the first time I came across it, but I take it back. I now fully see its value and I'm learning how to use it. As there is a lot of ignorant discussion, online and offline, claiming that Pig and Hive are equivalent tools and that the difference is merely one of syntax between SQL-like (declarative) HiveQL and scripting-style, procedural Pig Latin, let me explain how I became convinced of the need for both.

I came to Hadoop by gaining access to a system set up for us by a cloud provider, with a lot (but not all) of the data I'm interested in already in HDFS and Hive tables. In that situation, it's taking me a while to figure out what every part of the Hadoop ecosystem does and how it could be useful to me. Hive was the one thing that seemed immediately useful and worth learning: it held a lot of data I was interested in, it more or less follows an accessible standard language (SQL), and it offers quite powerful statistics. An initial presentation on it from the provider claimed it could be used for hypothesis testing, predictive analytics etc., and while that seems a bit misleading in retrospect, Hive can provide all the statistics needed by any specialist tool that does the testing or the prediction. So far so good. I did play with Spark a few times to figure out what it is and how it works, but the barrier to entry there seemed definitely higher: you have to worry about cluster configuration, memory etc. when you launch jobs, and you have to use a lot of low-level code (RDDs, closures etc.).
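To give a flavour of the statistics Hive offers out of the box, here is a hedged sketch using only built-in aggregate functions; the table and column names are made up for illustration:
-- Hypothetical table 'measurements' with numeric columns 'reading' and 'temperature'
SELECT avg(reading)                    AS mean_reading,
       stddev_pop(reading)             AS sd_reading,
       percentile_approx(reading, 0.5) AS median_reading,
       corr(reading, temperature)      AS reading_temperature_corr
FROM measurements;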

One of the knowledge exchange sessions with the provider was on Hadoop being used for ad hoc analysis. Their suggested process was: copy the data to HDFS, load it into a newly created Hive table, load it from the Hive table into a Spark dataframe, do certain operations, convert to an RDD, do more operations. It seemed awfully complicated. When there was a need to do such analysis, I realised I would have to define a 44-column table schema when I only wanted to average one column grouped by the contents of another, and gave up on using Hadoop at all for the task. It bothered me that I didn't know how to do something this simple on Hadoop though, so I kept reading through books and searching online until Pig emerged as the obvious solution. The syntax for what I wanted to do was ridiculously easy:
-- Load the CSV; PigStorage(',') splits each line on commas, no schema required
file_data = LOAD 'hdfs://cluster/user/username/file.csv' USING PigStorage(',');
-- Drop the CSV header by filtering out the row whose first field is the column name
raw_data = FILTER file_data BY $0 != 'field0_name';
-- Keep only the two columns of interest, casting the measure to int
data_fields = FOREACH raw_data GENERATE $11 AS file_dimension, (int)$43 AS file_measure;
-- Group by the dimension and average the measure within each group
data_group = GROUP data_fields BY file_dimension;
avg_file_measure = FOREACH data_group GENERATE group, AVG(data_fields.file_measure) AS file_measure_avg;
DUMP avg_file_measure; -- trigger execution and print the result
This example embodies certain aspects of Pig's philosophy. Pigs eat everything: Pig doesn't necessarily require a full schema, and isn't particularly difficult about the field delimiter or the presence or absence of the CSV header (which I filter out in the second line of the example). Pig can go even further, working with semi-structured and unstructured, non-normalised data that would be entirely unsuitable for Hive without serious processing. Pigs are domestic animals and rather friendly to the user. One of the early presentations on Pig stated that it "fits the sweet spot between the declarative style of SQL, and the low-level, procedural style of MapReduce". I would dare say that this statement could be updated for the Hadoop 2 world with Spark in place of MapReduce, so it is unsurprising that Pig is still heavily used for ETL and other work on Hadoop, and Pig on Spark is in the works (hopefully delivering on the 'Pigs fly' promise). A final point, that Pigs live anywhere, should comfort anyone worried about learning such a niche language: it is also supported on, for example, Amazon EMR.

So in retrospect: an organisation can adopt Hadoop and throw all its data into a 'data lake' in HDFS. Any competent programmer in that organisation can then use an array of programming approaches (Pig, raw MapReduce, Spark) to analyse this data, some faster to program, others more powerful but requiring more programming effort. This is the fabled 'end of the data warehouse', but it is only possible if the users of the data can do their own programming. If, on the other hand, the organisation wants to give non-programmer analysts access to the data, connect standard BI tools to it and so on, then it adopts Hive, but it has to do a lot of the same work required for a traditional data warehouse: ETL, normalisation etc. The main advantage of Hive over a traditional DWH is being able to cope with Big Data volumes that would grind an RDBMS to a halt. In most cases a happy medium is probably reached where key data is in Hive tables, but a lot of other 'niche' data stays in non-structured or non-normalised formats in the data lake. I have not addressed where NoSQL databases fit into this picture; I promise to come back to the subject when I have a similar NoSQL epiphany.