Showing posts with label Programming Languages. Show all posts

Sunday, 29 July 2018

Microsoft flow as a last resort in place of cron and windows scheduler

You know the situation: while you are not the biggest fan of MS out there, your employer's IT department is. Of course you don't really have proper access to a Linux system to set up a cron job to do your data fetching for you, and you don't even have an 'always on' Windows machine to do it via Task Scheduler.

Well, you have to make do with what you have, and in this case that is Office 365, and specifically a lesser-known component of it called Flow. There is an upside to rescue your pride somewhat: while it is a humble office app, Flow uses the same workflow language as Azure Logic Apps, so think of it as cloud training!

Tuesday, 30 January 2018

Converting UK national grid to latitude/longitude: a Python function from Hannah Fry

I found Hannah Fry's Python function useful with a dataset in national grid coordinates. It's not something that could easily be done in a Tableau calculated field, as it is an iterative calculation that converges to the solution. Hannah Fry does have a Tableau connection though: she was a keynote speaker at the London Tableau Conference on Tour a few years ago. That presentation gave some key insights on Tom Cruise's central upper tooth, as well as showing pictures of two clones of Hannah with symmetrical faces, one based on the left side of her face and one on the right!
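To show why this has to be iterative (and so can't be a single Tableau calculated field), here is a minimal sketch of just the convergence step, not Hannah Fry's full function: solving for the latitude whose meridional arc length matches a given northing. The constants are the standard Airy 1830 / OSGB36 values from the Ordnance Survey guide; the example northing is illustrative.

```python
import math

# Airy 1830 ellipsoid and OSGB36 national grid constants (from the OS guide)
a, b = 6377563.396, 6356256.909      # semi-major / semi-minor axes (m)
F0 = 0.9996012717                    # scale factor on the central meridian
lat0 = math.radians(49)              # latitude of true origin
N0 = -100000.0                       # northing of true origin (m)
n = (a - b) / (a + b)

def meridional_arc(lat):
    """Arc length along the meridian from lat0 to lat, in metres."""
    return b * F0 * (
        (1 + n + 5/4*n**2 + 5/4*n**3) * (lat - lat0)
        - (3*n + 3*n**2 + 21/8*n**3) * math.sin(lat - lat0) * math.cos(lat + lat0)
        + (15/8*n**2 + 15/8*n**3) * math.sin(2*(lat - lat0)) * math.cos(2*(lat + lat0))
        - 35/24*n**3 * math.sin(3*(lat - lat0)) * math.cos(3*(lat + lat0))
    )

def initial_latitude(northing):
    """Iterate the latitude until its meridional arc matches the northing."""
    lat = (northing - N0) / (a * F0) + lat0   # first guess
    while True:
        M = meridional_arc(lat)
        if abs(northing - N0 - M) < 1e-5:     # converged to 0.01 mm
            return lat
        lat += (northing - N0 - M) / (a * F0)
```

The full conversion then applies a series of correction terms to this latitude and derives the longitude from the easting, but the loop above is the part that rules out a one-shot calculated field.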

Monday, 1 May 2017

Why I'm learning Pig

I made fun of the Apache Pig project the first time I came across it, but I take it back: I now fully see its value and I'm learning how to use it. As there is a lot of ill-informed discussion online and offline claiming that Pig and Hive are equivalent tools, and that the difference is merely one of syntax between the SQL-like (declarative) HiveQL and the procedural, scripting-style Pig Latin, let me explain how I became convinced of the need to use both.

I came to Hadoop by gaining access to a system set up for us by a cloud provider, with a lot (but not all) of the data I'm interested in already in HDFS and Hive tables. In that situation, it's taking me a while to figure out what each part of the Hadoop ecosystem does and how it could be useful to me. Hive was the one thing that seemed immediately useful and worth learning: it held a lot of data I was interested in, it more or less follows an accessible standard language (SQL), and it offers quite powerful statistics. An initial presentation from the provider claimed it could be used for hypothesis testing, predictive analytics etc., and while that seems a bit misleading in retrospect, Hive can provide all the statistics needed by any specialist tool that does the testing or the prediction. So far so good. I did play with Spark a few times to figure out what it is and how it works, but the barrier to entry there seemed definitely higher: you have to worry about cluster configuration, memory etc. when you launch jobs, and you have to write a lot of low-level code (RDDs, closures etc.).

One of the knowledge exchange sessions with the provider was on using Hadoop for ad hoc analysis. Their suggested process was: copy the data to HDFS, load it into a newly created Hive table, load that into a Spark dataframe, do certain operations, convert to an RDD, do more operations. It seemed awfully complicated. When the need for such an analysis arose, I realised I would have to define a 44-column table schema when I only wanted to average one column grouped by the contents of another, and I gave up on using Hadoop for the task at all. It bothered me that I didn't know how to do something this simple on Hadoop though, so I kept reading through books and searching online until Pig emerged as the obvious solution. The syntax for what I wanted to do was ridiculously easy:
file_data = LOAD 'hdfs://cluster/user/username/file.csv' USING PigStorage(',');
-- drop the CSV header by filtering out the record whose first field holds the header name
raw_data = FILTER file_data BY $0 != 'field0_name';
-- keep only the two columns of interest, casting the measure to int
data_fields = FOREACH raw_data GENERATE $11 AS file_dimension, (int)$43 AS file_measure;
data_group = GROUP data_fields BY file_dimension;
-- average the measure within each group
avg_file_measure = FOREACH data_group GENERATE group, AVG(data_fields.file_measure) AS file_measure_avg;
This example embodies several aspects of Pig's philosophy. Pigs eat anything: Pig doesn't necessarily require a full schema, and isn't particularly difficult about the delimiter or the presence or absence of the CSV header (which I filter out with the FILTER statement in the example). Pig can go even further, working with semi-structured and unstructured, non-normalised data that would be entirely unsuitable for Hive without serious processing. Pigs are domestic animals, and rather friendly to the user. One of the early presentations on Pig stated that it "fits the sweet spot between the declarative style of SQL, and the low-level, procedural style of MapReduce". I would dare say that this statement could be updated for the Hadoop 2 world with Spark in place of MapReduce, so it is unsurprising that Pig is still heavily used for ETL and other work on Hadoop, and that Pig on Spark is in the works (hopefully delivering on the 'Pigs fly' promise). A final point, that Pigs live anywhere, should comfort anyone worried about learning such a niche language: it is also supported elsewhere, e.g. on Amazon EMR.

So in retrospect: an organisation can adopt Hadoop and throw all its data into a 'data lake' in HDFS. Any competent programmer in that organisation can then use an array of programming approaches (Pig, raw MapReduce, Spark) to analyse this data, some faster to program, others more powerful but requiring more programming effort. This is the fabled 'end of the data warehouse', but it is only possible if the users of the data can do their own programming. If on the other hand the organisation wants to give non-programmer analysts access to the data, connect standard BI tools to it etc., then it adopts Hive, but has to do a lot of the same work that a traditional data warehouse requires: ETL, normalisation etc. The main advantage of Hive over a traditional DWH is being able to cope with Big Data that would grind an RDBMS to a halt. In most cases a happy medium is probably reached where key data is in Hive tables, but a lot of other 'niche' data stays in non-structured or non-normalised formats in the data lake. I have not addressed where NoSQL databases fit into this picture; I promise to come back to the subject when I have a similar NoSQL epiphany.

Tuesday, 28 March 2017

Statistics of statistics



In looking at the pie charts of CPD hours, we found that the sum of the averages was the average of the sums, whereas the sum of the medians was much smaller than the median of the sums. This is another way of saying that the average is a linear function, i.e. it is true that


f(ax+by)=af(x)+bf(y)
whereas the median is non-linear. This is quite important from a visualisation point of view, as pie charts, stacked bar charts, area graphs etc. imply that the sum of the parts is a meaningful quantity, and in the case of non-linear aggregations (median, COUNTD) it often isn't.
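A quick numeric illustration of this, with made-up CPD-style hours for two people across three categories:

```python
import statistics

# Two people's hours across the same three categories (illustrative numbers)
alice = [1, 2, 9]
bob = [2, 8, 2]
totals = [x + y for x, y in zip(alice, bob)]   # [3, 10, 11]

# The mean is linear: averaging the sums equals summing the averages
print(statistics.mean(totals))                            # 8.0
print(statistics.mean(alice) + statistics.mean(bob))      # 8.0

# The median is not: the sum of the medians can be far from the median of the sums
print(statistics.median(alice) + statistics.median(bob))  # 4
print(statistics.median(totals))                          # 10
```

A stacked bar of the per-person medians would therefore imply a total of 4 hours when the true median total is 10.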

In tables, Tableau addresses this with 'grand totals', whose aggregation doesn't have to be a sum but can be e.g. an overall median. If you've been careful never to imply the parts can be summed but still find your users exporting data and doing sums in Excel, adding a table view with a grand total configured to the suitable aggregation can save you hours of arguing!

Another case of statistics of statistics in Tableau arises when using Level of Detail calculations. I used to do this manually: do counts at 'exact date' level of detail, export the counts to Excel, re-import them into Tableau, and then find the weekly maximum of the counts, effectively using Tableau as my data warehouse/ETL tool as well as the analysis and visualisation tool. The arrival of Level of Detail calculations saved me all this bother, as now I can plot a
max({fixed [date]: count([x])})
against the date to the nearest week.
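In plain Python, that two-level aggregation looks like this, with hypothetical event dates standing in for [x]:

```python
from collections import Counter
from datetime import date

# Hypothetical event dates (standing in for the records counted by count([x]))
events = [date(2017, 3, 6), date(2017, 3, 6), date(2017, 3, 7),
          date(2017, 3, 13), date(2017, 3, 14), date(2017, 3, 14), date(2017, 3, 14)]

# {fixed [date]: count([x])} -- counts at exact-date level of detail
daily_counts = Counter(events)

# max(...) at the weekly level: the maximum daily count within each ISO week
weekly_max = {}
for d, c in daily_counts.items():
    week = d.isocalendar()[:2]            # (year, week number)
    weekly_max[week] = max(weekly_max.get(week, 0), c)

print(weekly_max)   # {(2017, 10): 2, (2017, 11): 3}
```

The inner Counter is what the FIXED part computes; the outer loop is the max() wrapped around it at the coarser (weekly) level of detail.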

Of course there are also cases of using data from a proper data warehouse, whether a traditional RDBMS or Hive. There again it is worth being careful to match any aggregation done in Tableau to the aggregation done in the data warehouse: e.g. MIN([min_x]) can't go wrong, but averages can be tricky. Say the original SQL was (table name illustrative)
SELECT d, AVG(x) AS avg_x FROM source_table GROUP BY d
where d1 has 1 record and d2 has 100! Computing avg([avg_x]) in Tableau is then just asking for trouble. Instead, modify the SQL to
SELECT d, AVG(x) AS avg_x, COUNT(1) AS n FROM source_table GROUP BY d
and then we can get the proper average in Tableau with
SUM([avg_x]*[n])/SUM([n])
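A small Python sketch of why the weighted form is needed, using the 1-record/100-record split from the example (values are made up):

```python
# d1 has 1 record with x = 100; d2 has 100 records each with x = 1 (illustrative)
groups = {"d1": [100], "d2": [1] * 100}

# What the modified SQL hands to Tableau per group: avg_x and the record count n
rows = [(sum(xs) / len(xs), len(xs)) for xs in groups.values()]

# Naive avg([avg_x]) -- wrong: it weights both groups equally
naive = sum(avg for avg, n in rows) / len(rows)

# SUM([avg_x]*[n]) / SUM([n]) -- recovers the true average over all 101 records
weighted = sum(avg * n for avg, n in rows) / sum(n for _, n in rows)

print(naive)      # 50.5
print(weighted)   # 1.98..., i.e. 200/101, the true AVG(x) over every record
```

The naive figure of 50.5 is nowhere near the true average of roughly 1.98, which is exactly the trap the weighted calculated field avoids.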

Wednesday, 8 February 2017

The story of Pig

Yahoo! (don't forget the exclamation mark!) nowadays makes the news for all the wrong reasons: takeover by other companies, being hacked by unnamed state actors several times, not to mention the very dodgy advertising all over its websites. It wasn't always thus.

Cast your mind back to the mid-noughties and you'll remember Yahoo! acquiring the smaller stars of the web 2.0 constellation: Flickr, which they kept; del.icio.us, which they sold on; and Upcoming, which they retired. Of course Yahoo! already had a history of acquiring companies with great products and messing them up, from GeoCities to LAUNCHcast. But by 2005 it seemed like they were suddenly getting it and becoming cool.

On the technology side, Yahoo! was a pioneer of Big Data, with open source projects such as Hadoop (the writer of that first blog post, Jeremy Zawodny, later summed up the story of that time nicely on his personal blog), Pig and other bits of that ecosystem becoming part of an Apache project rather than a proprietary product.

One wonders if they would be better off now had they kept it as their own product; maybe they would be the giants of cloud computing. Releasing it as open source, though, meant that it became an effective industry standard, with other companies contributing projects such as Hive (in fact, have a look at this blog post that details the use of both Pig and Hive inside Yahoo!. If only someone updated it to add Spark SQL to the mix!). So if anything, even if Yahoo! goes down, the Apache Hadoop ecosystem will probably survive.

Friday, 3 February 2017

Pig history, features, application and operations

Would you consider a training course with the agenda below?
• History of Pig
• What is Pig and Why Pig
• Pig Vs. MapReduce
• Features of Pig and Its Application
• Pig Data Model and Pig Operations
I think I'll stick to Hive.