
Monday, 1 May 2017

Why I'm learning Pig

I made fun of the Apache Pig project when I first came across it, but I take it back: I now fully see its value and I'm learning how to use it. There is a lot of ill-informed discussion, online and offline, claiming that Pig and Hive are equivalent tools and that the difference is merely one of syntax between the SQL-like (declarative) HiveQL and the scripting-style, procedural Pig Latin, so let me explain how I became convinced of the need to use both.

I came to Hadoop by gaining access to a system set up for us by a cloud provider, with a lot (but not all) of the data I'm interested in already in HDFS and Hive tables. In that situation, it's taking me a while to figure out what each part of the Hadoop ecosystem does and how it could be useful to me. Hive was the one thing that seemed immediately useful and worth learning: it held a lot of data I was interested in, it more or less follows an accessible standard language (SQL), and it offers quite powerful statistics. An initial presentation on it from the provider claimed it could be used for hypothesis testing, predictive analytics etc., and while that seems a bit misleading in retrospect, Hive can provide all the statistics needed by any specialist tool that does the testing or the prediction. So far so good. I did play with Spark a few times to figure out what it is and how it works, but the barrier to entry there seemed definitely higher: you have to worry about cluster configuration, memory etc. when you launch jobs, and you have to write a lot of low-level code (RDDs, closures etc.)

One of the knowledge exchange sessions with the provider was on using Hadoop for ad hoc analysis. Their suggested process was: copy data to HDFS, load the data into a newly created Hive table, load the data from the Hive table into a Spark dataframe, do certain operations, convert to an RDD, do more operations. It seemed awfully complicated. When a need for such an analysis came up, I realised I would have to define a 44-column table schema when I only wanted to average one column grouped by the contents of another, and gave up on using Hadoop for the task altogether. It bothered me that I didn't know how to do something this simple on Hadoop though, so I kept reading through books and searching online until Pig emerged as the obvious solution. The syntax for what I wanted to do was ridiculously easy:
file_data = LOAD 'hdfs://cluster/user/username/file.csv' USING PigStorage(',');
raw_data = FILTER file_data BY $0 != 'field0_name'; -- drop the csv header row
data_fields = FOREACH raw_data GENERATE $11 AS file_dimension, (int)$43 AS file_measure;
data_group = GROUP data_fields BY file_dimension;
avg_file_measure = FOREACH data_group GENERATE group, AVG(data_fields.file_measure) AS file_measure_avg;
DUMP avg_file_measure;
This example embodies certain aspects of Pig's philosophy: Pigs eat everything, without necessarily requiring a full schema or being particularly difficult about the delimiter, or the presence or absence of the csv header (which I filter out in the second line of the example). Pig can go even further, working with semi-structured and unstructured, non-normalised data that would be entirely unsuitable for Hive without serious processing. Pigs are domestic animals and rather friendly to the user. One of the early presentations on Pig stated that it "fits the sweet spot between the declarative style of SQL, and the low-level, procedural style of MapReduce". I would dare say that this statement could be updated for the Hadoop 2 world with Spark in place of MapReduce, so it is unsurprising that Pig is still heavily used for ETL and other work on Hadoop, and Pig on Spark is in the works (hopefully delivering on the Pigs fly promise). A final point, that Pigs live anywhere, should comfort anyone worried about learning such a niche language: it is also supported e.g. on Amazon EMR.
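For comparison, when I gave up on doing this on Hadoop, the same ad hoc average is a couple of lines off-cluster in Python with pandas. A minimal sketch (the column names and data here are made up for illustration, standing in for the hypothetical 44-column csv):

```python
import io
import pandas as pd

# a tiny stand-in for the csv: one dimension column and one measure column
csv = io.StringIO("dim,other,measure\na,1,10\nb,2,20\na,3,30\n")
df = pd.read_csv(csv)

# average one column grouped by the contents of another -- no full schema needed
avg = df.groupby("dim")["measure"].mean()
print(avg)  # a -> 20.0, b -> 20.0
```

The point stands either way: the task needs two conceptual steps (group, average), and the tooling should not demand much more than two lines.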

So in retrospect: an organisation can adopt Hadoop and throw all its data into a 'data lake' in HDFS. Any competent programmer in that organisation can then use an array of programming approaches (Pig, raw MapReduce, Spark) to analyse this data, some faster to program, others more powerful but requiring more programming effort. This is the fabled 'end of the data warehouse', but it is only possible if the users of the data can do their own programming. If, on the other hand, the organisation wants to give non-programmer analysts access to the data, connect standard BI tools to it etc., then it adopts Hive, but has to do a lot of the same work that a traditional data warehouse requires: ETL, normalisation etc. The main advantage of Hive compared to a traditional DWH is being able to cope with Big Data that would grind an RDBMS to a halt. In most cases a happy medium is probably reached where key data is in Hive tables, but a lot of other 'niche' data stays in non-structured or non-normalised formats in the data lake. I have not addressed where NoSQL databases fit into this picture; I promise to come back to the subject when I have a similar NoSQL epiphany.

Tuesday, 28 March 2017

Statistics of statistics



In looking at the pie charts of CPD hours we found that the sum of averages was the average of the sums, whereas the sum of medians was much smaller than the median of the sums. This is another way of saying that the average is a linear function, i.e. it is true that


f(ax+by)=af(x)+bf(y)
whereas the median is non-linear. This is quite important from a visualisation point of view, as pie charts, stacked bar charts, area graphs etc. imply that the sum of the parts is a meaningful quantity, and in the case of non-linear aggregations (median, COUNTD) often it isn't.
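A quick numerical sketch makes the point. The data below is made up (random, CPD-style hours for 100 people across 4 categories), but the behaviour is general: the mean commutes with the row sums, the median doesn't:

```python
import numpy as np

rng = np.random.default_rng(0)
hours = rng.exponential(5.0, size=(100, 4))  # 100 people x 4 CPD categories

totals = hours.sum(axis=1)                   # each person's total hours
sum_of_means = hours.mean(axis=0).sum()      # sum of the per-category averages
mean_of_sums = totals.mean()                 # average of the per-person totals

sum_of_medians = np.median(hours, axis=0).sum()
median_of_sums = np.median(totals)

print(mean_of_sums - sum_of_means)           # zero (up to rounding): the mean is linear
print(median_of_sums - sum_of_medians)       # clearly non-zero: the median is not
```

With skewed data like this, the sum of medians comes out well below the median of the sums, which is exactly what we saw in the CPD pie charts.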

In tables, Tableau addresses this with the 'grand total', the aggregation for which doesn't have to be a sum but could be e.g. an overall median. If you've been careful never to imply the parts can be summed but still find your users exporting data and doing sums in Excel, adding a table view with a grand total configured to the suitable aggregation can save you from hours of arguing!

Another case of statistics of statistics in Tableau can arise when using Level of Detail calculations. I used to do this manually by doing counts at 'exact date' level of detail, exporting the counts to Excel, re-importing the counts to Tableau and then finding the weekly maximum of the counts, effectively using Tableau as my data warehouse/ETL tool as well as the analysis and visualisation tool. The emergence of Level of Detail calculations saved me from all this bother, as now I can plot a
MAX({FIXED [date]: COUNT([x])})
against the date to the nearest week.
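The same statistic-of-a-statistic is easy to sketch outside Tableau too, e.g. in pandas (the event dates here are hypothetical): count at exact-date level first, then take the weekly maximum of those counts:

```python
import pandas as pd

# hypothetical event log, one row per event
events = pd.DataFrame({"date": pd.to_datetime([
    "2017-03-01", "2017-03-01", "2017-03-02",
    "2017-03-06", "2017-03-07", "2017-03-07", "2017-03-07",
])})

daily = events.groupby("date").size()   # COUNT([x]) fixed at exact-date level
weekly_max = daily.resample("W").max()  # weekly maximum of the daily counts
print(weekly_max)  # 2 for the week ending 5 March, 3 for the week ending 12 March
```

Note the two distinct aggregations: a count at the fine level of detail, then a max at the coarse one, exactly what the LOD expression collapses into a single formula.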

Of course there are also cases of using data from a proper data warehouse, whether a traditional RDBMS or Hive. In that case again it is worth being careful to match any aggregation done in Tableau to the aggregation done in the data warehouse. E.g. MIN([min_x]) can't go wrong, but averages can be a bit tricky. Say the original SQL was
SELECT d, avg(x) AS avg_x FROM t GROUP BY d
where group d1 has 1 record and d2 has 100. Coming then to Tableau and doing an AVG([avg_x]) is just asking for trouble. Instead, modify the SQL to
SELECT d, avg(x) AS avg_x, count(1) AS n FROM t GROUP BY d
and then we can get the proper average in Tableau with
SUM([avg_x]*[n])/SUM([n])
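A toy example in plain Python shows why the naive average of averages goes wrong and the weighted version doesn't (the values are made up, matching the one-record d1 versus hundred-record d2 example):

```python
import statistics

d1 = [10.0]        # group d1: 1 record
d2 = [2.0] * 100   # group d2: 100 records

avg_x = [statistics.mean(d1), statistics.mean(d2)]  # what the SQL hands to Tableau
n = [len(d1), len(d2)]

naive = statistics.mean(avg_x)                            # AVG([avg_x]) -> 6.0
weighted = sum(a * w for a, w in zip(avg_x, n)) / sum(n)  # SUM([avg_x]*[n])/SUM([n])
true_avg = statistics.mean(d1 + d2)                       # the average we actually want

print(naive, weighted, true_avg)
```

The naive average gives each group equal weight regardless of how many records it summarises; carrying the row count n through the SQL lets the Tableau calculation weight the groups back correctly.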

Wednesday, 15 March 2017

Big Data World day 1 review

This event was today at ExCeL in London. It is handily co-located with the Cloud and Smart IoT events, which was a major draw for me, as I am not strictly speaking a data/IT guy, more of an analyst and wannabe subject matter expert with IoT and, increasingly, cloud within my professional interests.


I did start from the Cloud Expo end of the hall, as there was a whole 'theatre' devoted to talks on software-defined networks. Judging from the first presentation, which looked at SD-WANs, these guarantee business users the exact opposite of net neutrality, prioritising cloud business applications at the expense of e.g. YouTube traffic. The technology involves 'aggregating and orchestrating' a variety of links of varying performance and availability, from MPLS and 3G to old-school internet, and intelligent path control for different applications' traffic, e.g. routing VoIP traffic through MPLS, thus being able to guarantee QoS above 'best effort'. Interesting ideas on SLAs for application uptime and even application performance could extend the current SLAs for network availability. To my simplistic brain, it would all be simpler and safer if we kept most applications offline and standalone instead of putting everything on the cloud, but there's a cloud-enabled industrial revolution in progress, which involves cost cutting and getting rid of half the IT department. Clearly that is the main driver, with the downside being increasing demands on the network.

In complete contrast, I moved on to a presentation in the Big Data World area on the General Data Protection Regulation. This made me feel suddenly quite nostalgic for the good old EU, its much-maligned courts and the human right to a private life. The list of non-EU countries that conform to European data protection standards is interesting to say the least: Andorra, Argentina, Canada, Faeroe Islands, Guernsey, Israel, Isle of Man, Jersey, New Zealand, Switzerland and Uruguay. Maybe there's material there for a Tableau map and some spatial insights!

Next was SelectStar, a database monitoring solution. Their main selling point is being a single monitoring tool for all database technologies, as companies will increasingly have a mixture of open source and proprietary relational and NoSQL solutions, rather than being e.g. an 'all Oracle' house. I was hoping to get some tips on actual monitoring of the health of a Hadoop cluster, but they don't do much in addition to what the built-in Hadoop monitoring tools offer; in fact Hadoop is quite a recent addition to their offering.

I try to balance attending presentations relevant to my job with things that are just plain interesting. In the spirit of the latter, I went to the 'deep learning' keynote, which, as one member of the audience noted, used mostly image recognition examples. The presenter explained that they make good presentation material, and that the technique has wider applications. The key thing I took from this was that feature extraction is now also 'trainable', whereas in old-school machine learning only the classification stage was. I'm not fully convinced machines are better than humans at deciding what the best feature vector is, and I should read up on any speech/music processing applications of this, as I have a better theoretical grounding in that field. Would a machine ever come up with something as brilliant but crazy as the cepstrum?

I next attended a talk on deriving 'actionable insight' from data. This is now such a cliché that it is a bit meaningless, especially when the whole brigade of viz whizz-kids is into dashboard visualisations with little accompanying text, carrying little of the insight a well-illustrated, text-heavy report will give you. Refreshingly, and perhaps tellingly of the differing priorities of the private and public sector data worlds, the speaker used no visual material whatsoever. He just talked about his work as a data analyst for Peterborough city council, and projects such as identifying unregistered rental properties by fusing a variety of datasets including council tax and library user details, or analysing crime patterns! I should look into what open data offerings they have.

The Weather Company, finally a (nearly) household name, was next, with a keynote on weather for business. Interestingly, they crowdsource some of their data to achieve 'hyperlocal' resolution of 500 m, 96 times a day, globally, while a major sector of their corporate customers, airlines, are also data providers. They have a unique data processing platform, and the speaker did put a certain emphasis on their acquisition by IBM and the 'insights and cognition' synergies that will result from it.

I then ventured into the Smart IoT area for a talk from O2/Telefónica. The data protection theme was present here again: they make sure their data is secure and anonymous via aggregation and extrapolation, plus a certain amount of weeding out of any data that is not aggregated enough and could therefore be used to identify individual users; users consenting to the use of their data also came up during question time. They derive location via both 'active' (calls) and 'passive' (cell handovers) means, as well as behavioural data from apps and websites visited, and 'in venue' information from their wifi hotspots and small cells (another hint at the seamless integration of multiple networks mentioned earlier). This builds up to 6 billion events on a typical day, and they keep an archive of 4 years. These events are classified into 'settles' and 'journeys' and analysed to identify origins and destinations, with uses ranging from transport planning and audience profiling to retail decision making.

Back to the other end of the hall to hear self-professed ubergeek Andi Mann of Splunk on DevOps. Follow the link, as he can summarise his ideas much better than I can. He gave me another interesting fact on Yahoo's past as a cool tech parent company, despite today's headlines about security breaches: the presentation on dev and ops cooperation at Flickr was key to starting the DevOps conversation back in 2009. I think the idea of DevOps has applications outside software; in fact a lot of the operational intelligence work I do sits somewhere between operations and product/service development.

I did skip the QlikView presentation for the sake of Splunk, but came back to the Big Data keynotes for Esri. Their presentation focused on the benefits of spatial analysis, giving the grid aggregation I wrote about in the past as an example, along with navigation, fraudulent transaction detection and even insurance for ships going into pirate-infested waters!

Finally, a joint act between Wejo and Talend on their connected car project. This was interesting for many reasons, from the technologies involved on the car, such as eCall, to the technology used for data processing, as Talend offers an open source ETL tool that can sit on top of Spark. On that latter front there was a certain emphasis on the benefits of having a unified infrastructure for batch processing and streaming, and a mention of the Apache Beam project as something Talend will support in the future to that end.

In the time left between all this I got to talk to some of the exhibitors. The most intriguing one was a Belgian company who make Anatella, which would catch the eye of anyone working in telecoms for its binary CDR processing capability. The demos included some fascinating social network graph analysis derived from phone calls: e.g. Flemish-speaking Belgians tend to call other Flemish-speaking Belgians and French-speaking Belgians tend to call other French-speaking Belgians, with the exception of Brussels, which proves to be bilingual territory not only in constitutional theory but also in telephony practice! There was a news story a few years ago on similar telephone 'connections' between UK regions; I remember Wales being split in three from north to south, with more contact with the corresponding adjacent English region to the east than with the rest of Wales to the north and/or south. Fascinating work that can escape the narrow confines of marketing into proper Geography research, not unlike the work of the Flowminder Foundation that I link to on the right.

Wednesday, 8 February 2017

The story of Pig

Yahoo! (don't forget the exclamation mark!) nowadays makes the news for all the wrong reasons: takeover by other companies, hacking by unnamed state actors several times in the past, not to mention the very dodgy advertising all over their websites. It wasn't always thus.

Cast your minds back to the mid noughties, and you'll remember Yahoo! acquiring the smaller stars of the web 2.0 constellation: Flickr, which they kept, del.icio.us, which they sold on, and Upcoming, which they retired. Of course Yahoo! had a history of acquiring companies with great products and messing them up, from GeoCities to LAUNCHcast. But by 2005 it seemed like they were suddenly getting it and becoming cool.

On the technology side, Yahoo! was a pioneer of Big Data, with open source projects such as Hadoop (the writer of that first blog post, Jeremy Zawodny, later summed up the story of that time nicely in his personal blog), Pig and other bits of that ecosystem becoming Apache projects rather than proprietary products.

One wonders if they would be better off now had they kept it as their own product. Maybe they would be the giants of cloud computing. Releasing it as open source, though, meant that it became an effective industry standard, with other companies contributing projects such as Hive (in fact, have a look at this blog post that details the use of both Pig and Hive inside Yahoo. If only someone updated it to add Spark SQL to the mix!). So if anything, if Yahoo! goes down, the Apache Hadoop ecosystem will probably survive.

Friday, 3 February 2017

Pig history, features, application and operations

Would you consider a training course with the agenda below?
• History of Pig
• What is Pig and Why Pig
• Pig Vs. MapReduce
• Features of Pig and Its Application
• Pig Data Model and Pig Operations
I think I'll stick to Hive.