
Tuesday, 25 December 2018

Managing the monthly Flow allocation using a scheduled FTP file grab

Running a flow for every new file in the FTP repository is eating away at my monthly allowance
Since I'm not interested in real-time monitoring, a more efficient approach is to fetch files only once a day. Note that the FTP directory listing gives a body over which we can run a 'foreach' loop, and then select only the wanted files, i.e. those whose filenames satisfy a condition.
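For the script-minded, the same idea looks roughly like this in Python (a minimal sketch: the host, the anonymous login and the '.xml' filename condition are all placeholders of mine; in Flow the equivalent is a recurrence trigger plus a 'foreach' over the listing):

    # Sketch of a daily grab: list the FTP directory once, then download
    # only the files whose names satisfy the condition.
    from ftplib import FTP

    HOST = "ftp.example.com"  # placeholder host

    with FTP(HOST) as ftp:
        ftp.login()  # anonymous; substitute credentials as needed
        names = ftp.nlst()  # the directory listing we loop over
        for name in (n for n in names if n.endswith(".xml")):  # filename condition
            with open(name, "wb") as f:
                ftp.retrbinary(f"RETR {name}", f.write)

Run it from a daily scheduled task (cron, Task Scheduler) and the per-file triggering, along with the allowance it burns, goes away.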
If you got this far, Merry Xmas!!!

Monday, 17 December 2018

Operational Intelligence basics: Looking for data loss with a scheduled data feed

It is worrying how many people, from the novice visualisation enthusiast to the experienced data scientist, simply assume all is well with the underlying dataset and go on to visualise it, feed it through algorithms, and so on.

My first use of Flow was to capture DX cluster data. I was requesting an XML file every few hours, each containing 500 records. The web source is designed for near-real-time monitoring, not for people like me to download a complete archive, so sometimes 500 records don't reach far enough back to overlap with the previous fetch, as shown on the morning of 9 December below. Colouring by filename (effectively the date and time of the fetch) helpfully shows that the gap coincides with a colour change, which very probably means data loss. The gap on 12 December, on the other hand, happens 'within' a particular colour band, i.e. in the middle of a fetched file, so it is probably a genuine lull in activity rather than data loss. I have since changed my flow to fetch more frequently.
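The same eyeball check can be scripted. Here is a rough pandas sketch, assuming the records have been landed in a spots.csv with 'timestamp' and 'source_file' (the fetch filename) columns, both names my own invention: a large gap that coincides with a change of source file is candidate data loss, while one inside a single file is more likely genuine inactivity.

    import pandas as pd

    df = pd.read_csv("spots.csv", parse_dates=["timestamp"])
    df = df.sort_values("timestamp").reset_index(drop=True)

    gap = df["timestamp"].diff()  # time since the previous record
    file_changed = df["source_file"].ne(df["source_file"].shift())

    big = gap > pd.Timedelta(hours=1)  # arbitrary threshold
    print("candidate data loss:\n", df[big & file_changed])
    print("probably just quiet:\n", df[big & ~file_changed])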

But what about the actual lack of activity? We can look in a bit more detail by adding frequency, in tens of kHz, on the vertical axis, and being a bit more careful about the thickness of the marks on the horizontal axis by using a calculated field with the constant 1/(24*60), i.e. one minute expressed as a fraction of a day.
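As an aside, here is roughly how that one-minute thickness works out in a matplotlib sketch (same assumed spots.csv, with an assumed 'freq_khz' column; in matplotlib's date units one day is 1.0, so one minute is 1/(24*60)):

    import matplotlib.dates as mdates
    import matplotlib.pyplot as plt
    import pandas as pd

    df = pd.read_csv("spots.csv", parse_dates=["timestamp"])
    x = mdates.date2num(df["timestamp"])  # 1.0 == one day
    y = df["freq_khz"] // 10              # frequency in 10s of kHz

    fig, ax = plt.subplots()
    ax.bar(x, height=1, width=1 / (24 * 60), bottom=y)  # one-minute-wide marks
    ax.xaxis_date()
    ax.set_ylabel("frequency (10s of kHz)")
    plt.show()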
Now compare this slow dying down and picking up again to a data gap below:

Sunday, 16 December 2018

Conditionals in Microsoft Flow for RSS feed processing

For the Flow aficionados amongst you, check out the Flow online conference videos on YouTube.
 
I've referred to the golden age of Web 2.0 at Yahoo! in a previous post. One of the cool products of that era was Yahoo! Pipes, which, true to its name, provided pipework for RSS feeds.

With similar applications in mind, Microsoft Flow can be a worthy replacement. I needed to retrieve the web page linked from an RSS feed only if the title contained a particular keyword; see the screenshot below. The Flow web editor does not allow dragging and dropping an action into the alternative paths following the conditional. I guess this will become possible when editing Flows in Visio becomes available. That might look like a peculiar combination to those of us who think of Visio as a tool for drawing diagrams, but Visio already has SharePoint workflow functionality, and Microsoft Flow is set to replace SharePoint workflows, as you'll hear in the conference videos.
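Outside Flow, the same one-branch conditional is a few lines of Python (a sketch only: the feed URL and keyword are placeholders, and the third-party feedparser package stands in for Flow's RSS trigger):

    import urllib.request
    import feedparser  # third-party: pip install feedparser

    FEED_URL = "https://example.com/feed.xml"  # placeholder
    KEYWORD = "keyword"                        # placeholder

    for entry in feedparser.parse(FEED_URL).entries:
        if KEYWORD.lower() in entry.title.lower():  # the conditional
            with urllib.request.urlopen(entry.link) as resp:
                page = resp.read()  # fetch the linked page only on a match
            print(f"fetched {entry.link}: {len(page)} bytes")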


Saturday, 8 December 2018

Fixing the flaws of Flow

In a previous post I complained about Flow not being able to compress files on the fly. The issue was that my files were not stored locally by default, so once I started the archiving process with 7-Zip, each file would first be downloaded locally and then added to the archive. On a flaky network something would often disrupt the download, which would then corrupt the archive. The workaround has been to specify that all the files in that folder are stored locally, thus decoupling the download from the archiving.
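The general pattern, written out as two explicit steps in Python (the paths are placeholders), is that the archive only ever sees files that have already landed in full:

    import shutil
    import zipfile
    from pathlib import Path

    SRC = Path("cloud_folder")  # placeholder: the files-on-demand location
    LOCAL = Path("staging")     # fully downloaded local copies
    LOCAL.mkdir(exist_ok=True)

    # Step 1: materialise every file locally; a flaky network fails here,
    # not halfway through writing the archive.
    for f in SRC.iterdir():
        shutil.copy2(f, LOCAL / f.name)

    # Step 2: archive from the local copies only.
    with zipfile.ZipFile("archive.zip", "w", zipfile.ZIP_DEFLATED) as z:
        for f in LOCAL.iterdir():
            z.write(f, arcname=f.name)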

I know, not something to write home about. Let's look at the config of the HTTP component in more detail. Oops! There is an automatic decompression button set to 'On' by default! Make sure it's off unless you really want decompression.
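The same gotcha exists when scripting a download: most HTTP clients also decompress transparently. A small sketch with Python's requests library (the URL is a placeholder) showing two ways to keep the bytes exactly as served:

    import requests

    URL = "https://example.com/archive.gz"  # placeholder

    # Option 1: ask the server not to compress the transfer at all.
    r = requests.get(URL, headers={"Accept-Encoding": "identity"})
    with open("archive.gz", "wb") as f:
        f.write(r.content)

    # Option 2: stream the body without decoding the content encoding.
    r = requests.get(URL, stream=True)
    with open("archive_raw.gz", "wb") as f:
        for chunk in r.raw.stream(1024, decode_content=False):
            f.write(chunk)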

As I had a few failed runs, I changed the retry policy from the exponential default, which starts with an interval of seconds, to a fixed interval of quite a few minutes. I find this works better for a website that has gone down temporarily, especially since I don't particularly need the flow to respond in real time. The fact that this can be done from inside the config, rather than by adding more steps to the flow, is something I learned from the great Serge Luca in his SharePoint Saturday presentation.
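In script form the equivalent would be a fixed-interval retry loop like this (the retry count and interval are arbitrary choices of mine, not Flow's values):

    import time
    import urllib.request

    def fetch_with_retry(url, retries=4, interval_minutes=10):
        """Fixed-interval retry: suits a site that is down for a while
        when nothing depends on a real-time response."""
        for attempt in range(1, retries + 1):
            try:
                with urllib.request.urlopen(url, timeout=30) as resp:
                    return resp.read()
            except OSError as exc:  # URLError/HTTPError are OSError subclasses
                if attempt == retries:
                    raise
                print(f"attempt {attempt} failed ({exc}); retrying")
                time.sleep(interval_minutes * 60)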