Web Data Add-in for Excel

Simplicity drives adoption. That maxim has held true for the web since the beginning. There are probably more tools and tutorials for authoring HTML than for any other language. It might even be safe to say that HTML is the most widely used and widely known language on earth.

But adoption comes at a price. It is rare that the original vision of the author persists in its purity over a period of time. Imagine a pristine spring gushing from the rare heights of a snow-capped mountain, its subsequent journey along a gravity-guided, rock- and crevice-hindered path to its destination, and what it collects on the way down. That is what adoption does to an idea.

This dilution is not a disadvantage per se, until you want to do something with the result of this adoption. Imagine wanting to automate the extraction of some sort of structured data from a web page. That should not be a problem as such: investigate the structure of the generated HTML and identify the data elements you need. Some elements of the layout will guide how you reach your data. With these as landmarks, you would write some text-parsing code to pick out the data you are interested in.

Now, what do you do if this has to be replicated across many different sites? Your assumptions about using layout to guide you to your data might no longer hold, so your code has to be adapted to suit each site. And therein lies the rub: each site has to be catered for individually. If for some reason the HTML of a site changes, you will have to adapt the extraction logic again.
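To make the fragility concrete, here is a minimal sketch of landmark-based extraction in Python, using only the standard library. Everything in it is an assumption made for illustration: the `prices` id, the two sample pages and the `extract_rows` helper are hypothetical and do not come from any real site or from the add-in discussed here.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects cell text from the <table> whose id matches a layout landmark."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id  # hypothetical landmark, e.g. "prices"
        self.in_table = False
        self.in_cell = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("id") == self.table_id:
            self.in_table = True
        elif self.in_table and tag == "tr":
            self.rows.append([])
        elif self.in_table and tag in ("td", "th"):
            if not self.rows:  # tolerate a cell that appears before any <tr>
                self.rows.append([])
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_table and self.in_cell:
            self.rows[-1][-1] += data.strip()


def extract_rows(html, table_id):
    """Return the rows of the landmark table as lists of cell strings."""
    parser = TableExtractor(table_id)
    parser.feed(html)
    parser.close()
    return parser.rows


# Two made-up pages showing the same data with different layouts.
SITE_A = (
    '<html><body><h1>Prices</h1><table id="prices">'
    "<tr><th>Item</th><th>Price</th></tr>"
    "<tr><td>Tea</td><td>3.50</td></tr>"
    "</table></body></html>"
)
SITE_B = (
    '<html><body><div class="prices">'
    '<div class="row"><span>Tea</span><span>3.50</span></div>'
    "</div></body></html>"
)

rows_a = extract_rows(SITE_A, "prices")  # the landmark exists, rows are found
rows_b = extract_rows(SITE_B, "prices")  # same data, different layout, nothing found
```

The extractor works on the first page only because its layout assumption (a table with a known id) happens to hold. The second page carries the same data in `div` and `span` elements, so the same code silently returns nothing and would need its own site-specific logic.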

The primary problem lies in using what is essentially a view markup language to infer the structure of the data that is displayed.

It is in this context that this blog post piqued my interest. I am noting it here for my own reference. An Excel Add-In that automates this extraction, while being resilient to changes in the layout of the data displayed, can only simplify many of the problems of extracting data from the Web.

I do not know the details of how this problem has been solved, but my wish would be for the core extraction piece to be a library, so that it can be used in third-party apps without having to go via Excel itself.

