Thanks to the inspiration and help from Tony Hirst I managed to create a mini intelligence engine using Python, Google Docs and Yahoo Pipes.
A friend and colleague was interested in setting up a meeting at BETT next week at an event called TeachMeet. I checked out the web page and discovered it was a wiki where everyone added their names in order to register. Interestingly, they also added their blog addresses and Twitter identities.
Now at this point, knowing some of the people but not others, I wanted to see what they were all writing about in their blogs. This looked like an interesting bunch of people, but I wasn’t about to follow all those blog links or subscribe to all those blogs in an RSS reader; that would take too long and mess up my already messy RSS subscriptions.
The solution for me would normally mean writing Python crawlers to get the data, save the blog entries and then display them, but for a “one-off” application this would have been too much work. I needed something that made use of some of the great Web2.0 tools out there, like Tony Hirst does.
The solution needs to be able to…
- Get the URLs from the web page (avoiding standard site links)
- Follow those URLs to the pages
- Get the RSS feed of these pages (dealing with redirects to FeedBurner etc)
- Get the items from the RSS feeds
- Display them
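The trickiest of those steps is the third one: given a blog’s home page, finding its RSS feed. A common approach (not necessarily the one my script uses) is RSS autodiscovery, looking for a `<link rel="alternate" type="application/rss+xml">` tag in the page’s head; FeedBurner redirects are then handled for free, because an HTTP client that follows redirects ends up at the final feed URL anyway. A minimal sketch with the standard library, using a made-up snippet of HTML:

```python
from html.parser import HTMLParser

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

class FeedLinkParser(HTMLParser):
    """Collects hrefs from <link rel="alternate"> tags that advertise a feed."""
    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        if a.get("rel", "").lower() == "alternate" and a.get("type") in FEED_TYPES:
            self.feeds.append(a.get("href"))

# Hypothetical page head, the kind a FeedBurner-using blog would serve:
html = """<html><head>
<link rel="alternate" type="application/rss+xml" href="http://feeds.feedburner.com/example" />
</head><body></body></html>"""

parser = FeedLinkParser()
parser.feed(html)
print(parser.feeds)  # ['http://feeds.feedburner.com/example']
```

In a real crawl you would fetch each page with `urllib.request.urlopen()` (which follows redirects) before feeding the HTML to the parser.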
The hardest part was how to grab the links from the page. I had a go with Dapper.net, with which I’ve had some successes before. Here is a fairly messy list of links as an RSS feed made with Dapper. The Pipe I created to handle this feed, despite having filters, couldn’t quite de-messify it. Sometimes it seems that when working with tools like Dapper, Yahoo Pipes and Google Docs to gather and collate information, there is always a small shim that needs to be hand-made and slotted in at the right moment.
I resorted to using Python. Here is a script called get_blog_feeds.py: you point it at a web page with lots of links in it and it creates a spreadsheet that contains…
- the blog link
- the rss link
- the blog’s name
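The output half of such a script is straightforward: once you have your (blog link, RSS link, blog name) triples, write them out as CSV, which Google Docs imports directly. A sketch, with hypothetical example data rather than the script’s real output:

```python
import csv
import io

# Hypothetical rows in the shape the script produces:
# (blog link, rss link, blog's name)
rows = [
    ("http://example-blog.net/", "http://feeds.feedburner.com/exampleblog", "Example Blog"),
    ("http://another-blog.net/", "http://another-blog.net/feed/", "Another Blog"),
]

buf = io.StringIO()  # in the real script this would be open("blog_feeds.csv", "w", newline="")
writer = csv.writer(buf)
writer.writerow(["blog_url", "feed_url", "blog_name"])
writer.writerows(rows)
print(buf.getvalue())
```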
When it runs it looks like this…
I then imported the created spreadsheet into Google Docs.
You have to remember to set up the spreadsheet to “Publish as a web page” and then get the CSV URL (see below).
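Publishing the sheet as CSV is what makes it machine-readable downstream: Yahoo Pipes (or any script) can fetch that URL and get plain comma-separated rows back. As a sketch, here is how the published CSV would be consumed; the text is a hypothetical sample of what the sheet returns, since in practice you would `urllib.request.urlopen()` the published URL:

```python
import csv
import io

# Hypothetical body of the published-as-CSV Google Docs URL:
csv_text = (
    "blog_url,feed_url,blog_name\n"
    "http://example-blog.net/,http://feeds.feedburner.com/exampleblog,Example Blog\n"
)

# DictReader uses the header row, so each row is addressable by column name.
feed_urls = [row["feed_url"] for row in csv.DictReader(io.StringIO(csv_text))]
print(feed_urls)  # ['http://feeds.feedburner.com/exampleblog']
```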
The next step involves creating a Yahoo Pipe to get the blog items of each of those URLs. The trick here (thanks Tony) was to use an embedded Pipe to annotate each feed with the feed’s title, making the end results much more human-friendly…
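Outside of Pipes, the same annotation trick can be sketched in a few lines of Python: take each feed’s channel title and prefix it onto every item title, so that when the feeds are merged you can still tell who said what. The RSS snippet below is a made-up example, not one of the actual TeachMeet feeds:

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed:
rss = """<rss version="2.0"><channel>
<title>Example Blog</title>
<item><title>First post</title></item>
<item><title>Second post</title></item>
</channel></rss>"""

channel = ET.fromstring(rss).find("channel")
feed_title = channel.findtext("title")

# Annotate every item with its source feed's title, as the embedded Pipe does.
annotated = [f"{feed_title}: {item.findtext('title')}" for item in channel.iter("item")]
print(annotated)  # ['Example Blog: First post', 'Example Blog: Second post']
```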
I also discovered that a feed with tweets (and no visual styling for those tweets) was just too visually noisy.
It’s a shame I had to drop Python in there to fish out a list of RSS feeds, because that makes this solution only work for people happy with using the command line. If anybody has any suggestions on how to gather this info using Web2.0 tools, do let me know.
And the end result? An on-the-fly aggregator that you or I could use whenever we find a page full of interesting people, to discover “what they’re saying”… and I found some really interesting stuff.