Python Web Crawler/Spider

August 3rd, 2006

You’d think there’d be one wouldn’t you?

I’m trying to write/hack a reliable web crawler together in Python. I’m really surprised that there isn’t a wonderful thread-pool version (or ten) for me to try out… but it doesn’t seem that there is…

I have a version that runs for a few thousand pages, then it just pops off to la-la land. Maybe it is a URL timeout thing, maybe URL parsing… either way it just freezes (the process is still running, it just stops making progress).
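If the freeze really is a stalled connection (and that is only a guess), one thing worth trying is an explicit timeout on every request, so a dead server raises an exception instead of blocking forever. Below is a minimal sketch for current Python 3 using only the standard library. The name `fetch`, the `opener` parameter and the 10-second default are made up for illustration; they are not taken from any of the crawlers mentioned here.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return the page body as bytes, or None if the request fails or stalls.

    `opener` is a hypothetical hook so the function can be exercised
    without touching the network; by default it is urllib's urlopen.
    """
    try:
        # The timeout bounds each blocking socket operation, so a server
        # that stops responding raises instead of hanging the crawler.
        with opener(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, socket.timeout, ValueError) as exc:
        # ValueError covers malformed URLs such as a missing scheme.
        print("skipping %s: %s" % (url, exc))
        return None
```

Returning `None` on failure is just one choice; it lets the crawl loop skip a bad page and carry on rather than dying a few thousand pages in.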

The one I’ve found that looks both powerful and extensible is: spider.py

Apart from Google itself, the only other one worth mentioning is ThreadPoolSpider.py, though I have no idea where I found it… and although I could hack it (as in bloody mess), I couldn’t easily extend it…

Come on, PythonLazyWeb, help me out… I don’t want the moon on a stick… just a simple web crawler…

UPDATE: I forgot to mention HarvestMan… which, well, is worth a mention…

I’m now having some success… with the only remaining problem (probably) being something to do with threading… whereby bits of code that should raise an exception just pause. I then hit Control-C, the error is reported, and then it just carries on as if everything was OK. How the hell does that happen? It’s like giving the code a nudge when it drops off for a quick snooze.
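I can't say this explains the pause, but one way to make sure errors in worker threads get reported instead of vanishing is to collect them in the main thread. Here is a rough sketch with the standard library's `concurrent.futures` (current Python 3). The name `fetch_all` and its parameters are invented for the example, and `fetch` stands for whatever function downloads a single page.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch, workers=4):
    """Run fetch(url) for every URL on a small thread pool.

    Returns (results, errors): two dicts keyed by URL, so a failure in
    one worker is recorded rather than lost.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to the URL it is working on.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                # result() re-raises, in this thread, any exception
                # that the worker thread hit while running fetch(url).
                results[url] = future.result()
            except Exception as exc:
                errors[url] = exc
    return results, errors
```

After a run, `errors` holds the exception object for each URL that failed, which at least shows what went wrong and where without needing a Control-C to shake it loose.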

Any ideas anyone? This one really has me stumped….

Responses

  1. Ivan says:

    April 10th, 2009 at 7:29 pm (#)

    thx for the links

    threading is a bit of a mystery… but I have a book with examples which might help me figure it out.

    Check it out: http://www.amazon.ca/Python-Unix-Linux-System-Administration/dp/0596515820
