Python Web Crawler/Spider
August 3rd, 2006
You’d think there’d be one wouldn’t you?
I’m trying to write (well, hack together) a reliable web crawler in Python. I’m really surprised that there isn’t a wonderful thread-pool version or 10 for me to try out… but it doesn’t seem that there is…
I have a version that runs for a few thousand pages, then it just pops off to la-la land. Maybe it is a URL timeout thing, maybe URL parsing… either way it just freezes (the process is still running, it just stops making progress).
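If the freeze really is a timeout thing, one guess is that a fetch is blocking forever on a server that never answers. Here is a minimal sketch of a fetch with an explicit timeout, assuming a modern Python 3 and the standard urllib.request module. The function name fetch, the 10 second default and the opener parameter are my own placeholders, not anything taken from spider.py or ThreadPoolSpider.py.

```python
import urllib.request


def fetch(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return the page body as bytes, or None if the request fails or stalls.

    `opener` is a hypothetical hook so the network call can be swapped out.
    """
    try:
        # The timeout covers connecting and each blocking read, so a
        # silent server cannot hold the crawler forever.
        with opener(url, timeout=timeout) as response:
            return response.read()
    except (OSError, ValueError) as exc:
        # URLError and socket timeouts are both OSError subclasses;
        # ValueError is raised for malformed URLs such as a missing scheme.
        print("skipping %s: %s" % (url, exc))
        return None
```

Calling fetch("http://example.com/") would then either return the page bytes or give up after roughly ten seconds of silence instead of hanging. It does not explain a freeze caused by something else, like parsing.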
The one I’ve found that looks both powerful and extensible is: spider.py
Apart from Google itself, the only other one worth mentioning is ThreadPoolSpider.py, though I have no idea where I found it… and although I could hack it (as in make a bloody mess of it), I couldn’t easily extend it…
Come on the PythonLazyWeb, help me out… I don’t want the moon on a stick… just a simple web crawler….
UPDATE: I forgot to mention HarvestMan… which, well, is worth a mention…
I’m now having some success… the only remaining problem (probably) has something to do with threading: bits of code that should raise an exception just pause instead. I then hit Control-C, the error is reported, and then it just carries on as if everything was OK. How the hell does that happen? It’s like giving the code a nudge when it drops off for a quick snooze.
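I can’t say this is the cause, but one way to stop errors from quietly vanishing inside worker threads is to collect them in the main thread. A rough sketch, assuming a modern Python 3 with concurrent.futures (which my current code doesn’t use); crawl_all, handler and workers are made-up names for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl_all(urls, handler, workers=4):
    """Run handler(url) in a thread pool and return (results, errors) dicts."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to the URL it was submitted for.
        futures = {pool.submit(handler, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                # result() re-raises, in this thread, whatever the
                # worker thread raised while running handler(url).
                results[url] = future.result()
            except Exception as exc:
                errors[url] = exc
    return results, errors
```

With this shape, a handler that blows up on one page ends up in errors with its exception attached, and the other pages still get processed. A handler that blocks forever would still stall the pool, so it only helps alongside timeouts on the fetch itself.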
Any ideas anyone? This one really has me stumped….

April 10th, 2009 at 7:29 pm (#)
Thanks for the links.
Threading is a bit of a mystery… but I have a book with examples which might help me figure it out.
Check it out: http://www.amazon.ca/Python-Unix-Linux-System-Administration/dp/0596515820