Yesterday was a funny day, so funny (as in not even slightly) I need to get it off my chest.
I have a geek problem. I am writing crawlers that go off and collect information about web sites that are somehow linked. A simple example: having identified a few hundred competitors, I then set out to gather information about them. I call this collection a cloud, not because it’s big but because it’s fluffy and random and difficult to manipulate, like clouds.
My problem is this. Once I’ve gathered a lot of data, I then find some new data that I want to add in. Adding in new data is easy if you don’t want to mess with your existing data but a spectacular pain if you have to start changing your database (and it contains 2 or 3 GB of stuff).
For a while I was happy with a tool called Django Evolution that lets you change your data model without exporting all your data and bringing it back in again… but it breaks. Django Evolution really doesn’t like it if you want to change a field from a TextField holding “23.45” into a FloatField holding 23.45. And it really doesn’t like it if you want to make big changes; evolution, in this case, only works in very, very small steps.
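For what it’s worth, the text-to-float change can be done by hand in one of those very small steps: add a new float column next to the old text one and copy the values across, leaving anything that won’t parse empty. The following is only a rough sketch of that idea, not what Django Evolution does internally. It uses Python’s built-in sqlite3 as a stand-in for MySQL, and the table name `site`, the columns `score` and `score_float`, and both helper functions are made up for illustration.

```python
import sqlite3


def to_float(text):
    """Return text as a float, or None if it does not parse."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def add_float_column(conn, table, text_col, float_col):
    """Add float_col to table and fill it from text_col, row by row."""
    # Table and column names cannot be bound as parameters,
    # so only pass names you trust (these are hypothetical).
    conn.execute(f'ALTER TABLE "{table}" ADD COLUMN "{float_col}" REAL')
    rows = conn.execute(f'SELECT rowid, "{text_col}" FROM "{table}"').fetchall()
    conn.executemany(
        f'UPDATE "{table}" SET "{float_col}" = ? WHERE rowid = ?',
        [(to_float(value), rowid) for rowid, value in rows],
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for the real database
    conn.execute("CREATE TABLE site (name TEXT, score TEXT)")
    conn.executemany(
        "INSERT INTO site VALUES (?, ?)",
        [("a.example", "23.45"), ("b.example", "n/a")],
    )
    add_float_column(conn, "site", "score", "score_float")
    print(conn.execute("SELECT name, score_float FROM site").fetchall())
```

The old text column stays in place, so nothing is lost if the conversion turns out to be wrong; dropping it can be a separate small step later.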
My other problem is this. At times I want to add in an arbitrary lump of data, for example a spreadsheet. I don’t want to have to manually create a table for it (with over a hundred columns) either, I just want to add it and see if it adds any value to my cloud. You’d think there’d be a (Python) tool for importing a spreadsheet into MySQL, wouldn’t you? MySQL’s LOAD DATA INFILE needs you to define all the columns first. At this point I can vaguely remember that on Windows you could set things up so any file, database or whatever was a data source, which right now sounds like a good idea.
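The kind of thing I have in mind would look roughly like this: export the spreadsheet as CSV, read the header row, and let that decide the columns. This is a sketch under assumptions, not an existing tool. It uses Python’s built-in csv and sqlite3 modules, with SQLite standing in for MySQL, every column is stored as plain text, and the names `load_csv`, `quote_ident` and `competitors` are invented for the example.

```python
import csv
import io
import sqlite3


def quote_ident(name):
    """Quote a table or column name for SQLite."""
    return '"' + name.replace('"', '""') + '"'


def load_csv(conn, table, csv_file):
    """Create table from the CSV header row and insert the data rows as text.

    Rows whose length does not match the header are skipped.
    Returns the number of rows inserted.
    """
    reader = csv.reader(csv_file)
    header = next(reader)  # first row supplies the column names
    columns = ", ".join(quote_ident(h) + " TEXT" for h in header)
    conn.execute(f"CREATE TABLE {quote_ident(table)} ({columns})")
    placeholders = ", ".join("?" for _ in header)
    rows = [row for row in reader if len(row) == len(header)]
    conn.executemany(
        f"INSERT INTO {quote_ident(table)} VALUES ({placeholders})", rows
    )
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for the real database
    sheet = io.StringIO("domain,pages\nexample.com,120\nexample.org,45\n")
    print(load_csv(conn, "competitors", sheet), "rows loaded")
```

It assumes the header names are unique and non-empty. Storing everything as text is deliberate: the point is to get the lump of data in quickly and decide later which columns deserve a proper type.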
Then I found Picalo, a tool for Data Analysis and Fraud Detection. Yes, Fraud Detection! It hasn’t spotted me yet though. It’s a lovely tool (written in Python) that “in theory” lets you wire together MySQL databases and spreadsheets. Then you can create tables based on queries or Python code so complex I really don’t understand it at all. It’s like proper maths and statistics… as in scary. The only problem with it is that it doesn’t want to play with my databases… So tantalisingly close to a solution, but not quite.
Picalo may still have legs though, I’ve not written it off just yet. It reminds me of a tool I used to use (I forget its name) for statistical analysis of web site log files. Because the tool crashed so often, that practice taught me back then that I’d simply “gathered too much data” to be able to do anything useful with it. Mm?! Maybe I should pay attention to that memory.
My real problem is this. Relational databases are a pain. They aren’t really suited to what I’m trying to achieve. At best they’re slow to work with. At worst, I spend more time making the database work than making the data work… if you know what I mean.
In desperation I tried adding the ZODB (Zope’s object database) to Django. Working with the ZODB brought back fond memories of when Zope was simple enough for me to use. I love the idea that you just make a Python class persistent and it “just gets stored”. I’m very tempted. Thinking about it, I should maybe use MySQL for the main part of my data and add an object layer of meaning on top of it… which is sort of what I did with Spinalot back in the day. Surely there has to be a better way. One of the things I “like” about MySQL is that when used with Django it’s very easy to create interfaces to explore my data, interfaces that can be adapted and tweaked on-the-fly.
I guess that Doug and Andy would tell me to have a go with DBXML like they have… but here my problems are these… firstly, it looks difficult, and secondly, until I get all my data in I don’t know if I’ll be able to get it in or get it out again.
There. A load of problems with no real answers… just what you wanted eh? Whinge over… for now.
I guess my ultimate problem, and maybe the answer to all my problems is this… I need geeky collaborators who are much better at all this than me. So if all of the above sounds understandable and easily fixable, do get in touch.