Thursday 28 October 2010

Expensive data

My research as a theoretical biologist involves burning a *lot* of CPU time on computer simulations of evolving animal populations.

Usually I run a program for thousands of generations, each consisting of hundreds to thousands of time steps, during each of which hundreds of individuals interact with each other and their environment. This has to be replicated a dozen or so times with different seeds for the random number generator, and the whole thing then has to be repeated for each combination of parameters I'm interested in.
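Schematically, one run looks roughly like the toy sketch below (placeholder names, numbers and "biology" only, nothing like the actual program):

```python
import random

# Toy sketch of the run structure only; the real model is far more involved.
N_GENERATIONS = 100   # the real runs use thousands of generations (e.g. 10k)
N_TIMESTEPS = 100     # hundreds to thousands per generation in the real model
N_INDIVIDUALS = 100   # hundreds of interacting individuals

def run(mutation_rate, seed):
    rng = random.Random(seed)                                  # one RNG seed per replicate
    population = [rng.random() for _ in range(N_INDIVIDUALS)]  # some trait value
    for _ in range(N_GENERATIONS):
        for _ in range(N_TIMESTEPS):
            # stand-in for individuals interacting with each other and their environment
            i, j = rng.randrange(N_INDIVIDUALS), rng.randrange(N_INDIVIDUALS)
            if population[i] < population[j]:
                population[i] = population[j] + rng.gauss(0, mutation_rate)
    return sum(population) / N_INDIVIDUALS

# one parameter sweep: every parameter combination times a handful of replicate seeds
for mutation_rate in (0.01, 0.05, 0.1):
    for seed in range(10):
        print(mutation_rate, seed, run(mutation_rate, seed))
```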

To give an idea of the scale: running 10k generations (which has been argued to be far too few) of the simulation I am currently working on takes about 3-5 hours on a typical node of my university's cluster (newish multi-core Opterons). One standard sweep of the parameter space covers 64 parameter combinations (which leaves out so many fascinating possibilities that it hurts) times 10 replicates each, thus 640 runs; each such sweep produces 4-5 GB of data, by the way.

In a typical project I tend to rewrite and change the simulation program many times, first of all to find bugs, but then also as part of an iterative process: I create the program and run it, look at the results, think about them, adjust my view of the model/question/approach, change the program, run it again, and so on.

For the latest incarnation of my current project (the 4th or 5th major one) I have now done 25 of the above-mentioned parameter sweeps. That means that for just one part of the project I have already used more than 7 CPU-years and produced more than 100 GB of data. And that is far from the end of it...
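For the record, the back-of-the-envelope arithmetic behind those numbers (assuming roughly 4 hours per run):

```python
# Rough arithmetic behind the figures above.
hours_per_run  = 4          # runs take about 3-5 hours each
runs_per_sweep = 64 * 10    # 64 parameter combinations x 10 replicates = 640 runs
sweeps_done    = 25
gb_per_sweep   = 4.5        # each sweep produces 4-5 GB of data

cpu_hours = hours_per_run * runs_per_sweep * sweeps_done
print(cpu_hours / (24 * 365))        # ~7.3 CPU-years
print(gb_per_sweep * sweeps_done)    # ~112 GB so far
```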
