Last week, in order to avoid working on more important things, I started a small project I had wanted to do for quite a while. As mentioned before, I am really not happy with C++ as a language to write simulations in. Unfortunately, according to the Great Language Shootout, all the nice languages are way too slow to be seriously considered.
However - as aptly explained on the shootout page - comparing the speed of programming languages, or even of implementations of programming languages, is not very meaningful. Results vary widely depending on the application area.
Therefore the only reasonable thing to do is to implement a benchmark that is representative of the kind of program one is interested in, in all candidate languages. Which is exactly what I have started to do.
I have defined a reference model and implemented it in (very, very basic) C++ and (C++ish) D (and CUDA, which strictly speaking doesn't belong here - more on that in a later blog post, hopefully; I'm really bad at actually writing posts that I have announced before...).
To my great delight it turned out that the D program ran only about 10% longer than my basic C++ version. In principle this is clearly a small enough loss in speed to be compensated for by the nice improvements D offers over C++. I was really disappointed, though (you can see how much I would like to abandon C++), when I found out a bit later that PGO (profile-guided optimization) gives my C++ program another 20% boost. Add to that the fact that 64-bit D is still some way off and that value types are second-class citizens in D, and I'm not really sure whether I will do my next project in D or not. In any case it was nice to see the progress they have made. I am optimistic that my C++ days are numbered...
As for the shootout - I will try to find the time to add "proper" implementations in D and C++ (which hopefully will not perform much differently). I hope I will also be able to cover other interesting or up-and-coming languages, such as OCaml or BitC. If any one of my three readers feels willing and able to help out - just head over to the project page, read the model definition and hack away. I will be happy to post code/results.
Showing posts with label scientific programming. Show all posts
Sunday, 31 October 2010
Thursday, 28 October 2010
Expensive data
As a theoretical biologist a lot of my research involves burning a *lot* of CPU time on computer simulations of evolving animal populations.
Usually I run a program for thousands of generations, each of which consists of hundreds to thousands of time steps, during each of which hundreds of individuals interact with each other and their environment. This has to be replicated a dozen or so times with different seeds for the random number generator. The whole thing then has to be repeated for each combination of parameters I'm interested in.
To give an idea of the scale: running 10k generations (which has been argued to be far too little) of the simulation I am currently working on, on a typical node of my university's cluster (newish multi-core Opterons), takes about 3-5 hours. One standard sweep of the parameter space has 64 parameter combinations (which leaves out so many fascinating possibilities that it hurts) times 10 replicates each, thus 640 runs (each of those sets produces 4-5 GB of data, by the way).
In a typical project I tend to rewrite and change the simulation program many times, first of all to find bugs, but then also as part of an iterative process where I create the program, run it, look at the results, think about them, adjust my opinion about the model/question/approach, change the program, run it, etc.
For the latest incarnation of my current project (the 4th or 5th major one) I have now done 25 of the above-mentioned parameter sets. That means that for just one part of the project I have already used more than 7 CPU-years and produced more than 100 GB of data. And that is far from the end of it...
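For what it's worth, the 7-CPU-year figure checks out on the back of an envelope. The sketch below takes 4 hours per run as an assumed midpoint of the 3-5 hour range:

```cpp
#include <cassert>

// Back-of-envelope check of the numbers above; 4 h per run is an
// assumption (the midpoint of the quoted 3-5 h range).
constexpr int combos = 64;          // parameter combinations per sweep
constexpr int replicates = 10;      // replicates per combination
constexpr int sets = 25;            // parameter sets done so far
constexpr double hoursPerRun = 4.0;

constexpr int runsPerSet = combos * replicates;                 // 640 runs
constexpr double totalHours = sets * runsPerSet * hoursPerRun;  // 64000 h
constexpr double cpuYears = totalHours / (24.0 * 365.0);        // ~7.3 CPU-years
```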
Thursday, 22 July 2010
Eta
In the last two weeks, what began as a small attempt at writing a work-saving template system for individual-based simulations turned into my first sort-of, somewhat-working (but not so work-saving) compiler for Eta (or η).
Eta is a project I have been working on/thinking about for a couple of years now. It started off with my increasing dislike for all the syntactic and semantic warts of the bloated mess that is C++. At the time I was desperately looking for an alternative; however, everything I found that promised enough performance was either immature or only slightly less warty and bloated (side note: I think D is a much better language than C++ and I really hope it catches on. Still, it's far from being a good language, IMHO).
So I did what everybody who should instead really, really work on his PhD thesis (and I'm *not* talking about a PhD in Computer Science) would do - I started thinking about how to design my own language.
As others have said, language design tends to be more successful when it scratches a personal itch than when it attempts to solve other people's problems. In my case I really wanted a language that would make it easier and more fun to implement the simulations I am working on. That means the language had to be:
- fast. It *does* make a difference whether I have to wait 5 days or 8 days for a set of simulations to finish.
- strictly and statically typed. As I said before, the major difficulty when writing simulations is that errors are often silent. Every bit of static guarantee the compiler can give helps.
- interoperable with C/C++. I'm not going to reimplement all the libraries I use.
- expressive. For reasons of efficiency, dynamic operations are often out of the question in simulations. Therefore similar redundant patterns start to show up in lots of different places. E.g. if I add a new trait to an individual, it has to be read, set, initialized, mutated, written to the data file, read from a config file, etc. Some of this can be alleviated by advanced template wizardry, but in the end I usually fall back on external code generation sooner or later. My ideal language would have built-in compile-time macros to solve this problem.
- clear and unambiguous. One problem with C++ is that understanding what *exactly* a particular piece of code does can be non-trivial. Apart from syntactic idiosyncrasies, things like silent shadowing of globals and implicit conversion rules make it necessary to be aware of a large amount of context to understand local semantics. This is especially a problem for simulations, since (due to the lack of external tests) code review plays an essential role in ensuring their correctness. A good language should therefore reduce, as much as possible, the amount of context necessary to understand a piece of code.
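To make the "expressive" point concrete, here is a hypothetical C++ sketch (all names invented for illustration) of the boilerplate a single new trait drags along; adding a second trait means touching every one of these functions again:

```cpp
#include <iostream>
#include <random>

// Hypothetical individual with one heritable trait; every name here is
// invented purely for illustration.
struct Individual {
    double boldness = 0.5;   // the trait itself

    // The trait has to be initialized/mutated...
    void mutate(std::mt19937& rng) {
        std::normal_distribution<double> step(0.0, 0.01);
        boldness += step(rng);
    }
    // ...written to the data file...
    void writeData(std::ostream& out) const { out << boldness << '\n'; }
    // ...and read from a config file, etc.
    void readConfig(std::istream& in) { in >> boldness; }
};
```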
Apart from these general principles I had a couple of specific technical ideas about mechanisms I wanted to include in the language. I will leave the details on that, on how I implemented Eta and on what the language currently looks like, to the next post, however.
Tuesday, 4 May 2010
C++ sucks for simulations
Today I once again had to realize that C/C++ is really a bad language to use for numerical simulations. In two heavily scrutinized (and often-used) pieces of code I found two simple yet far-reaching bugs that would have been spotted immediately by the compiler or the run-time system, respectively, had I used a decent language.
I implemented the first version of the simulation I am currently (again) working on about six years ago. Since then I have made countless changes and refinements, although the general structure has remained the same (yay structured programming). In particular, the core functions of the model class (which implement what the model actually does) I tend to revise quite often, in an iterative process of checking results, trying to understand them and coming up with new ways to test whether what I come up with is what's actually happening.
After some time I usually pull the stable and general parts of the code I produce into a private library (which is available from SourceForge but so much a work in progress that I won't link to it).
One of the bugs occurred in the core model, the other one in the library, both in pieces of code I have checked and re-checked dozens of times. Both bugs were really, really stupid.
I will keep the story of how I actually found these bugs for another blog post; suffice it to say that one of them is a Heisenbug, i.e. it had no effect in the debug version of the program.
Bug 1:
1 const int sym_a = a.symmetry();
2 const int sym_b = b.symmetry();
3 enum {sym_random = 0, sym_noRoles = 1};
4
5 const bool roleA = sym_a == sym_noRoles ? true : rng(2);
6 const bool roleB = sym_b == sym_noRoles ? true :
7 (sym_a == sym_random ? !sym_a : rng(2));
Without going into too much detail, the basic idea behind this code is that two individuals a and b have to negotiate their "role" (roleA and roleB) in the conflict. How this negotiation happens depends on the mode of role assignment each individual prefers (sym_a and sym_b). In one particular mode (sym_random) b automatically has the opposite role of a. Therefore in line 7 it should be roleA instead of sym_a.
I know, it's just a typo, and a stupid one at that (although the code shown here is slimmed down quite a bit; in the original version the mistake is a bit more difficult to spot). Still, one reason it can go unnoticed is that C++ happily lets me convert between enum, int and bool without so much as a warning (BTW, the fact that I have to use int for sym_a and sym_b instead of an enum is due to another quirk of C++ - enums don't play well with IO).
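For completeness, a fixed version might look like the sketch below. The scaffolding (Individual, the rng helper) is invented here just to make the snippet self-contained; the actual fix is only the negation in the last line:

```cpp
#include <random>

// Invented stand-ins for the real classes, for illustration only.
struct Individual {
    int sym;
    int symmetry() const { return sym; }
};

std::mt19937 gen(42);
int rng(int n) { return std::uniform_int_distribution<int>(0, n - 1)(gen); }

enum {sym_random = 0, sym_noRoles = 1};

void assignRoles(const Individual& a, const Individual& b,
                 bool& roleA, bool& roleB)
{
    const int sym_a = a.symmetry();
    const int sym_b = b.symmetry();

    roleA = sym_a == sym_noRoles ? true : rng(2);
    roleB = sym_b == sym_noRoles ? true :
        (sym_a == sym_random ? !roleA : rng(2));   // fixed: !roleA, not !sym_a
}
```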
Bug 2:
1 /** Gives true with a probability of p. */
2 bool choice(float p)
3 {
4     return (*this)() < Type((this->getMax()) * p);
5 }
Ok, this one is slightly embarrassing. In a misguided attempt at optimisation I decided to rely on the input being valid (i.e. between 0 and 1). Which was sort of fine, since in the original version I had two debug asserts checking for validity. Unfortunately it turned out that if I switched on SSE* code generation in gcc, values of p==1.0 (which passed the asserts) led to silent overflows in the multiplication, so that the function always returned false. After I finally found it, the bug was easily fixed by bypassing the comparison for values <=0 and >=1.
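Sketched as self-contained code, the fix simply short-circuits the boundary values so that p == 1.0 never reaches the overflowing multiplication (the minimal RNG wrapper here is a hypothetical stand-in for the real library class):

```cpp
#include <cstdint>

// Hypothetical minimal RNG wrapper standing in for the real library class.
struct Rng {
    using Type = std::uint32_t;
    Type state = 12345u;

    Type operator()() {                        // LCG, purely illustrative
        state = state * 1664525u + 1013904223u;
        return state;
    }
    Type getMax() const { return 0xffffffffu; }

    /** Gives true with a probability of p. */
    bool choice(float p)
    {
        if (p <= 0.0f) return false;   // bypass the comparison...
        if (p >= 1.0f) return true;    // ...for the boundary values
        return (*this)() < Type(getMax() * p);
    }
};
```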
These two examples demonstrate two problems of C++ that make numerical programming significantly more error-prone than necessary. Implicit conversions and silent arithmetic errors are by far not the most common sources of bugs in my programs. However, when they do occur, the resulting bugs are among the hardest to find.
Unfortunately, all languages that would offer enough static and dynamic checking to avoid these types of bugs come with a strong efficiency penalty (and, yes, 50% longer runtime *is* too much). It seems for now we will just have to make do with a bad language.
Monday, 26 October 2009
Scientific versus "regular" programming - part I
A huge part of the effort (and the resulting progress) in computer science is dedicated to making it easier for people to create better programs in a shorter amount of time. To this end, new tools and methodologies are developed.
However, if we zoom in a bit, differences between the various areas in which programming is applied become obvious. Consequently, the demands placed on the required tools and methods differ as well.
In my field (theoretical biology), and I think generally in areas of science that require the development of simulation software, programming happens under very special conditions that lead to a unique set of requirements for the process of software creation.
what is a good program?
Many clever people have written whole books on the topic and I am certainly not an expert, but in a nutshell a good program in most situations has to fulfill these criteria:
- correctness - It has to do the things it is supposed to do (and only those).
- efficiency - It has to do them using a reasonable amount of resources (time, memory, etc.).
- maintainability - It has to be reasonably easy to change the program in the future.
Making it easier for people to make programs conform to these criteria (or at least to find out whether they do) with a reasonable amount of effort has been the main driving force behind the development of new languages, platforms, IDEs, coding conventions, etc. Accordingly, it is nowadays a *lot* easier to produce correct, efficient and maintainable code than it was, say, thirty years ago.
Although similar, the criteria for what makes a "good" program differ in important details in a scientific context.
efficiency
It has often been said (in many variations) that Moore's law made efficiency unimportant. This is certainly true in many areas as shown by the success of dynamic, interpreted (and horribly slow) languages such as Ruby or Python.
For someone who writes and uses simulation programs, however, time is always a limiting factor (disk space and memory are others, but less so in recent years). Given more time (or higher execution speed) it is possible to test more parameter combinations, build in more details, run more replicates or observe more long-term dynamics - all of which (might) lead to better results, which make for better publications, which will bring more fame, fortune and general happiness.
execution speed is important
maintainability
The need to program in a way that makes it easy (or at least possible) to make changes to a program later on has led to the evolution of whole industries.
In the scientific context this is only an issue for library code and tools. Most of the code written by the scientist herself is usually a one-off effort and stored in the virtual attic after publication of the corresponding paper(s).
There is a related issue of understandability and clarity of code but I will talk about this later.
maintainability is (with certain caveats) a minor problem
correctness
I think program correctness is maybe the aspect in which scientific programming differs most from "mainstream" programming.
The correctness of an operating system or a game is determined by how closely the program behaves according to its specifications. Bugs are found by people running into situations where the program behaves in a way it shouldn't (crashes, rendering glitches, hangups, etc.). Observing the program is therefore ultimately how most bugs are detected in such a situation. Luckily, this also means that the bugs with the strongest effect on program behavior tend to be the easiest to detect.
In a simulation, on the other hand, the behavior is the outcome of the program. The program is correct if the behavior is produced according to the specified rules. Of course, simulations also have easily observable bugs (e.g. the program crashes), but these are not dangerous. Many errors, however, "just" lead to wrong results. These bugs are very dangerous because they can go entirely undetected while making the whole program (or at least the work done with it) effectively worthless. Especially in more complicated simulations, in principle the only way to find these bugs is by rigorous examination of the source code.
correctness is essential, difficult to obtain and even more difficult to prove
clarity
This leads us directly to an additional criterion for program quality, one that is usually seen as part of maintainability but that in the context of scientific programming deserves, in my opinion, a bullet point of its own - clarity and legibility of the source code.
If some (serious) bugs can only be found by reasoning about the source code then it becomes of paramount importance to write the code in a way that makes it easy to reason about. In this sense clarity is a means to fulfill the correctness criterion.
In a scientific context, however, the source code of a program is more than just an intermediate stage towards producing a usable executable. An essential part of the way science happens is that one scientist's results have to be reproducible by other scientists. In the empirical fields that means that methods are published down to the last onerous detail. In a mathematical paper enough steps of a calculation are given that it is possible to retrace the authors' steps (for a suitable definition of 'possible'...). Given the notorious dissociation between source code and documentation, the code ultimately is the authoritative source on what a simulation does (unfortunately there is no real standard for the publication of source code (yet), although most authors at least offer to provide the source code on request - but that is a different blog post). Source code is therefore also a means of communication between scientists and should be written in a way that makes it as easy to understand as possible.
In my opinion this is a vastly underappreciated aspect of at least those programming courses for scientists that I am aware of.
clarity of source code is essential
It should be clear by now that producing a good program in a scientific context requires a specific approach. In the next part of this post I will explain the consequences that the specific "socio-economic environment" has for scientific programming. Then I will explore the consequences for the design of better tools for scientific programming.
update (27/10/09 10:28)
Please also check out the interesting comments on reddit.