
Monday, 2 August 2010

Eta, Part II - Syntax (part I)

Many people have pointed out that language designers tend to obsess over syntax far too much and that their time would be better spent thinking about the semantics of their languages. Some (usually those who are either more academically inclined or old Lispers) go so far as to claim that syntax is ultimately irrelevant, since a) which syntax someone prefers is largely a matter of taste anyway and b) every syntax becomes 'natural' after sufficient exposure.
This topic has been discussed thoroughly, so I will add to it only as far as is needed to justify my own design decisions on the matter.
On the most abstract level a program can be thought of as a nested structure of operations applied to sub-units, which in turn consist of operations applied to sub-sub-units, and so forth. Within a compiler this structure is usually represented as a so-called AST (abstract syntax tree).
If we printed out an AST in parenthesized Polish notation, we would essentially end up with Lisp's syntax. This very elegant idea has a couple of advantages: it is extremely simple, easy to parse and totally generic. (Note, though, that the oft-heralded homoiconicity of Lisps is a red herring in my opinion - in every language that I know of it would not be difficult to represent a program's AST in the language itself.)
On the other hand - at least for me - this genericity makes programs more difficult to read, especially at a glance, since it lacks redundancy. In Lisp the only carriers of information about the structure of a program are the names of operations and the nesting structure. In most mainstream languages, however, syntax serves as an additional, redundant channel of communication. This redundancy makes it much easier to quickly grasp the structure of a piece of source code.
Have a look at this bit of C for example:

#include <math.h>

struct Point
    {
    float x, y;
    };

float point_dist(struct Point p1, struct Point p2)
    {
    float dx = p2.x-p1.x, dy = p2.y-p1.y;

    return sqrt(dx*dx + dy*dy);
    }

We can see that the same basic functionality is provided by very different syntactic elements depending on the context. The separation of terms, for example, is done by whitespace (top level), ',' (declarations) and ';' (statements). Grouping is done by '()' (arithmetic, not actually shown in this example), operator precedence (arithmetic) and '{}' (statements). The application of an operation to arguments is expressed either in infix notation (arithmetic), prefix with '()' (function calls), plain prefix (flow-control keywords) or implicitly (declarations).
Of course this mess is far removed from the theoretical purity of Lisp's S-expressions. However, it allows us to very quickly distinguish between different kinds of operations and different kinds of lists of terms. Looking for a declaration? Spot names separated by whitespace. Looking for a function call? Find a name followed by '()'. And so on.
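For contrast, the same function rendered as S-expressions might look roughly like this (a sketch, not from the original post; the point-x and point-y accessors are assumed):

```scheme
; every construct - definition, binding, arithmetic, function call -
; uses the same uniform shape: (operation arguments...)
(define (point-dist p1 p2)
  (let ((dx (- (point-x p2) (point-x p1)))
        (dy (- (point-y p2) (point-y p1))))
    (sqrt (+ (* dx dx) (* dy dy)))))
```

Here definition, binding and arithmetic are all expressed by the same nesting-plus-operator-name scheme, which is exactly the uniformity (and the lack of redundant visual cues) discussed above.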
Redundancy therefore clearly serves readability (or "glanceability"). Too much of it, on the other hand, will certainly have the opposite effect. The optimal syntax consequently adds just enough redundancy to improve readability. (Side note: there is also useless redundancy - Pascal is a lot more redundant than C, but mostly because it uses (longer) keywords instead of punctuation. In my opinion this reduces readability. A similar argument could be made for Java.)
To maximize the effect of syntax it is also important that there is as little ambiguity as possible in the correspondence between syntactic elements and semantic structure. A nice counter-example is provided by C++: by "overloading" old syntax with new meanings, it becomes a lot harder to read (quickly) than C.
In Eta I wanted the overall look to stay somewhere in the vicinity of a traditional curly-brace language. At the same time I wanted it to be as simple and regular as possible while defining an unambiguous relationship between syntactic elements and semantics. (side note: This sounds a lot more goal-oriented than it was. Actually it took me quite a while to find out that these were the goals I was aiming for.)
This post is already long enough, however, so I will postpone the details of Eta's syntax to the next post. As a small teaser, here is the example from above rewritten in Eta:

Point @ type : (x @ float, y @ float)

point_dist(p1 @ Point, p2 @ Point) @ float :
    {
    dx @ float : p2.x-p1.x
    dy @ float : p2.y-p1.y

    <- sqrt` dx*dx + dy*dy
    }

Monday, 26 October 2009

Scientific versus "regular" programming - part I

A huge part of the effort (and the resulting progress) in computer science is dedicated to making it easier for people to create better programs in a shorter amount of time. To this end new tools and methodologies are developed.
However, if we zoom in a bit, differences between the various areas of application of programming become obvious. Consequently, the demands placed on the required tools and methods differ as well.
In my field (theoretical biology), and I think generally in areas of science that require the development of simulation software, programming happens under very special conditions that lead to a unique set of requirements for the process of software creation.

what is a good program?

Many clever people have written whole books on the topic and I am certainly not an expert, but in a nutshell a good program in most situations has to fulfill these criteria:
  • correctness - It has to do the things it is supposed to do (and only those).
  • efficiency - It has to do them using a reasonable amount of resources (time, memory, etc.).
  • maintainability - It has to be reasonably easy to change the program in the future.
Helping people make programs conform to these criteria (or at least find out whether they do) with a reasonable amount of effort has been the main driving force behind the development of new languages, platforms, IDEs, coding conventions, etc. Accordingly, it is nowadays a *lot* easier to produce correct, efficient and maintainable code than it was, say, thirty years ago.


Although similar, the criteria for what makes a "good" program in a scientific context differ in important details.


efficiency

It has often been said (in many variations) that Moore's law made efficiency unimportant. This is certainly true in many areas as shown by the success of dynamic, interpreted (and horribly slow) languages such as Ruby or Python.
For someone who writes and uses simulation programs, however, time is always a limiting factor (disk space and memory are others, though less so in recent years). Given more time (or higher execution speed) it is possible to test more parameter combinations, build in more details, run more replicates or observe more long-term dynamics - all of which (might) lead to better results, which make for better publications, which will bring more fame, fortune and general happiness.

execution speed is important

maintainability

The need to program in a way that makes it easy (or at least possible) to change a program later on has led to the evolution of whole industries.
In the scientific context this is only an issue for library code and tools. Most of the code written by the scientist herself is usually a one-off effort and stored in the virtual attic after publication of the corresponding paper(s).
There is a related issue of understandability and clarity of code but I will talk about this later.

maintainability is (with certain caveats) a minor problem

correctness

I think program correctness is maybe the aspect where scientific programming differs most from "mainstream" programming.
The correctness of an operating system or a game is determined by how closely the program behaves according to its specification. Bugs are found by people running into situations where the program behaves in a way it shouldn't (crashes, rendering glitches, hangups, etc.). Observing the program is therefore ultimately how most bugs are detected in such a situation. Luckily this also means that the bugs with the strongest effect on program behavior tend to be the easiest to detect.
In a simulation, on the other hand, the behavior is the outcome of the program. The program is correct if that behavior is produced according to the specified rules. Of course simulations also have easily observable bugs (e.g. the program crashes), but these are not the dangerous ones. Many errors, however, "just" lead to wrong results. These bugs are very dangerous because they can go entirely undetected while making the whole program (or at least the work done with it) effectively worthless. Especially in more complicated simulations, the only way to find these bugs is, in principle, rigorous examination of the source code.
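To make the danger concrete, here is a minimal hypothetical sketch in C (not from the original post): one update step of logistic growth, n' = n + r*n*(1 - n/K), next to a version with a misplaced parenthesis. The buggy version compiles and runs without any complaint and produces plausible-looking numbers - they are simply wrong.

```c
/* Intended update step: n' = n + r*n*(1 - n/K) (logistic growth). */
double step_correct(double n, double r, double K)
    {
    return n + r*n*(1.0 - n/K);     /* step_correct(50, 0.1, 100) = 52.5 */
    }

/* A misplaced parenthesis: (1.0 - n)/K instead of (1.0 - n/K).
   No crash, no warning - the simulation just follows the wrong
   dynamics. */
double step_buggy(double n, double r, double K)
    {
    return n + r*n*(1.0 - n)/K;     /* step_buggy(50, 0.1, 100) = 47.55 */
    }
```

A test run does not crash and the trajectory still looks smooth and bounded; only a careful reading of the source (or comparison with known analytical results) reveals the error.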

correctness is essential, difficult to obtain and even more difficult to prove

clarity

This leads us directly to an additional criterion for program quality that is usually seen as a part of maintainability but in the context of scientific programming deserves in my opinion a bullet point on its own - clarity and legibility of the source code.
If some (serious) bugs can only be found by reasoning about the source code then it becomes of paramount importance to write the code in a way that makes it easy to reason about. In this sense clarity is a means to fulfill the correctness criterion.
In a scientific context, however, the source code of a program is more than just an intermediate stage towards producing a usable executable. An essential part of the way science happens is that one scientist's results have to be reproducible by other scientists. In the empirical fields that means that methods are published down to the last onerous detail. In a mathematical paper enough steps of a calculation are given that it is possible to retrace the authors' steps (for a suitable definition of 'possible'...). Given the notorious dissociation between source code and documentation, the code ultimately is the authoritative source on what a simulation does. (Unfortunately there is no real standard for the publication of source code yet, although most authors at least offer to provide it on request - but that is a different blog post.) Source code is therefore also a means of communication between scientists and should be written in a way that makes it as easy to understand as possible.
In my opinion this is a vastly underappreciated aspect of at least those programming courses for scientists that I am aware of.

clarity of source code is essential


It should be clear by now that producing a good program requires a specific approach in a scientific context. In the next part of this post I will explain what consequences the specific "socio-economic environment" of science has for programming. Then I will explore the consequences for the design of better tools for scientific programming.

update (27/10/09 10:28)

Please also check out the interesting comments on reddit.