Wednesday, June 21, 2006

Python development history by the numbers

Here are some more statistics and milestones in the history of Python development. One of the key events in the development was the migration of Python's CVS repository to SourceForge, which Guido announced on May 21, 2000. It represented the first time that people other than Guido's direct co-workers had direct access to the CVS repository.

At about the same time that Python started using SourceForge, Greg Wilson alerted us to the recently started Subversion project at We eventually started using Subversion five years later. Martin v. Loewis announced the migration on Oct. 19, 2005.

Given those couple of simple milestones, I wondered how many people had commit privileges for Python CVS over time. The Python Subversion/CVS history starts on Aug 9, 1990 and Guido remains the only committer until 1992 when Jack Jansen and Sjoerd Mullender start making changes. A small number of people got additional privileges over the years.
  • 1994: Barry Warsaw
  • 1995: Fred Drake and Roger Masse
  • 1997: Jeremy Hylton
  • 1998: Andrew Kuchling, Ken Manheimer, Greg Ward
  • 1999: Greg Stein, Just van Rossum
This list doesn't capture the people who were making contributions by sending patches. Tim Peters was making contributions from the early days. Many of my earlier checkins were adding code written by other people, including Andrew Kuchling's zlib improvements, Greg Ewing's extended call syntax patch, and Neil Schemenauer's garbage collection patches.

The migration to SourceForge allowed many more people to have access. There were nine committers in 1999 and 35 in 2000. The number grew to about 49 in 2004. This year 36 different people have made checkins.

We can also count the total number of checkins by year, which reflects the same kind of explosion of activity in 2000 when 26 new developers started committing code. The PythonLabs team (Barry, Fred, Guido, Tim, and I) were working full-time on Python for a good part of 2000, so that also contributed to the surge in developer activity.

The raw number of checkins per year was:
  • 1990: 105 checkins
  • 1991: 445 checkins
  • 1992: 627 checkins
  • 1993: 287 checkins
  • 1994: 578 checkins
  • 1995: 1249 checkins
  • 1996: 1544 checkins
  • 1997: 2159 checkins
  • 1998: 2718 checkins
  • 1999: 1897 checkins
  • 2000: 4035 checkins
  • 2001: 5508 checkins
  • 2002: 4238 checkins
  • 2003: 3563 checkins
  • 2004: 2529 checkins
  • 2005: 1273 checkins
  • 2006: 1599 checkins

Wednesday, June 14, 2006

Using indentation to represent program structure

I've heard Guido mention that Donald Knuth made an early suggestion that future programming languages would use indentation as a structuring mechanism. I decided to dig into that history a little bit today. Knuth quotes D. V. Schorre who writes
Since the summer of 1960, I have been writing programs in outline form, using conventions of indentation to indicate the flow of control.
Knuth quoted this in the context of his long paper "Structured Program with go to Statements" (p. 295) where he wrote:
It seems clear that languages somewhat different from those in existence today would enhance the preparation of structured programs. We will perhaps eventually be writing only small modules which are identified by name as they are used to build larger ones, so that devices like indentation, rather than delimiters, might become feasible for expressing local structure in the source language.
Knuth also mentions Peter Landin's ISWIM family of languages. In "The Next 700 Programming Languages," Landin discusses four levels of abstraction in programming languages--physical, logical, abstract, and applicative expressions. The physical and logical languages involve syntactic issues like grammatical rules for grouping textual elements. Landin also mentions that ALGOL "sought to avoid any commitment to any particular sets of characters or type faces."

Landin's preferred representation for ISWIM used indentation. The paper has a section that lists differences between ISWIM and LISP, including the "textual appearance of ISWIM."
(c) Indentation, used to indicate program structure. A physical ISWIM can be defined in terms of an unspecified parameter: a subset of phrase categories, instances of which are restricted in layout by the following rule called "the offside rule." The southeast quadrant that just contains the phrase's first symbol must contain the entire expression, except possibly for bracketed subsegments.... It is based on vertical alignment, not character width, and hence is equally appropriate in handwritten, typset or typed texts.
In the next sentence, he observes that indentation is not mandatory. It can be freely mixed with more conventional punctuation.

The discussion, presumably from the conference, starts with a discussion of indentation. Peter Naur raises an immediate objection, oen that is raised today about Python programs.
Regarding indentation, in many ways I am in sympathy with this, but I believe that if it came about that this notation were used for very wide communcation and also publication, you would regret it because of this kind of rearrangement of manuscripts done in printing.
Robert Floyd also objected:
It works on the micro-scale--that is, one page is all right--when dealing with an extensive program, turning from one page to the next there is no obvious way of indicating how far the indentation stretches because there is no printing at all to indicate how far you have indented.

Donald E. Knuth
Structured Programming with go to Statements
ACM Computing Surveysm, Vol. 6, No. 6, December 1974
pp. 261-301

D.V. Schorre
Improved organization for procedural languages
Technical memo TM 3086/002/00, Systems Development Corp., Santa Monica, Ca.
Sept. 8, 1966
(I haven't seen the original. This citation is just a copy of Knuth's.)

P. J. Landin
The Next 700 Programming Languages
Communications of the ACM, Vol. 9, No. 3, March 1966
pages 157-166

The paper was presented at the ACM Programming Languages and Pragmatics Conference, Aug. 8-12, 1965 in San Dimas, Ca. The printed version of paper includes some discussion, presumably from that conference.

Wednesday, June 07, 2006

Search and Sort

Two recent posts had interesting things to say about basic searching and sorting algorithms. Josh Bloch found a bug in the JDK binary search implementation. The bug is that the code to pick the mid point can overflow if the array being searched has more than a billion or so elements, leading to a nonsense value (negative number). My first reaction was that Python doesn't have this problem. Adding two numbers does the natural thing--promotes the result from int to long.

A contemporaneous post from Jim Roskind, who wrote the Python profile module among other things, talks about a bad input to quicksort.

We were typically sorting an array of length 1000. If I remember the numbers correctly, the profiler revealed that the hot comparison function was being called over 400,000 times per sort. This was a whole lot closer to what I’d expect from an n-squared sort algorithm than from an n * log(n) quick sort. Remembering that algorithms 101 taught that quick sort *can* be an n-squared sort if the pivots are “chosen poorly,” and this could commonly happen (with a simple quick sort implementation) when the list was already mostly sorted… I took a wild guess that this was the problem. To take a shot at working around this quick sort implementation, I needed to randomize the list prior to the sort. With Python, this was a snap (to try). I just put the list into a hash table, and then pulled it out (in its pseudo random order) before doing the sort. Sure enough, a factor of 20 performance improvement resulted (for our application), and a simple bug report went off to Guido. I always liked this story because it was a case where adding wasteful code was VERY helpful.
Note to Grail users: You can visual quicksort (probably not for billion-element inputs) using my visualization applet.

Note: Jim's post about quicksort stuck in my mind, somehow, and I read Josh's post as if it were about quicksort, which doesn't make any sense. One benefit of that gaffe is that Josh referred me to Engineering a Sort Function by Bentley and McIlroy.