Wednesday, June 07, 2006

Search and Sort

Two recent posts had interesting things to say about basic searching and sorting algorithms. Josh Bloch found a bug in the JDK binary search implementation. The bug is that the code to pick the mid point can overflow if the array being searched has more than a billion or so elements, leading to a nonsense value (negative number). My first reaction was that Python doesn't have this problem. Adding two numbers does the natural thing--promotes the result from int to long.

A contemporaneous post from Jim Roskind, who wrote the Python profile module among other things, talks about a bad input to quicksort.

We were typically sorting an array of length 1000. If I remember the numbers correctly, the profiler revealed that the hot comparison function was being called over 400,000 times per sort. This was a whole lot closer to what I’d expect from an n-squared sort algorithm than from an n * log(n) quick sort. Remembering that algorithms 101 taught that quick sort *can* be an n-squared sort if the pivots are “chosen poorly,” and this could commonly happen (with a simple quick sort implementation) when the list was already mostly sorted… I took a wild guess that this was the problem. To take a shot at working around this quick sort implementation, I needed to randomize the list prior to the sort. With Python, this was a snap (to try). I just put the list into a hash table, and then pulled it out (in its pseudo random order) before doing the sort. Sure enough, a factor of 20 performance improvement resulted (for our application), and a simple bug report went off to Guido. I always liked this story because it was a case where adding wasteful code was VERY helpful.
Note to Grail users: You can visual quicksort (probably not for billion-element inputs) using my visualization applet.

Note: Jim's post about quicksort stuck in my mind, somehow, and I read Josh's post as if it were about quicksort, which doesn't make any sense. One benefit of that gaffe is that Josh referred me to Engineering a Sort Function by Bentley and McIlroy.

No comments: