Thursday, August 31, 2006

Barry would take it the hardest

When I heard the news, I knew Barry would take it the hardest. I can always recognize his code by the <> comparisons and the ^Ls between classes.

Monday, August 28, 2006

Future Python parsing strategy?

ANTLR 3.0 generates code for plain old C and someone has already written a Python grammar. I should learn enough about ANTLR to figure out if it could generate an AST for Python directly. It's certainly a richer toolkit.

Sunday, August 27, 2006

An example code transformation

I wrote a simple transformer that works for a trivial case. It demonstrates the basic structure of a transformer and highlights some hard issues still to resolve. It is much too hard to generate the AST nodes (and corresponding tokens) for the newly generated code.

Let's start with the easy part. The HasKeyTransformer looks for all calls of a has_key attribute. (There should be better type analysis to try to cope with cases where there is a non-dict with a has_key() method.) It starts with an in-order walk of the tree looking for has_key() calls. The search() method performs the search, deferring to the is_has_key_call() predicate to test each node. The predicate looks for a Call node where the function being called is an attribute of another object; i.e. the call is of the form x.y(...). If the y attribute is has_key, the predicate returns true.
import _ast
import concrete

class HasKeyTransformer:

    def search(self, tree):
        for node in concrete.TreeIter(tree):
            if self.is_has_key_call(node):
                yield node

    def is_has_key_call(self, node):
        if not isinstance(node, _ast.Call):
            return False
        callee = node.func
        if not isinstance(callee, _ast.Attribute):
            return False
        return callee.attr == "has_key"
The next step is to replace the Call node with a Compare node, using the object with the has_key method as the right-hand side of an "in" comparison. The new Compare node can re-use some of the old tokens and AST nodes but needs to generate its own "in" token. In some cases, it may need to add parentheses, too. The code turns ugly here.
    def replace(self, node):
        # Replace the Call node with a Compare node. The
        # base of the callee becomes the RHS and one argument
        # in the call becomes the LHS.

        the_dict = node.func.value
        the_key = node.args[0]
        strip_positions(the_key)

        new = _ast.Compare()
        new.left = the_key
        the_in = _ast.In()

        # We need to synthesize a full token for "in". Yuck!
        the_in.concrete = [(3, "in")]
        new.ops = [the_in]
        new.comparators = [the_dict]
        new.concrete = [new.left] + new.ops + new.comparators
        return new
The strip_positions() function removes the old position information from all the tokens representing the key. We are moving the key to the left of "in," so none of those positions are correct. We'll have to count on untokenize to do something reasonable. We ought to be able to preserve relative spacing, but that's a detail. It's hard to hook up the new node, because we need to fill in the concrete list and all the other node-specific attributes. It should be possible to generate helper functions that take the concrete tokens and nodes and hook up all the attributes correctly. The code will look better then. We could also have token helper functions so that you can avoid writing tuples.
    def replace(self, node):
        # Replace the Call node with a Compare node. The
        # base of the callee becomes the RHS and one argument
        # in the call becomes the LHS.

        the_dict = node.func.value
        the_key = node.args[0]
        strip_positions(the_key)

        return concrete.Compare(the_key, [token.STRING("in")], [the_dict])
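strip_positions() appears above without a definition. Here is a sketch of what it might look like, assuming tokens are stored as tuples whose first two fields are the type and the string; the real helper would live in the concrete module and may differ.

def strip_positions(node):
    # A sketch, not the real helper: truncate each token tuple to
    # (type, string), dropping the stale start/end coordinates, and
    # recurse into child AST nodes.
    for i, item in enumerate(node.concrete):
        if isinstance(item, tuple):
            node.concrete[i] = item[:2]
        else:
            strip_positions(item)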

Python tokenize module

The tokenize module generates a list of token tuples from Python source. The expression 1 + 1 would yield the following tokens:
NUMBER  '1'  (1, 0)  (1, 1)
OP      '+'  (1, 2)  (1, 3)
NUMBER  '1'  (1, 4)  (1, 5)
The first element of the tuple is the token type. The builtin Python tokenizer defines many token types, including NUMBER, NAME, STRING, and PLUS. The tokenize module uses some of those tokens, but, for example, generates a generic OP instead of PLUS in the example above. The next element is the token itself, a string. The next two elements are tuples that describe the start and end positions of the token in the original source.
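You can see these tuples for yourself with generate_tokens(), which takes a readline callable. (The stream also ends with NEWLINE and ENDMARKER tokens, omitted from the table above.)

import token
import tokenize
from StringIO import StringIO

for tok in tokenize.generate_tokens(StringIO("1 + 1").readline):
    tok_type, tok_string, start, end, line = tok
    print token.tok_name[tok_type], repr(tok_string), start, end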

The untokenize function takes a sequence of these tokens and returns a string of program source. I use this to generate source code for transformed Python programs. It mostly works, except that tokenize does not generate tokens for continuation markers--the backslash that indicates that a newline does not end a statement. I need to change tokenize to generate this token.

Martin and I also fixed a bug in tokenize's handling of newlines following comments. One good question is how many other bugs remain in tokenize. We could test it more thoroughly by running it over a large body of Python code and comparing its output to that of the parser module. You would have to extract the tokens from the parse tree and do some conversions, like PLUS to OP.
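A cheaper, cruder check is a round trip through untokenize. It won't catch tokens that are merely mislabeled, but exact string equality quickly flags anything tokenize dropped (today it fails on continuation lines, as noted above):

import tokenize
from StringIO import StringIO

def roundtrips(source):
    # Regenerate source from the token stream and compare exactly.
    tokens = tokenize.generate_tokens(StringIO(source).readline)
    return tokenize.untokenize(tokens) == source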

The other major change is for untokenize. It needs to be able to emit code for a mix of tokens with positions and tokens without positions. When code transformation takes place, it is impractical to compute new position information for the new tokens (or those that have been moved around). It would be incredibly tedious and no one would ever do it. Instead, we'll have to trust untokenize to do the right thing when it encounters a token with missing positions. It currently emits code for tokens without any position information, but the results are nearly unreadable, and it does not handle a mix of tokens with and without positions.
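One plausible strategy, sketched here with simplified 4-tuples rather than the real untokenize internals: honor recorded positions when they are present, and fall back to single-space separation when they are not.

def emit(tokens):
    # Sketch only: tokens are (type, string, start, end), where start
    # and end are (row, col) pairs, or None for synthesized tokens.
    parts = []
    row, col = 1, 0
    for tok_type, text, start, end in tokens:
        if start is not None:
            srow, scol = start
            if srow > row:
                parts.append("\n" * (srow - row))  # advance to the token's row
                row, col = srow, 0
            parts.append(" " * (scol - col))       # pad to the token's column
            col = scol
        elif parts:
            parts.append(" ")                      # no position: guess one space
            col += 1
        parts.append(text)
        col += len(text)
        if end is not None:
            row, col = end
    return "".join(parts)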

Saturday, August 26, 2006

Python 3000 Translation Strategy

The basic strategy for transforming Python programs to run on Python 3000 is to parse the code to generate an abstract syntax tree (AST), write each transformation as a separate pass over the AST, and emit the modified AST as source code. The real challenge is to preserve the formatting and comments of the original program in the transformed program. We addressed most of the major conceptual issues during the sprint, but there is a lot of work to do to flesh out the prototype.

A typical parser discards information about whitespace and comments at its lowest levels. It extracts only the program tokens. (In Python, those tokens do include indent and dedent tokens, but comments are completely ignored.) The AST discards even more of the concrete text of the program. There are no parentheses for grouping expressions and describing intended operator precedence. The Python parser does record the row and column of the start of each token, but this information is not present in the AST.

The first decision was to decorate the AST with information about the concrete tokens that represent that node. Each decorated AST node has a concrete attribute that is a list of tokens and AST nodes, in the order they occur. The expression (2 + 3) would have this sequence as its concrete attribute: [LPAREN, Num, Op(+), Num, RPAREN]. The effect is to create a second parallel tree. You can iterate over the standard AST nodes and fields or you can iterate over concrete. An in-order walk of the concrete tree yields all of the raw tokens in order.
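A sketch of that walk, assuming raw tokens are stored as tuples and everything else in a concrete list is a child node:

def all_tokens(node):
    # In-order walk of the parallel tree: yields every raw token in
    # source order. Tuples are tokens; anything else is a child node.
    for item in node.concrete:
        if isinstance(item, tuple):
            yield item
        else:
            for tok in all_tokens(item):
                yield tok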

It's not entirely clear how to generate the concrete tokens for the AST. Several options are available; the principled one is to modify the Python parser to record this information when the client asks for it, and then to modify the concrete-to-abstract transformation phase to pass it along. It may be the right thing, but it involves modifying all sorts of C code, including the lowest levels of the parser (which I don't understand well) and the AST generation code (which I understand too well). If we had worked on that rewrite, we wouldn't have made any substantial progress during the sprint.

I decided to use the tokenize module to extract the tokens with full position information and to write a small Python program to attach the tokens to the AST. The tokenize module uses a set of regular expressions and other hacks to tokenize Python source. For each token, it records the start and end positions (row number, column offset). It's possible to regenerate the exact source from these tokens in all but a few cases. One problem here is that you wouldn't know whether the source used tabs or spaces.

The tokens-to-AST matcher program is not terribly complicated, because we already have a parse tree! You perform an in-order walk of the AST and consume tokens during the walk. The only real problem is writing down the rules for which tokens each AST node should consume. An if statement, for example, will always consume an 'if' token, followed by some arbitrary expression tokens (handled by its child expression node) and a colon token. The rest of the rule is more complicated, because the else part of the statement is optional and the bodies of the then and else branches are of variable length. We need to generate a set of token-matching rules conditioned on the specific AST, e.g. match an 'else' token only if the node's orelse attribute is non-empty.
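Here is roughly what the rule for an If node might look like, using hypothetical consume() and attach() helpers and ignoring elif for brevity; the real rules live in the concrete module.

def match_if(node, tokens):
    # consume() pops and checks the next significant token;
    # attach() lets a child node claim the tokens it covers.
    consume(tokens, "if")
    attach(node.test, tokens)
    consume(tokens, ":")
    for stmt in node.body:
        attach(stmt, tokens)
    if node.orelse:                    # match 'else' only when present
        consume(tokens, "else")
        consume(tokens, ":")
        for stmt in node.orelse:
            attach(stmt, tokens)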

More details to follow. The actual code is in the concrete module in the refactor sandbox.

Sprint Report

I worked on a source-to-source translation system at the Google Python sprint. The system is intended to allow us to write transformations that upgrade programs from Python 2.x to Python 3.x. I finished a toy example that demonstrated that the approach will work. I translated a small program that uses has_key to an equivalent program that uses in.
d = dict(x=1)
if d.has_key("x"):
    print "yes"
else:
    print "no"
Is transformed to
d = dict(x=1)
if "x" in d             :
    print "yes"
else:
    print "no"
(Yes. There are 13 spaces between the d and the :.) The code is checked into the Python sandbox under refactor. I had a bunch of helpful conversations and did a little pair programming with Martin von Loewis.

Guido wrote a summary post about the Python 3000 sprint work. Update: I fleshed out many of the details in several subsequent posts.

iPod Data

My iPod holds a lot of data about what music I like. It's driven in large part by the ratings I provide, but it also records what I actually listen to. I looked at the play counts today to see what my favorite albums are (as opposed to what I think they are).

I know Sunday at the Village Vanguard, Waltz for Debby, and Getz/Gilberto would be at the top of the list, because I listen to them most mornings on the bus. I'm looking for something mellow then.

Top Jazz
  1. Charlie Parker, The Legendary Dial Masters
  2. Paul Motian, I Have the Room Above Her
  3. Madeleine Peyroux, Careless Love
  4. Tommy Flanagan, Sea Changes
  5. Tommy Flanagan, For Bird, Monk, Train, and Thad
Top Rock
  1. Fiona Apple, Extraordinary Machine
  2. Husker Du, Zen Arcade
  3. U2, Achtung Baby
  4. R.E.M., Murmur
  5. Modest Mouse, Good News for People Who Like Bad News
If I count by album, R.E.M. comes in with only the #4 album, but I've got lots of R.E.M. albums and I listen to all of them. Counting all the songs played by artist tells a different story.

Top Rock Artists
  1. R.E.M.
  2. Fiona Apple
  3. U2
  4. The Replacements
  5. The Pixies

Wednesday, August 23, 2006

Django Pronouncement

I've got nothing against Django, but I do find Guido's pronouncement odd. What's the point? Does Python need an official caffeinated beverage, too? (Clarification: I like Django.)

Greg Wilson has grumbled about the lack of a Python response to Ruby on Rails before, and I didn't find his arguments compelling then either. Ruby on Rails was the right tool at the right time--at least a good tool. It didn't become popular because a committee or dictator decided it would be popular. It became popular because people liked it. You can't decide to make a specific Python tool popular by fiat.

I've read a few good blog posts on the subject. Christopher Lenz makes some credible criticisms of Django. The problems he describes sound like the sorts of problems that Zope ran into years ago. Using Python as your configuration language? Yikes. (Almost as bad as using ZCML <0.5>.) Speaking of Zope, cguardia asks why Zope doesn't count in Greg's list of web frameworks. Ian Bicking has a better explanation of the problem.
Ian's comment on Greg's blog: Getting a good story together for new users is important. That story probably should not start with “analyzing what framework is best for you.”
Regardless of who the winner is or why you would care, the Django team should make sure it understands the lessons Zope learned about de-coupling components and configuration and about separating policy from mechanism. Update: Adrian Holovaty addresses my concerns in one of the comments. They may not like Zope's solutions to these problems, but Lenz's post made me wonder if they appreciate the problems yet.

Python Sprint Notes

We've finished three days of sprinting. I wanted to capture a few brief observations about the sprint. I've got a separate post underway that gets into the details of the source-to-source translation project.

Guido and Alex are working on the removal of the original 3-way compare function. In Python 3000, rich comparisons will be the only available comparison. The other key comparison change, already planned, is that comparison of unlike types will raise an exception by default. (Python 2.x defines an arbitrary total ordering, such that there is some way to sort a list containing numbers, sockets, and code objects.)
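A quick illustration of the 2.x behavior that is going away:

# Python 2.x invents a consistent ordering across unlike types:
items = [3, "two", 1.5, ()]
items.sort()    # works today; the cross-type order is arbitrary but stable

# Under the planned Py3k rules, the same sort raises TypeError.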

The 3-way compare shows up in several places in Python 2.x. The builtin cmp(a, b) function returns a number that indicates whether a is less than, equal to, or greater than b. A class can implement a __cmp__() (and __rcmp__()?) method to define how it compares to other objects. You can also pass a 3-way compare function to list.sort(). Goodbye to all of them.
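For the list.sort() case, the migration is mechanical in the common case, since key functions (available since Python 2.4) already cover it:

words = ["banana", "Apple", "cherry"]

# Python 2.x: a 3-way comparison function
words.sort(lambda a, b: cmp(a.lower(), b.lower()))

# The surviving spelling: a key function plus rich comparisons
words.sort(key=lambda w: w.lower())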

One of the last straws for Alex was this anomaly: When is it true that

   (x != y) and ([x] == [y])
?

When x and y are NaNs! (List comparison takes a shortcut: when two elements are the same object, it assumes they are equal without ever calling ==, and a NaN is not equal even to itself.) There's no easy way to fix this anomaly, because of the bits of comparison logic scattered through the interpreter.
>>> [x] == [y]
True
>>> x == y
False
>>> x
-1.#IND
>>> y
-1.#IND
Martin is working on unifying ints and longs. He is taking a perhaps unobvious approach: remove all the int code and see if the long code can be optimized to make small longs fast enough. An unexpected issue in this work was that bool inherits from int! Martin also changed bool so that it is no longer a number.

More video conference comments

The sprint has been quieter than I expected. We have had a few conversations between locations, but the quality isn't quite good enough. The biggest problem is the microphones. The rooms we are using are big and it's hard to hear someone who isn't intentionally talking into the microphone. The Mountain View room is a large classroom, where you must hold or wear the microphone to be heard in New York. We may do better if we get a conference room that is intended for meetings rather than classes. Despite the problems, it has been really handy to be able to talk on occasion.

Monday, August 21, 2006

Py3k Sprint Project

Guido suggested a Python 3000 project for me, albeit one that will be done mostly in Python 2.5. I'm going to work on tools for analyzing and refactoring Python 2.x code so that it runs in Python 3.x. To pick one example: uses of mydict.has_key(akey) need to be changed to "akey in mydict" (or mydict.__contains__(akey) in some odd cases).

I'm not sure that there is a lot of existing code to work from. I believe Bicycle Repair Man may handle some of these issues, but the SourceForge site is unreachable right now. My plan is to start with a simple tool that parses a Python source file, builds an AST annotated with token information, and rewrites the same code (preserving whitespace, comments, &c.). This minimal functionality will be useful for general source-to-source transformations. It will probably take a day or two to get it working.

Video conference hiccups

It's after lunch here in New York, and we're getting the kinks worked out of the videoconference system. We have a Tandberg Maestro on our end--a high-end system with two screens, one for the far-end camera and one for slides. It's just tricky to get all the screens configured and the microphones working on both ends. Once we got everything working, it was a pretty good deal. We can see and hear everyone in Mountain View clearly.

Guido is going through the Python 3000 sprint wiki to discuss potential tasks. Neal, Thomas, and I got started on some 2.x projects this morning, but I'll probably switch to Py3k this afternoon.

Google Python sprint underway

Neal Norwitz, Thomas Wouters, and I are sitting on the 16th floor at 1440 Broadway in New York for the Google Python Sprint. It's a beautiful day and there is lots of light coming in from 42nd St. I'm going to start with some simple restructuring of compile.c, breaking it out into its natural components. Most of the sprinters are in Mountain View. We'll be joining them by video conference in another hour or so.