Saturday, August 26, 2006

Python 3000 Translation Strategy

The basic strategy for transforming Python programs to run on Python 3000 is to parse the code to generate an abstract syntax tree (AST), write each transformation as a separate pass over the AST, and emit the modified AST as source code. The real challenge is to preserve the formatting and comments of the original program in the transformed program. We addressed most of the major conceptual issues during the sprint, but there is a lot of work to do to flesh out the prototype.
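
As a rough illustration of that pipeline (not the sandbox code; this sketch is written against the much later ast module, and ast.unparse needs Python 3.9 or newer), a toy pass that rewrites xrange() calls as range() might look like this:

    import ast

    class XrangeToRange(ast.NodeTransformer):
        """One transformation pass: rewrite xrange(...) calls as range(...)."""
        def visit_Call(self, node):
            self.generic_visit(node)  # transform nested expressions first
            if isinstance(node.func, ast.Name) and node.func.id == "xrange":
                node.func = ast.copy_location(
                    ast.Name(id="range", ctx=ast.Load()), node.func)
            return node

    source = "total = sum( i*i   for i in xrange(10) )  # squares\n"
    tree = XrangeToRange().visit(ast.parse(source))
    print(ast.unparse(tree))
    # prints: total = sum(i * i for i in range(10))
    # The call is fixed, but the odd spacing and the comment are gone.

Regenerating source from the AST alone throws away the original layout, which is exactly the problem described below.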

A typical parser discards information about whitespace and comments at its lowest levels. It extracts only the program tokens. (In Python, those tokens do include indent and dedent tokens, but comments are completely ignored.) The AST discards even more of the concrete text of the program. There are no parentheses for grouping expressions and describing intended operator precedence. The Python parser does record the row and column of the start of each token, but this information is not present in the AST.
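
A quick way to see how much is lost is to dump the AST for a small expression (the exact output varies by Python version):

    import ast

    src = "(2 + 3)  # neither the parens nor this comment survive"
    print(ast.dump(ast.parse(src, mode="eval")))
    # roughly: Expression(body=BinOp(left=Constant(value=2), op=Add(),
    #                                right=Constant(value=3)))
    # No parentheses, no comment, and no positions for '(' and ')'.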

The first decision was to decorate each AST node with information about the concrete tokens that represent it. Each decorated node has a concrete attribute that is a list of tokens and child AST nodes, in the order they occur in the source. The expression (2 + 3) would have this sequence as its concrete attribute: [LPAREN, Num, Op(+), Num, RPAREN]. The effect is to create a second, parallel tree. You can iterate over the standard AST nodes and fields, or you can iterate over concrete. An in-order walk of the concrete tree yields all of the raw tokens in order.
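
A sketch of what that might look like (the Token type and walk_concrete helper are illustrative names, not the actual sandbox API):

    from collections import namedtuple

    # A concrete token: its type, text, and (row, column) start/end positions.
    Token = namedtuple("Token", "type string start end")

    def walk_concrete(node):
        """In-order walk of the concrete tree, yielding every raw token."""
        for item in node.concrete:
            if isinstance(item, Token):
                yield item                      # a literal token
            else:
                yield from walk_concrete(item)  # a child AST node: recurse

    # For (2 + 3), the decorated node's concrete list would be something like
    #   [Token(OP, '(', ...), <Num 2>, Token(OP, '+', ...), <Num 3>, Token(OP, ')', ...)]
    # so joining the tokens from walk_concrete() reproduces the original text.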

It's not entirely clear how to generate the concrete tokens for the AST. There are several options; the cleanest is to modify the Python parser to record this information when the client asks for it, and then have the concrete-to-abstract transformation phase pass it along. That may be the right thing in the long run, but it means modifying all sorts of C code, including the lowest levels of the parser (which I don't understand well) and the AST generation code (which I understand too well). If we had worked on that rewrite, we wouldn't have made any substantial progress during the sprint.

I decided to use the tokenize module to extract the tokens with full position information and write a small Python program to attach the tokens to the AST. The tokenize module uses a set of regular expressions and other hacks to tokenize Python source. For each token, it records the start and end positions (row number, column offset). It's possible to regenerate the exact source from these tokens in all but a few cases; one problem is that you can't tell whether the indentation used tabs or spaces.
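
Here is roughly what that raw token stream looks like (written for current Python, though tokenize.generate_tokens itself is old):

    import io
    import tokenize

    src = "if x:\n    y = (2 + 3)  # keep me\n"
    for tok_type, tok_str, start, end, line in tokenize.generate_tokens(
            io.StringIO(src).readline):
        print(tokenize.tok_name[tok_type], repr(tok_str), start, end)
    # NAME    'if'   (1, 0) (1, 2)
    # NAME    'x'    (1, 3) (1, 4)
    # OP      ':'    (1, 4) (1, 5)
    # NEWLINE '\n'   (1, 5) (1, 6)
    # INDENT  '    ' (2, 0) (2, 4)
    # ... NUMBER, OP, and COMMENT tokens follow, each with exact positions.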

The tokens-to-AST matcher program is not terribly complicated, because we already have a parse tree! You perform an in-order walk of the AST and consume tokens during the walk. The only real problem is writing down all the rules for which tokens an AST node should consume. An if statement, for example, will always consume an 'if' token, followed by some arbitrary expression tokens (handled by its child expression node), and a colon token. The rest of the rule is more complicated, because the else part of the statement is optional and the bodies of the then and else branches are of variable length. We need to generate a set of token-matching rules conditioned on the specific AST node, e.g. match an 'else' token only if the node's orelse attribute is non-empty.
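
To make that concrete, a matching rule for an if statement might look roughly like this (TokenStream and match_node are hypothetical names, layout tokens like NEWLINE/INDENT/DEDENT are glossed over, and elif, which shows up as a nested If in orelse, is ignored):

    class TokenStream:
        """Minimal cursor over the token list produced by tokenize."""
        def __init__(self, tokens):
            self.tokens = list(tokens)
            self.pos = 0
        def expect(self, text):
            tok = self.tokens[self.pos]
            assert tok.string == text, (tok, text)
            self.pos += 1
            return tok

    def match_if(node, tokens):
        """Attach concrete tokens to an If node by consuming them in order."""
        concrete = [tokens.expect('if')]                  # keyword token
        concrete.append(match_node(node.test, tokens))    # arbitrary expression
        concrete.append(tokens.expect(':'))
        for stmt in node.body:                            # variable-length body
            concrete.append(match_node(stmt, tokens))
        if node.orelse:                                   # else part is optional
            concrete.append(tokens.expect('else'))
            concrete.append(tokens.expect(':'))
            for stmt in node.orelse:
                concrete.append(match_node(stmt, tokens))
        node.concrete = concrete
        return node

Here match_node would just dispatch on the node's type to the appropriate per-node rule.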

More details to follow. The actual code is in the concrete module in the refactor sandbox.
