Sunday, August 27, 2006

Python tokenize module

The tokenize module generates a list of token tuples from Python source. The expression 1 + 1 would yield the following tokens:
NUMBER  '1'  (1, 0)  (1, 1)
OP      '+'  (1, 2)  (1, 3)
NUMBER  '1'  (1, 4)  (1, 5)
The first element of the tuple is the token type. The built-in Python tokenizer defines many token types, including NUMBER, NAME, STRING, and PLUS. The tokenize module uses some of those types but, as the example above shows, generates a generic OP instead of PLUS. The second element is the token text itself, a string. The last two elements are (row, column) tuples giving the start and end positions of the token in the original source.
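The token stream above is easy to reproduce. Here is a small sketch using the modern (Python 3) tokenize API, where generate_tokens takes a readline callable over the source; note that today's tokenizer also appends NEWLINE and ENDMARKER tokens after the expression:

```python
import io
import tokenize

# Tokenize the expression "1 + 1" and print each token's type name,
# text, and start/end positions.
source = "1 + 1\n"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

for tok in tokens:
    print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, tok.end)
```

The first three lines printed match the listing above: NUMBER, OP, and NUMBER, each with its source coordinates.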

The untokenize function takes a sequence of these tokens and returns a string of program source. I use this to generate source code for transformed Python programs. It mostly works, except that tokenize does not generate tokens for continuation markers, the backslash used to indicate that a newline does not end a statement. I need to change tokenize to generate this token.
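A quick round-trip check illustrates the happy path. This sketch uses the modern tokenize API and a statement continued inside parentheses (where no backslash is involved); with full five-element tokens, untokenize can use the position information to reconstruct the original layout:

```python
import io
import tokenize

# A statement spread over two lines inside parentheses.
source = "x = (1 +\n     2)\n"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# With full position info, untokenize reconstructs the source layout.
rebuilt = tokenize.untokenize(tokens)
print(rebuilt)
```

The documented guarantee is that the result tokenizes back to the same token sequence, so the conversion is lossless at the token level.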

Martin and I also fixed a bug in tokenize's handling of newlines following comments. A good question is how many other bugs remain in tokenize. We could test it more thoroughly by running it over a large body of Python code and comparing its output to that of the parser module. You would have to extract the tokens from the parse tree and apply some conversions, like PLUS to OP.
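As an aside, the PLUS-to-OP mapping mentioned here is something the current tokenize module (Python 3.3 and later) exposes directly: each TokenInfo has an exact_type attribute that recovers the specific token type behind a generic OP. A small sketch:

```python
import io
import tokenize

source = "1 + 1\n"
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    if tok.type == tokenize.OP:
        # tok.type is the generic OP; exact_type recovers PLUS, MINUS, etc.
        print(tokenize.tok_name[tok.type], "->", tokenize.tok_name[tok.exact_type])
```

For the '+' token this prints "OP -> PLUS", which is exactly the kind of conversion a parser-based comparison would need.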

The other major change is to untokenize. It needs to be able to emit code for a mix of tokens with positions and tokens without positions. When code transformation takes place, it is impractical to compute new position information for the new tokens (or those that have been moved around); it would be incredibly tedious and no one would ever do it. Instead, we'll have to trust untokenize to do the right thing when it encounters a token with missing positions. It currently emits code for tokens without any position information, but the results are nearly unreadable, and it does not handle a mix of tokens with and without positions.
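Today's untokenize accepts exactly this kind of mix: once it hits a bare (type, string) two-tuple, it switches to a compatibility mode with heuristic spacing for the remaining tokens. A sketch of a transformation that replaces number literals without computing new positions:

```python
import io
import tokenize

source = "x = 1 + 1\n"
toks = []
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    if tok.type == tokenize.NUMBER:
        # Replacement token: positions dropped, only (type, string) given.
        toks.append((tokenize.NUMBER, "42"))
    else:
        toks.append(tok)

# Position info is honored up to the first 2-tuple; after that,
# untokenize falls back to heuristic spacing.
rebuilt = tokenize.untokenize(toks)
print(rebuilt)
```

The spacing in the compatibility-mode portion is ugly, much as the post complains, but the result still tokenizes to an equivalent program.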
