Tuesday, March 31, 2009

Python strings and bytes

At the PyCon sprints, we looked into a lot of bugs in the standard library caused by interactions between strings and bytes.  (A string holds a sequence of characters.  A bytes object holds a sequence of bytes, i.e. integers in the range 0-255.)  I help maintain httplib and urllib, which read raw bytes from a socket and often convert them into strings.  The details of those conversions are sometimes tricky.  The rules for strings and bytes changed drastically in Python 3.0.  Most of the standard library was converted from old to new automatically (by 2to3), and many of those conversions were incorrect.
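To make the distinction concrete, here is a small sketch of how the two types differ in Python 3.  The sample word and the choice of UTF-8 are only illustrative assumptions.

```python
text = "caf\u00e9"               # str: 4 characters (the last one is e-acute)
data = text.encode("utf-8")      # bytes: 5 bytes, because e-acute takes two in UTF-8

print(len(text), len(data))      # 4 5

# Indexing a str gives a one-character str; indexing bytes gives an int.
print(type(text[0]), type(data[0]))

print(list(data))                # [99, 97, 102, 195, 169]
```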

A harmless example comes from httplib, where an if / elif statement had one test for regular strings and another for unicode strings.  The conversion tool turned both into tests for str.  The code looked like this:

    if isinstance(buf, str):  # regular strings
        # do something
    elif isinstance(buf, str):  # unicode strings
        # do something else

In this case, the second branch could be deleted.  In other cases, the effects were harmful.  If you passed a bytes object as the body argument in an HTTP request (passing form params in a POST request is a common case), the bytes object would be converted via str() to a string.

    >>> body = b"key=value"
    >>> str(body)
    "b'key=value'"

That is, str() with no encoding argument uses repr() to convert bytes to a string, so the b'' wrapper becomes part of the text.  That's simply incorrect.
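For comparison, here is a sketch of the conversions a caller probably meant.  For an HTTP body the safest option may be to leave the bytes alone, but if text is really wanted, the bytes have to be decoded with an explicit encoding.  The choice of "ascii" is an assumption that happens to fit this form-encoded example.

```python
body = b"key=value"

# str() with no encoding falls back to repr(), so the b'' wrapper leaks into the text.
wrong = str(body)

# Decoding with an explicit encoding yields the actual characters.
# "ascii" is an assumption that suits this particular body.
right = body.decode("ascii")
also_right = str(body, "ascii")   # equivalent spelling

print(wrong)   # b'key=value'
print(right)   # key=value
```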

It will take a long time to sort out all of these problems.  We don't have a lot of experience from application developers who are using Python 3.0, so we have to invent solutions as we go along.  We're likely to make mistakes or at least make sub-optimal API decisions.

I can think of two things that would help us make progress.

First, we ought to organize a systematic effort to review the standard library.  How many of the libraries have plausible tests that exercise both strings and bytes?  For example, the json library was carefully tested with strings and unicode in Python 2.x.  Those tests have all been converted to use strings, so now we have a thorough set of tests for strings and none at all for bytes.
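As a rough sketch of what a paired test might look like, the following runs the same check once on text and once on bytes.  The class name and the sample document are hypothetical, and the bytes case decodes explicitly as UTF-8 rather than assuming the library accepts bytes itself.

```python
import json
import unittest


class StrAndBytesTest(unittest.TestCase):
    """Run the same check on text input and on bytes input."""

    DOC = '{"key": "value", "n": 1}'
    EXPECTED = {"key": "value", "n": 1}

    def test_str_input(self):
        self.assertEqual(json.loads(self.DOC), self.EXPECTED)

    def test_bytes_input(self):
        raw = self.DOC.encode("utf-8")   # stands in for bytes read from a socket
        # Decode explicitly instead of relying on the library to accept bytes.
        self.assertEqual(json.loads(raw.decode("utf-8")), self.EXPECTED)


# Run the two tests without exiting the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(StrAndBytesTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```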

Second, we need to collect a set of best practices for writing libraries that support bytes and unicode.  A typical pattern is that bytes get sent on the wire.  (Wires, almost by definition, send bytes.)  The applications that use the wire usually want to deal with strings, which means they need to have some way to specify an encoding to use when sending to or reading from the wire.  We could start by collecting all the patches and bug fixes that have gone into Python 3.1 to fix string and bytes problems with 3.0.
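One possible shape for that pattern, as a sketch only: the library keeps bytes on the wire and lets the caller name the encoding at the boundary.  The helper names send_text and read_text are hypothetical, the UTF-8 default is an assumption, and io.BytesIO stands in for a real socket.

```python
import io


def send_text(wire, text, encoding="utf-8"):
    """Encode text at the boundary and write the resulting bytes."""
    wire.write(text.encode(encoding))


def read_text(wire, encoding="utf-8"):
    """Read raw bytes from the wire and decode them for the application."""
    return wire.read().decode(encoding)


wire = io.BytesIO()                    # stand-in for a socket
send_text(wire, "na\u00efve", encoding="utf-8")
raw = wire.getvalue()                  # what actually travelled: bytes
wire.seek(0)
received = read_text(wire, encoding="utf-8")

print(raw)                             # b'na\xc3\xafve'
```

The application only ever sees strings, and the encoding is an explicit argument rather than something guessed by str().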

1 comment:

Kevin said...

I am using the python json library to get json content which is returned in bytes. But as I convert it to string using str(), I face the problem that you mentioned above.

So the json.loads cannot directly work on it.