I recently rediscovered this strange behaviour in Python’s Unicode handling.—Evan

Not exactly. I think that at the lowest level codecs should not implement complex mode-switching internally, but rather explicitly abdicate responsibility to a more appropriate codec.

For example, autodetecting UTF-16 on input would be implemented by a Python program that does something like

    data = stream.read()    for detector in [ "utf-16-signature", "utf-16-statistical" ]:        # for the UTF-16 detectors, OUT will always be u"" or None        out, data, codec = data.decode(detector)        if codec: break    while codec:        more_out, data, codec = data.decode(codec)        out = out + more_out    if data:        # a real program would complain about it        pass    process(out)

where decode("utf-16-signature") would be implemented

    def utf-16-signature-internal (data):        if data[0:2] == "\xfe\xff":            return (u"", data[2:], "utf-16-be")        else if data[0:2] == "\xff\xfe":            return (u"", data[2:], "utf-16-le")        else            # note: data is undisturbed if the detector fails            return (None, data, None)

The main point is that the detector is just a codec that stops when it figures out what the next codec should be, touches only data that would be incorrect to pass to the next codec, and leaves the data alone if detection fails. utf-16-signature only handles the BOM (if present), and does not handle arbitrary "chunks" of data. Instead, it passes on the rest of the data (including the first chunk) to be handled by the appropriate utf-16-?e codec.

I think that the temptation to encapsulate this logic in a utf-16 codec that "simplifies" things by calling the appropriate utf-16-?e codec itself should be deprecated, but YMMV. What I would really like is for the above style to be easier to achieve than it currently is.

BTW, I appreciate your patience in exploring this; after Martin’s remark about different mental models I have to suspect this approach is just somehow un-Pythonic, but fleshing it out this way I can see how it will be useful in the context of a different project.—Stephen