I recently rediscovered this strange behaviour in Python’s Unicode handling.—Evan

Hmm, wouldn’t it be better to raise an error? After all, a reversed BOM in the stream looks a lot like you’re trying to decode a UTF-16 stream assuming the wrong byte order?!

Other than that: +1 on fixing this case.—M.-A.
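
For readers following along, here is a minimal sketch of the reversed-BOM case discussed above. It assumes a stream produced by a little-endian writer; the helper `decode_utf16_checked` is hypothetical (it is not part of Python) and only illustrates what "raise an error" could look like if done by the application.

```python
# U+FFFE is what the BOM (U+FEFF) looks like when its two bytes are swapped.
REVERSED_BOM = "\ufffe"

def decode_utf16_checked(data: bytes, encoding: str) -> str:
    """Hypothetical helper: decode with a fixed byte order, reject a reversed BOM."""
    text = data.decode(encoding)
    if text.startswith(REVERSED_BOM):
        # Report the first two bytes as the offending range.
        raise UnicodeDecodeError(
            encoding, data, 0, 2, "reversed BOM, wrong byte order assumed?"
        )
    return text

# A BOM followed by "hi", as written by a little-endian producer.
stream = b"\xff\xfe" + "hi".encode("utf-16-le")

print(repr(stream.decode("utf-16")))     # BOM is consumed and selects the byte order
print(repr(stream.decode("utf-16-le")))  # BOM is kept as a leading U+FEFF
print(repr(stream.decode("utf-16-be")))  # no error: starts with U+FFFE, rest is wrong

try:
    decode_utf16_checked(stream, "utf-16-be")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

Because the check raises `UnicodeDecodeError`, the caller can handle it like any other decoding failure.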

Cool!

Evan Jones—Evan

+1 on (optionally) raising an error. -1 on removing it or anything like that, unless under control of the application (i.e., the program written in Python, not Python itself). It’s far too easy for software to generate broken Unicode streams[1], and the choice of how to deal with those should be with the application, not with the implementation language.

Footnotes: [1] An egregious example was the Outlook Express distributed with early Win2k betas, which produced MIME bodies with apparent Content-Type: text/html; charset=utf-16, but the HTML tags and newlines were 7-bit ASCII!—Stephen

The advantage of raising an error is that the application can deal with the situation in whatever way seems fit (by registering a special error handler or by simply using "ignore" or "replace").

I agree that much of this lies outside the scope of codecs and should be handled at an application or protocol level.—M.-A.
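
As a rough illustration of the error-handling options mentioned above, the sketch below decodes a byte string containing an invalid UTF-8 byte using the built-in "ignore" and "replace" handlers, and then a custom handler registered with `codecs.register_error`. The handler name `mark-bad-bytes` and the function `mark_bad_bytes` are made up for this example.

```python
import codecs

def mark_bad_bytes(exc):
    """Example handler: replace undecodable bytes with a visible hex marker."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        # Return the replacement text and the position at which to resume decoding.
        return ("<0x" + bad.hex() + ">", exc.end)
    raise exc  # this sketch only handles decoding errors

codecs.register_error("mark-bad-bytes", mark_bad_bytes)

data = b"abc\xff"  # 0xFF can never appear in valid UTF-8

print(repr(data.decode("utf-8", errors="ignore")))
print(repr(data.decode("utf-8", errors="replace")))
print(repr(data.decode("utf-8", errors="mark-bad-bytes")))
```

With the default "strict" handler the same input raises `UnicodeDecodeError`, which is what lets the application pick one of these policies instead of having the codec decide.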