I recently rediscovered this strange behaviour in Python’s Unicode handling.—Evan

Well,—M.-A.

The programmer or the application might, but Python’s codecs don’t. The point is that this is also true of raw strings that happen to contain UTF-16 or UTF-32 data. The UTF-16 ("auto-endian") codec shouldn’t strip leading BOMs either, unless it has been told that it is looking at the beginning of the string.—Stephen

The UTF-16 stream codecs implement this logic.

The UTF-16 encode and decode functions will, however, always strip the BOM from the beginning of a string.

If the application doesn’t want this stripping to happen, it should use the UTF-16-LE or UTF-16-BE codec, respectively.—M.-A.
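The behaviour described above can be sketched as follows. This is an illustration rather than code from the thread, and it assumes a current Python 3 interpreter; the variable names are made up for the example. It decodes the same little-endian bytes with the auto-endian and the explicit-endian codec, then feeds a BOM to an incremental (stream-style) decoder at the start and again mid-stream.

```python
import codecs

# A little-endian BOM (FF FE) followed by the letter 'a'.
data = b"\xff\xfea\x00"

# The auto-endian codec interprets the leading BOM and removes it.
stripped = data.decode("utf-16")

# The explicit-endian codec leaves the BOM in place as U+FEFF.
kept = data.decode("utf-16-le")

# When encoding, the explicit-endian codec does not write a BOM either.
no_bom = "a".encode("utf-16-le")

# Incremental decoder: its state persists across chunks, so it knows
# whether it is still at the beginning of the stream.
decoder = codecs.getincrementaldecoder("utf-16")()
at_start = decoder.decode(b"\xff\xfe")    # leading BOM: consumed, sets byte order
payload = decoder.decode(b"a\x00")        # ordinary character
mid_stream = decoder.decode(b"\xff\xfe")  # same bytes later: kept as U+FEFF
```

Here `stripped` is `'a'` while `kept` is `'\ufeffa'`, and the incremental decoder drops only the BOM it sees first.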

That sounds like it would work fine almost all the time. If it doesn’t, it’s straightforward to work around, and it would certainly be more convenient for the non-standards-geek programmer.—Stephen