I recently rediscovered this strange behaviour in Python’s Unicode handling.—Evan

Well, I’d say that’s a very English way of dealing with encoded text ;-)

BTW, how do you know that s came from the start of a file and not from slicing some already loaded file somewhere in the middle?—M.-A.

Well, the same argument could be applied to the UTF-16 decoder: how does it know that the string came from the start of a file, and not from slicing some already loaded file? The standard states that:

In the UTF-16 encoding scheme, U+FEFF at the very beginning of a file or stream explicitly signals the byte order.

So it is perfectly permissible to perform this type of processing if you consider a string to be equivalent to a stream.—Evan
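As an illustration of the quoted rule (a minimal sketch, assuming Python 3 and its built-in "utf-16" codec; the variable names are made up for this example), the same text encoded in either byte order decodes identically once a BOM is in front of it:

```python
import codecs

text = "héllo"

# Prefix each explicit-endian encoding with the matching BOM.
le_bytes = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
be_bytes = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# The auto-endian codec reads the BOM to choose the byte order.
decoded_le = le_bytes.decode("utf-16")
decoded_be = be_bytes.decode("utf-16")

print(decoded_le == decoded_be == text)  # True
```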

The programmer or the application might, but Python’s codecs don’t. The point is that this is also true of raw strings that happen to contain UTF-16 or UTF-32 data. The UTF-16 ("auto-endian") codec shouldn’t strip leading BOMs either, unless it has been told it has the beginning of the string.—Stephen
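The slicing scenario raised above can be sketched as follows (an illustrative example only, assuming Python 3; the sample text and the slice offset are invented for the demonstration). A U+FEFF that sits in the middle of the data is silently dropped when the tail is decoded with the auto-endian codec, because the codec cannot tell it is not at the start of a stream:

```python
# U+FEFF appears mid-text, where it is ordinary content rather than a BOM.
text = "ab\ufeffcd"
encoded = text.encode("utf-16-le")

# Slice from the middle: skip "ab" (two code units, four bytes).
tail = encoded[4:]

auto = tail.decode("utf-16")         # leading FF FE is taken as a BOM
explicit = tail.decode("utf-16-le")  # leading FF FE is kept as U+FEFF

print(repr(auto))      # 'cd'
print(repr(explicit))  # '\ufeffcd'
```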

The UTF-16 stream codecs implement this logic.

The UTF-16 encode and decode functions will, however, always strip the BOM from the beginning of a string.

If the application doesn’t want this stripping to happen, it should use the UTF-16-LE or UTF-16-BE codec, respectively.—M.-A.
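The difference between the stream logic and the one-shot functions can be seen with an incremental decoder (a rough sketch, assuming Python 3, where `codecs.getincrementaldecoder` exposes the stateful form of the codec; the chunk contents are invented). Only the BOM at the start of the stream is consumed; a later U+FEFF is passed through as content:

```python
import codecs

# A stateful decoder remembers that the start of the stream has been seen.
decoder = codecs.getincrementaldecoder("utf-16")()

chunk1 = codecs.BOM_UTF16_LE + "a".encode("utf-16-le")
chunk2 = codecs.BOM_UTF16_LE + "b".encode("utf-16-le")

first = decoder.decode(chunk1)   # BOM consumed, byte order fixed as LE
second = decoder.decode(chunk2)  # same bytes now decode to U+FEFF + "b"

print(repr(first))   # 'a'
print(repr(second))  # '\ufeffb'
```

By contrast, calling `chunk2.decode("utf-16")` on its own would strip the leading bytes again, since each one-shot call treats its input as the start of a stream.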

That sounds like it would work fine almost all the time. If it doesn’t, it’s straightforward to work around, and it would certainly be more convenient for the non-standards-geek programmer.—Stephen