I recently rediscovered this strange behaviour in Python’s Unicode handling.—Evan


The BOM (byte order mark) was a non-standard Microsoft invention to detect Unicode text data as such (MS always uses UTF-16-LE for Unicode text files).

It is not needed for UTF-8, because that format does not depend on byte order, and a BOM character at the beginning of a stream is a legitimate ZWNBSP (zero width no-break space) code point.

The "utf-16" codec detects and removes the mark, while the two others "utf-16-le" (little endian byte order) and "utf-16-be" (big endian byte order) don’t.—M.-A.
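The codec behaviour described above can be checked with a short sketch. It assumes a current Python 3 interpreter and uses only the standard `codecs` module; the sample string `"hi"` and the variable names are illustrative, not taken from the thread.

```python
import codecs

text = "hi"

# Build UTF-16 little-endian bytes with an explicit BOM (FF FE) in front.
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# The generic "utf-16" codec uses the BOM to pick the byte order, then drops it.
decoded_generic = data.decode("utf-16")

# The endian-specific codec treats the BOM as an ordinary U+FEFF character.
decoded_le = data.decode("utf-16-le")

print(repr(decoded_generic))  # 'hi'
print(repr(decoded_le))       # '\ufeffhi'
```

The same holds for "utf-16-be" with `codecs.BOM_UTF16_BE`: the leading U+FEFF stays in the decoded string.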

Well, its origins do not really matter since at this point the BOM is firmly encoded in the Unicode standard. It seems to me that it is in everyone’s best interest to support it.—Evan

You are correct: it is a legitimate character. However, its use as a ZWNBSP character has been deprecated:

The overloading of semantics for this code point has caused problems for programs and protocols. The new character U+2060 WORD JOINER has the same semantics in all cases as U+FEFF, except that it cannot be used as a signature. Implementers are strongly encouraged to use word joiner in those circumstances whenever word joining semantics is intended.

Also, the Unicode specification is ambiguous on what an implementation should do about a leading ZWNBSP that is encoded in UTF-16. Like I mentioned, if you look at the Unicode standard, version 4, section 15.9, it says:

2. Unmarked Character Set. In some circumstances, the character set information for a stream of coded characters (such as a file) is not available. The only information available is that the stream contains text, but the precise character set is not known.

This seems to indicate that it is permitted to strip the BOM from the beginning of UTF-8 text.—Evan
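For comparison, here is a minimal sketch of how a current Python 3 interpreter treats a leading BOM in UTF-8 data. The "utf-8-sig" codec used here is not mentioned in this thread; it is a standard-library codec that strips one leading signature if present. The sample data and variable names are illustrative.

```python
import codecs

# UTF-8 bytes for "hi" preceded by the UTF-8 signature EF BB BF.
data = codecs.BOM_UTF8 + "hi".encode("utf-8")

# The plain "utf-8" codec keeps the signature as a U+FEFF character.
kept = data.decode("utf-8")

# "utf-8-sig" removes one leading signature if it is there.
stripped = data.decode("utf-8-sig")

# Input without a signature decodes unchanged under "utf-8-sig".
plain = "hi".encode("utf-8").decode("utf-8-sig")

print(repr(kept))      # '\ufeffhi'
print(repr(stripped))  # 'hi'
print(repr(plain))     # 'hi'
```

Whether stripping should be the default remains the question under discussion; the sketch only shows that both behaviours are available to the caller.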