I recently rediscovered this strange behaviour in Python’s Unicode handling.
—Evan
►
-1; there’s no standard for UTF-8 BOMs—adding it to the codecs module was probably a mistake to begin with.
—M.-A.
►
There is a standard for UTF-8 *signatures*, however.
—Stephen
►
With the UTF-8-SIG codec, it would apply to all operation modes of the codec, whether stream-based or from strings.
—Martin
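[Editor's note: in current Python the codec discussed here exists as "utf-8-sig". A minimal sketch of the behaviour in both operation modes, for context:]

```python
import codecs
import io

# String mode: the BOM is prepended on encode and stripped on decode.
data = "hello".encode("utf-8-sig")
assert data == codecs.BOM_UTF8 + b"hello"
assert data.decode("utf-8-sig") == "hello"

# Stream mode behaves the same way: the writer emits the BOM once,
# at the start of the output.
buf = io.BytesIO()
writer = codecs.getwriter("utf-8-sig")(buf)
writer.write("hello")
assert buf.getvalue() == codecs.BOM_UTF8 + b"hello"
```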
►
I’d suggest using the same mode of operation as we have in the UTF-16 codec:
—M.-A.
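[Editor's note: the UTF-16 codec's mode of operation, as it exists in current Python, can be sketched like this:]

```python
import codecs

# Encoding with "utf-16" writes a single BOM, in native byte order,
# at the start of the output.
encoded = "hi".encode("utf-16")
assert encoded[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Decoding reads the BOM, selects the byte order accordingly, and
# removes the BOM from the decoded result.
assert encoded.decode("utf-16") == "hi"
```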
►
I’ve actually been confused about this point for quite some time now, but never had a chance to bring it up.
—Nicholas
►
The codec writes a BOM in the first call to .write()—it doesn’t write a BOM before reading from the file.
—M.-A.
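[Editor's note: a small demonstration of this point with the UTF-16 stream writer — the BOM is emitted by the first .write() call only:]

```python
import codecs
import io

buf = io.BytesIO()
writer = codecs.getwriter("utf-16")(buf)
writer.write("a")
writer.write("b")

raw = buf.getvalue()
assert raw.startswith(codecs.BOM_UTF16)  # native-order BOM...
assert raw.count(codecs.BOM_UTF16) == 1  # ...written exactly once
assert raw.decode("utf-16") == "ab"
```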
►
Yes, see, I read a *lot* of UTF-16 that comes from other sources.
—Nicholas
►
Ok, but I don’t really follow you here:
—M.-A.
►
That’s about what we do now: we catch UnicodeError, add a BOM to the file, and read it again.
—Nicholas
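[Editor's note: a rough sketch of the workaround described, with a hypothetical helper name; whether the fallback branch ever triggers depends on the data and on the byte order the decoder assumes:]

```python
import codecs

def read_utf16_text(raw: bytes) -> str:
    # Hypothetical helper mirroring the workaround above: try a plain
    # UTF-16 decode; if that raises UnicodeError, prepend a BOM
    # (little-endian chosen here for illustration) and decode again.
    try:
        return raw.decode("utf-16")
    except UnicodeError:
        return (codecs.BOM_UTF16_LE + raw).decode("utf-16")

# With a BOM already present, the first decode succeeds directly:
assert read_utf16_text(codecs.BOM_UTF16_BE + "ab".encode("utf-16-be")) == "ab"
```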
►
Alternatively, the UTF-16BE codec could support the BOM, and do UTF-16LE if the "other" BOM is found.
—Martin
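[Editor's note: this suggestion could look roughly like the following sketch — a hypothetical function, not the actual codec implementation:]

```python
import codecs

def decode_utf16be_with_bom(raw: bytes) -> str:
    # A UTF-16BE decoder that honours a BOM, and falls back to
    # UTF-16LE when the "other" (little-endian) BOM is found.
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[2:].decode("utf-16-be")
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[2:].decode("utf-16-le")
    return raw.decode("utf-16-be")  # no BOM: big-endian, as labelled
```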
►
I’ve checked the various versions of the Unicode standard docs:
—M.-A.
►
There’s only one (corporate) person that matters: Microsoft.
—Stephen