I recently rediscovered this strange behaviour in Python’s Unicode handling.
—Evan
►
-1; there’s no standard for UTF-8 BOMs—adding it to the codecs module was probably a mistake to begin with.
—M.-A.
►
There is a standard for UTF-8 *signatures*, however.
—Stephen
►
With the UTF-8-SIG codec, it would apply to all operation modes of the codec, whether stream-based or from strings.
—Martin
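[Editor's note: in current Python the codec discussed here exists as "utf-8-sig". A minimal sketch of the behaviour in both operation modes, for context:]

```python
import codecs
import io

# String mode: the BOM is prepended on encode and stripped on decode.
data = "hello".encode("utf-8-sig")
assert data == codecs.BOM_UTF8 + b"hello"
assert data.decode("utf-8-sig") == "hello"

# Stream mode behaves the same way: the writer emits the BOM once,
# at the start of the output.
buf = io.BytesIO()
writer = codecs.getwriter("utf-8-sig")(buf)
writer.write("hello")
assert buf.getvalue() == codecs.BOM_UTF8 + b"hello"
```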
►
I’d suggest using the same mode of operation as we have in the UTF-16 codec:
—M.-A.
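[Editor's note: the UTF-16 codec's mode of operation, as it exists in current Python, can be sketched like this:]

```python
import codecs

# Encoding with "utf-16" writes a single BOM, in native byte order,
# at the start of the output.
encoded = "hi".encode("utf-16")
assert encoded[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Decoding reads the BOM, selects the byte order accordingly, and
# removes the BOM from the decoded result.
assert encoded.decode("utf-16") == "hi"
```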
►
I’ve actually been confused about this point for quite some time now, but never had a chance to bring it up.
—Nicholas
►
The codec writes a BOM in the first call to .write()—it doesn’t write a BOM before reading from the file.
—M.-A.
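[Editor's note: a small demonstration of this point with the UTF-16 stream writer — the BOM is emitted by the first .write() call only:]

```python
import codecs
import io

buf = io.BytesIO()
writer = codecs.getwriter("utf-16")(buf)
writer.write("a")
writer.write("b")

raw = buf.getvalue()
assert raw.startswith(codecs.BOM_UTF16)  # native-order BOM...
assert raw.count(codecs.BOM_UTF16) == 1  # ...written exactly once
assert raw.decode("utf-16") == "ab"
```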
►
Yes, see, I read a *lot* of UTF-16 that comes from other sources.
—Nicholas
►
Ok, but I don’t really follow you here:
—M.-A.
►
That’s about what we do now: we catch UnicodeError, add a BOM to the file, and read it again.
—Nicholas
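[Editor's note: a rough sketch of the workaround described, with a hypothetical helper name; whether the fallback branch ever triggers depends on the data and on the byte order the decoder assumes:]

```python
import codecs

def read_utf16_text(raw: bytes) -> str:
    # Hypothetical helper mirroring the workaround above: try a plain
    # UTF-16 decode; if that raises UnicodeError, prepend a BOM
    # (little-endian chosen here for illustration) and decode again.
    try:
        return raw.decode("utf-16")
    except UnicodeError:
        return (codecs.BOM_UTF16_LE + raw).decode("utf-16")

# With a BOM already present, the first decode succeeds directly:
assert read_utf16_text(codecs.BOM_UTF16_BE + "ab".encode("utf-16-be")) == "ab"
```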
►
Alternatively, the UTF-16BE codec could support the BOM, and do UTF-16LE if the "other" BOM is found.
—Martin
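[Editor's note: this suggestion could look roughly like the following sketch — a hypothetical function, not the actual codec implementation:]

```python
import codecs

def decode_utf16be_with_bom(raw: bytes) -> str:
    # A UTF-16BE decoder that honours a BOM, and falls back to
    # UTF-16LE when the "other" (little-endian) BOM is found.
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[2:].decode("utf-16-be")
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[2:].decode("utf-16-le")
    return raw.decode("utf-16-be")  # no BOM: big-endian, as labelled
```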
►
I’ve checked the various versions of the Unicode standard docs:
—M.-A.
►
There’s only one (corporate) person that matters: Microsoft.
—Stephen