I recently rediscovered this strange behaviour in Python’s Unicode handling.—Evan

►

-1; there’s no standard for UTF-8 BOMs—adding it to the codecs module was probably a mistake to begin with.—M.-A.

►

There is a standard for UTF-8 signatures, however.—Stephen

►

With the UTF-8-SIG codec, it would apply to all operation modes of the codec, whether stream-based or from strings.—Martin
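
As an illustration of "all operation modes", the sketch below decodes the same bytes three ways with the `utf-8-sig` codec that ships in current Python: in one shot, through a stream reader, and incrementally one byte at a time. The sample text and variable names are made up for the example and do not come from the thread.

```python
import codecs
import io

# Sample input: a UTF-8 signature followed by UTF-8 encoded text.
data = codecs.BOM_UTF8 + "héllo".encode("utf-8")

# One-shot decoding from a bytes object.
one_shot = data.decode("utf-8-sig")

# Stream-based decoding through a StreamReader.
reader = codecs.getreader("utf-8-sig")(io.BytesIO(data))
streamed = reader.read()

# Incremental decoding, fed a single byte per call.
decoder = codecs.getincrementaldecoder("utf-8-sig")()
pieces = [decoder.decode(data[i:i + 1]) for i in range(len(data))]
pieces.append(decoder.decode(b"", final=True))
incremental = "".join(pieces)

print(one_shot, streamed, incremental)
```

In all three modes the signature is consumed and never reaches the decoded text.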

►

I’d suggest using the same mode of operation as we have in the UTF-16 codec:—M.-A.
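
For reference, this is roughly how the UTF-16 codecs behave in current Python: the generic `utf-16` codec writes a BOM when encoding and consumes it when decoding, while the endianness-specific `utf-16-le` codec does neither. This is a sketch of the existing behaviour, not a reconstruction of the proposal that followed the colon above.

```python
import codecs

# The generic codec prepends a BOM in the platform's byte order.
encoded = "hi".encode("utf-16")
starts_with_bom = encoded[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Decoding with the generic codec consumes the BOM again.
round_trip = encoded.decode("utf-16")

# The explicit little-endian codec writes no BOM ...
explicit = "hi".encode("utf-16-le")

# ... and treats a leading BOM as an ordinary U+FEFF character when decoding.
kept = (codecs.BOM_UTF16_LE + explicit).decode("utf-16-le")

print(starts_with_bom, repr(round_trip), explicit, repr(kept))
```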

►

I’ve started writing such a codec. Making the BOM optional on decoding definitely simplifies the implementation.—Walter
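
A small sketch of what "optional on decoding" looks like with the `utf-8-sig` codec as it exists in Python today. The byte strings are invented sample data.

```python
with_sig = b"\xef\xbb\xbfspam"
without_sig = b"spam"

# utf-8-sig accepts input with or without the signature.
decoded_with = with_sig.decode("utf-8-sig")
decoded_without = without_sig.decode("utf-8-sig")

# The plain utf-8 codec keeps the signature as a U+FEFF character.
decoded_plain = with_sig.decode("utf-8")

# On encoding, utf-8-sig always writes the signature.
encoded = "spam".encode("utf-8-sig")

print(repr(decoded_with), repr(decoded_without), repr(decoded_plain), encoded)
```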

►

OK, here is the patch:—Walter

►

This can be improved, of course: If the first byte is "a", it most definitely is not a UTF-8 signature.

So we only need a second byte for the characters between U+F000 and U+FFFF, and a third byte only for the characters U+FEC0...U+FEFF. But with the first byte being \xef, we need three bytes anyway, so we can always decide with the first byte only whether we need to wait for three bytes.—Martin
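
The code point ranges quoted above can be checked by brute force. The helper `code_points_with_prefix` is a hypothetical name introduced for this sketch; it scans every encodable code point and reports the first match, the last match, and the number of matches for a given UTF-8 byte prefix.

```python
def code_points_with_prefix(prefix):
    """Return (first, last, count) of code points whose UTF-8 form starts with prefix."""
    matches = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        if chr(cp).encode("utf-8").startswith(prefix):
            matches.append(cp)
    return matches[0], matches[-1], len(matches)

# Code points sharing the signature's first byte.
first_byte = code_points_with_prefix(b"\xef")

# Code points sharing the signature's first two bytes.
two_bytes = code_points_with_prefix(b"\xef\xbb")

print([hex(v) for v in first_byte[:2]], first_byte[2])
print([hex(v) for v in two_bytes[:2]], two_bytes[2])
```

Every match is a three-byte sequence, which is why seeing \xef as the first byte is enough to know that three bytes are needed before deciding.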

OK, I’ve updated the patch so that the first bytes will only be kept in the buffer if they are a prefix of the BOM.—Walter
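
The buffering rule described here might look something like the following. This is only a sketch of the idea, not Walter's patch: the class `SignatureStripper` and its methods are hypothetical, and it works on raw bytes without doing any decoding.

```python
import codecs

BOM = codecs.BOM_UTF8  # b"\xef\xbb\xbf"


class SignatureStripper:
    """Hypothetical helper that drops a leading UTF-8 signature from chunked input."""

    def __init__(self):
        self.buffer = b""
        self.decided = False

    def feed(self, chunk):
        if self.decided:
            return chunk
        data = self.buffer + chunk
        if len(data) < len(BOM) and BOM.startswith(data):
            # Still a proper prefix of the signature: hold it back.
            self.buffer = data
            return b""
        # Enough information to decide; nothing stays buffered from here on.
        self.decided = True
        self.buffer = b""
        if data.startswith(BOM):
            return data[len(BOM):]
        return data

    def finish(self):
        # At end of input, release any held prefix that never became a signature.
        pending, self.buffer = self.buffer, b""
        self.decided = True
        return pending


plain = SignatureStripper()
plain_out = plain.feed(b"a")  # passed through at once

signed = SignatureStripper()
signed_out = [signed.feed(b) for b in (b"\xef", b"\xbb", b"\xbf", b"x")]

other = SignatureStripper()
other_out = [other.feed(b) for b in (b"\xef", b"\xbc")]
```

Input starting with "a" is released immediately, while \xef and \xef\xbb are held until the next byte either completes the signature or rules it out.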