I recently rediscovered this strange behaviour in Python’s Unicode handling.—Evan

►

-1; there’s no standard for UTF-8 BOMs—adding it to the codecs module was probably a mistake to begin with.—M.-A.

►

There is a standard for UTF-8 _signatures_, however. I don’t have the most recent version of the ISO-10646 standard, but Amendment 2 (which defined UTF-8 for ISO-10646) specifically added the UTF-8 signature to Annex F of that standard. Evan quotes Version 4 of the Unicode standard, which explicitly defines the UTF-8 signature.

So there is a standard for the UTF-8 signature, and I know of applications which produce it. While I agree with you that Python’s codecs shouldn’t produce it (by default), providing an option to strip is a good idea.

However, this option should be part of the initialization of an IO stream which produces Unicode strings, _not_ an operation on arbitrary internal strings (whether raw or Unicode).—Stephen

I would personally like to see a "utf-8-bom" codec (perhaps better named "utf-8-sig"), which strips the BOM on reading (if present) and generates it on writing.—Martin

+1. —M.-A.
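As an editorial illustration of the behaviour proposed here: current Python releases ship a codec named `utf-8-sig` in the standard library, and the sketch below shows how it treats a signature when encoding and decoding plain strings. The variable names are illustrative only and do not come from the thread.

```python
import codecs

text = "hello"

# Writing: the codec prepends the UTF-8 signature (the bytes EF BB BF).
encoded = text.encode("utf-8-sig")

# Reading: the signature is stripped if it is present...
decoded_with_sig = encoded.decode("utf-8-sig")

# ...and its absence is tolerated.
decoded_without_sig = b"hello".decode("utf-8-sig")

# The plain "utf-8" codec leaves the signature in place as U+FEFF.
decoded_plain = encoded.decode("utf-8")

print(encoded == codecs.BOM_UTF8 + b"hello")
print(repr(decoded_plain))
```

Because the plain codec is unchanged, an application opts in to signature handling by choosing the codec name.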

With a UTF-8-SIG codec, the signature handling would apply to all operation modes of the codec, whether stream-based or from strings. Whether or not to use the codec would be the application’s choice.—Martin
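A minimal sketch of the stream-based mode, assuming the standard library's `utf-8-sig` codec and an in-memory `io.BytesIO` buffer standing in for a real file. It suggests that the signature is emitted once at the start of the stream, not once per write, and that the same bytes decode identically in stream mode and string mode.

```python
import io

# Stream-based writing: several writes, a single signature at the start.
buffer = io.BytesIO()
writer = io.TextIOWrapper(buffer, encoding="utf-8-sig", newline="")
writer.write("first ")
writer.write("second")
writer.flush()
stream_bytes = buffer.getvalue()

# Stream-based reading strips the signature again.
reader = io.TextIOWrapper(io.BytesIO(stream_bytes), encoding="utf-8-sig")
stream_text = reader.read()

# String-based decoding of the same bytes gives the same result.
string_text = stream_bytes.decode("utf-8-sig")

print(stream_bytes)
print(stream_text == string_text)
```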

OK, as a signature the BOM does make some sense. Whether stripping signatures from a document is a good idea is a different matter, though.

Here’s the Unicode Consortium’s FAQ on the subject:

<unicode.org>

They also explicitly warn about adding BOMs to UTF-8 data since it can break applications and protocols that do not expect such a signature.—M.-A.

Right. —M.-A.
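The warning about consumers that do not expect a signature can be illustrated with a small sketch. It uses the standard library's `json` parser as a stand-in for such a consumer. The payload is made up for the example and is not taken from the thread or the FAQ.

```python
import codecs
import json

# A hypothetical payload written by a tool that adds the UTF-8 signature.
payload = codecs.BOM_UTF8 + b'{"ok": true}'

# Decoding with plain "utf-8" keeps U+FEFF, and the parser rejects the text.
try:
    json.loads(payload.decode("utf-8"))
    bom_rejected = False
except ValueError:
    bom_rejected = True

# Decoding with "utf-8-sig" strips the signature before parsing.
parsed = json.loads(payload.decode("utf-8-sig"))

print(bom_rejected, parsed)
```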