I recently rediscovered this strange behaviour in Python’s Unicode handling.—Evan

-1; there’s no standard for UTF-8 BOMs; adding support for them to the codecs module was probably a mistake to begin with. You usually only get UTF-8 files with BOMs as the result of recoding UTF-16 files into UTF-8.—M.-A.

This is clearly incorrect. The UTF-8 BOM is specified in the Unicode standard, version 4, section 15.9:

  In UTF-8, the BOM corresponds to the byte sequence <EF BB BF>.
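That byte sequence is easy to verify from Python itself (a minimal sketch; `codecs.BOM_UTF8` is the constant the stdlib exposes for it):

```python
import codecs

# The BOM is U+FEFF; its UTF-8 encoding is the three bytes EF BB BF.
bom = u"\ufeff".encode("utf-8")
assert bom == b"\xef\xbb\xbf"

# The codecs module exposes the same byte sequence as a named constant.
assert bom == codecs.BOM_UTF8
```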

I regularly come across files with UTF-8 BOMs produced by Windows applications when saving a text file as UTF-8. Notepad or WordPad does this, for example, I believe; I think UltraEdit does the same, and I know Scintilla definitely does.—Evan

Yes, it is an MS application. I’ll have to borrow somebody’s box to check, but IIRC UTF-8 is the native "text" encoding for Japanese now. (Japanized applications generally behave differently from everything else, as there are so many "standards" for encoding Japanese.)—Stephen

There is a standard for UTF-8 _signatures_, however. I don’t have the most recent version of the ISO-10646 standard, but Amendment 2 (which defined UTF-8 for ISO-10646) specifically added the UTF-8 signature to Annex F of that standard. Evan quotes Version 4 of the Unicode standard, which explicitly defines the UTF-8 signature.

So there is a standard for the UTF-8 signature, and I know of applications which produce it. While I agree with you that Python’s codecs shouldn’t produce it (by default), providing an option to strip it is a good idea.

However, this option should be part of the initialization of an IO stream which produces Unicode strings, _not_ an operation on arbitrary internal strings (whether raw or Unicode).—Stephen
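For what it’s worth, that is exactly the shape the feature took in later Python: the `"utf-8-sig"` codec strips a leading signature when a stream is opened, while plain `"utf-8"` passes it through. A small sketch (the file name and contents here are made up for illustration):

```python
import codecs
import io
import os
import tempfile

# Write a file the way a BOM-producing Windows editor would:
# signature first, then the UTF-8 text.
path = os.path.join(tempfile.mkdtemp(), "signed.txt")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + u"hello".encode("utf-8"))

# Decoding as plain "utf-8" leaves U+FEFF at the front of the text...
with io.open(path, encoding="utf-8") as f:
    assert f.read() == u"\ufeffhello"

# ...while "utf-8-sig" strips the signature as part of stream
# initialization, which is the behaviour argued for above.
with io.open(path, encoding="utf-8-sig") as f:
    assert f.read() == u"hello"
```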