I recently rediscovered this strange behaviour in Python’s Unicode handling. I think it is a bug, but before I go and try to hack together a patch, I figure I should run it by the experts here on Python-Dev. If you understand Unicode, please let me know if there are problems with making these minor changes.

    >>> import codecs
    >>> codecs.BOM_UTF8.decode( "utf8" )
    u'\ufeff'
    >>> codecs.BOM_UTF16.decode( "utf16" )
    u''

Why does the UTF-16 decoder discard the BOM, while the UTF-8 decoder turns it into a character? The UTF-16 decoder contains logic to correctly handle the BOM. It even handles byte swapping, if necessary. I propose that the UTF-8 decoder should have the same logic: it should remove the BOM if it is detected at the beginning of a string. This will remove a bit of manual work for Python programs that deal with UTF-8 files created on Windows, which frequently have the BOM at the beginning. The Unicode standard is unclear about how it should be handled (version 4, section 15.9):

    Although there are never any questions of byte order with UTF-8 text,
    this sequence can serve as signature for UTF-8 encoded text where the
    character set is unmarked. [...] Systems that use the byte order mark
    must recognize when an initial U+FEFF signals the byte order. In those
    cases, it is not part of the textual content and should be removed
    before processing, because otherwise it may be mistaken for a
    legitimate zero width no-break space.
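For comparison, here is a minimal sketch of the proposed behaviour, written as an ordinary wrapper function rather than a codec change (the name decode_utf8 is mine, not an existing API):

    import codecs

    def decode_utf8( data ):
        # Mirror what the "utf-16" codec already does: strip a leading
        # BOM/signature, if present, before decoding.
        if data.startswith( codecs.BOM_UTF8 ):
            data = data[len( codecs.BOM_UTF8 ):]
        return data.decode( "utf8" )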

At the very least, it would be nice to add a note about this to the documentation, and possibly add this example function that implements the "UTF-8 or ASCII?" logic:

    def autodecode( s ):
        if s.startswith( codecs.BOM_UTF8 ):
            # The byte string s is UTF-8; the decoded BOM is a single
            # character, so slice it off
            out = s.decode( "utf8" )
            return out[1:]
        else:
            return s.decode( "ascii" )
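For example, assuming the function above (the sample bytes are mine):

    >>> autodecode( codecs.BOM_UTF8 + 'hello' )
    u'hello'
    >>> autodecode( 'hello' )
    u'hello'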

As a second issue, the UTF-16-LE and UTF-16-BE decoders almost do the right thing: they turn the BOM into a character, just as the Unicode specification says they should.

    >>> codecs.BOM_UTF16_LE.decode( "utf-16le" )
    u'\ufeff'
    >>> codecs.BOM_UTF16_BE.decode( "utf-16be" )
    u'\ufeff'

However, they also incorrectly handle the reversed byte order mark:

    >>> codecs.BOM_UTF16_BE.decode( "utf-16le" )
    u'\ufffe'

This is not a valid Unicode character. The Unicode specification (version 4, section 15.8) says the following about non-characters:

    Applications are free to use any of these noncharacter code points
    internally but should never attempt to exchange them. If a
    noncharacter is received in open interchange, an application is not
    required to interpret it in any way. It is good practice, however, to
    recognize it as a noncharacter and to take appropriate action, such as
    removing it from the text. Note that Unicode conformance freely allows
    the removal of these characters. (See C10 in Section 3.2, Conformance
    Requirements.)

My interpretation of the specification is that Python should silently remove the character, resulting in a zero-length Unicode string. Similarly, both of the following lines should also result in a zero-length Unicode string:

    >>> '\xff\xfe\xfe\xff'.decode( "utf16" )
    u'\ufffe'
    >>> '\xff\xfe\xff\xff'.decode( "utf16" )
    u'\uffff'
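As a sketch, that interpretation could be implemented as a post-processing step rather than a codec change (the helper name decode_utf16_clean is mine):

    def decode_utf16_clean( data ):
        # Decode, then silently drop the noncharacters U+FFFE and U+FFFF,
        # as conformance clause C10 permits.
        text = data.decode( "utf16" )
        return text.replace( u'\ufffe', u'' ).replace( u'\uffff', u'' )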

Thanks for your feedback,

—Evan

The BOM (byte order mark) was a non-standard Microsoft invention to detect Unicode text data as such (MS always uses UTF-16-LE for Unicode text files).

It is not needed for UTF-8 because that format doesn’t rely on byte order, and a BOM character at the beginning of a stream is a legitimate ZWNBSP (zero width no-break space) code point.

The "utf-16" codec detects and removes the mark, while the two others "utf-16-le" (little endian byte order) and "utf-16-be" (big endian byte order) don’t.—M.-A.


-1; there’s no standard for UTF-8 BOMs - adding it to the codecs module was probably a mistake to begin with. You usually only get UTF-8 files with BOM marks as the result of recoding UTF-16 files into UTF-8.—M.-A.


Well, I’d say that’s a very English way of dealing with encoded text ;-)

BTW, how do you know that s came from the start of a file and not from slicing some already loaded file somewhere in the middle?—M.-A.


Hmm, wouldn’t it be better to raise an error? After all, a reversed BOM mark in the stream looks a lot like you’re trying to decode a UTF-16 stream assuming the wrong byte order?!

Other than that: +1 on fixing this case.—M.-A.
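A minimal sketch of that stricter behaviour, done at the application level rather than inside the codec (the helper name decode_utf16_strict is mine, not an existing API):

    def decode_utf16_strict( data, encoding ):
        # Decode, then treat a leading reversed BOM (U+FFFE) as evidence
        # that the wrong byte order was assumed, and fail loudly.
        text = data.decode( encoding )
        if text[:1] == u'\ufffe':
            raise UnicodeError( "reversed BOM: wrong byte order for %r?" % encoding )
        return text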


Please note that I am only suggesting something like this be considered for addition to the documentation, not to the Python standard library. This example function more closely replicates the logic that is used by those Windows applications when opening ".txt" files. It falls back to the default encoding if there is no BOM:

    def autodecode( s ):
        if s.startswith( codecs.BOM_UTF8 ):
            # The byte string s is UTF-8; the decoded BOM is a single
            # character, so slice it off
            out = s.decode( "utf8" )
            return out[1:]
        else:
            return s.decode()

—Evan

Well, either one is possible; however, the Unicode standard suggests, but does not require, silently removing them:

    It is good practice, however, to recognize it as a noncharacter and to
    take appropriate action, such as removing it from the text. Note that
    Unicode conformance freely allows the removal of these characters.

I would prefer that str.decode() silently ignore them, since I believe in "be strict in what you emit, but liberal in what you accept." I think that this only applies to str.decode(). Any other attempt to create non-characters, such as unichr( 0xffff ), should raise an exception, because clearly the programmer is making a mistake.—Evan