I recently rediscovered this strange behaviour in Python’s Unicode handling. I think it is a bug, but before I go and try to hack together a patch, I figure I should run it by the experts here on Python-Dev. If you understand Unicode, please let me know if there are problems with making these minor changes.
Why does the UTF-16 decoder discard the BOM, while the UTF-8 decoder turns it into a character? The UTF-16 decoder contains logic to correctly handle the BOM. It even handles byte swapping, if necessary. I propose that the UTF-8 decoder should have the same logic: it should remove the BOM if it is detected at the beginning of a string. This will remove a bit of manual work for Python programs that deal with UTF-8 files created on Windows, which frequently have the BOM at the beginning. The Unicode standard is unclear about how it should be handled (version 4, section 15.9):
At the very least, it would be nice to add a note about this to the documentation, and possibly add this example function that implements the "UTF-8 or ASCII?" logic:
As a second issue, the UTF-16LE and UTF-16BE encoders almost do the right thing: They turn the BOM into a character, just like the Unicode specification says they should.
However, they also incorrectly handle the reversed byte order mark:
This is not a valid Unicode character. The Unicode specification (version 4, section 15.8) says the following about non-characters:
My interpretation of the specification means that Python should silently remove the character, resulting in a zero length Unicode string. Similarly, both of the following lines should also result in a zero length Unicode string:
Thanks for your feedback, Evan Jones |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
|
|
|
|