I recently rediscovered this strange behaviour in Python’s Unicode handling.—Evan

I’ve actually been confused about this point for quite some time now, but never had a chance to bring it up. I do not understand why UnicodeError should be raised if there is no BOM. I know that PEP 100 says:

'utf-16': 16-bit variable length encoding (little/big endian)

and:

Note: 'utf-16' should be implemented by using and requiring byte order marks (BOM) for file input/output.

But this appears to be in error, at least relative to the current Unicode standard: 'utf-16', as defined by the standard, is big-endian in the absence of a BOM:

3.10 D42, UTF-16 encoding scheme:

    ... The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian.

The current implementation of the utf-16 codec forces some irritating gymnastics: if a file contains no BOM, you have to write a BOM into it before you can read it. That seems very much like a bug in the codec. I allow for the possibility that this was ambiguous in the standard when the PEP was written, but it is certainly not ambiguous now.—Nicholas

The problem is "in the absence of a higher-level protocol": the codec doesn’t know anything about a protocol - it’s the application using the codec that knows which protocol is used. It’s a lot safer to require the BOM for UTF-16 streams and raise an exception, so that the application can decide whether to use UTF-16-BE or the by far more common UTF-16-LE.

Unlike for the UTF-8 codec, the BOM for UTF-16 is a configuration parameter, not merely a signature.
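
The distinction can be illustrated with today's codecs module (a small sketch, not part of the original thread): the endian-specific codecs never emit a BOM, while the plain utf-16 codec writes one in the platform's native byte order.

```python
import codecs

# The endian-specific codecs never produce (or expect) a BOM:
assert "A".encode("utf-16-le") == b"A\x00"
assert "A".encode("utf-16-be") == b"\x00A"

# The plain utf-16 codec writes a BOM in the platform's native order,
# so the first two bytes are always one of the two BOM forms:
assert "A".encode("utf-16")[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# The BOM is just U+FEFF serialized in the stream's byte order:
assert codecs.BOM_UTF16_BE == b"\xfe\xff"
assert codecs.BOM_UTF16_LE == b"\xff\xfe"
```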

In terms of history, I don’t recall whether your quote was already in the standard at the time I wrote the PEP. You are the first to have reported a problem with the current implementation (which has been around since 2000), so I believe that application writers are more comfortable with the way the UTF-16 codec is currently implemented. Explicit is better than implicit :-)—M.-A.

Yes, see, I read a lot of UTF-16 that comes from other sources. It’s not a matter of writing with python and reading with python.—Nicholas

Ok, but I don’t really follow you here: you are suggesting relaxing the current UTF-16 behaviour and defaulting to UTF-16-BE if no BOM is present - that’s most likely going to cause more problems than it solves: namely complete garbage if the data turns out to be UTF-16-LE encoded and, what’s worse, enters the application undetected.
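
The garbage is easy to demonstrate (a sketch in modern Python, not from the thread): ASCII text encoded little-endian and decoded big-endian pairs the bytes the wrong way round, producing valid but meaningless code points that sail through undetected.

```python
data = "hi".encode("utf-16-le")   # b'h\x00i\x00'
wrong = data.decode("utf-16-be")  # byte pairs read as 0x6800, 0x6900
assert wrong == "\u6800\u6900"    # valid code points, but meaningless CJK text
```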

If you do have UTF-16 without a BOM, it’s much better to let a short function analyze the text by reading the first few bytes of the file and then make an educated guess based on the findings. You can then process the file using one of the other codecs, UTF-16-LE or UTF-16-BE.—M.-A.
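
Such a guessing function might look like this (a hypothetical helper, assuming mostly-ASCII text, where the NUL half of each code unit betrays the byte order):

```python
import codecs

def guess_utf16_codec(data: bytes) -> str:
    """Guess which UTF-16 codec to use for BOM-less data."""
    # An explicit BOM settles the question immediately.
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    # Heuristic: for mostly-ASCII text the high-order byte of each
    # code unit is zero, so count NULs at even vs. odd offsets.
    even_zeros = data[0:100:2].count(0)
    odd_zeros = data[1:100:2].count(0)
    if even_zeros > odd_zeros:
        return "utf-16-be"  # 0x00 comes first: big-endian
    if odd_zeros > even_zeros:
        return "utf-16-le"  # 0x00 comes second: little-endian
    return "utf-16-be"      # spec default in the absence of evidence
```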

The crux of my argument is that the spec declares that UTF-16 without a BOM is BE. If a file is encoded in UTF-16-LE and doesn’t have a BOM, it doesn’t deserve to be processed correctly. That being said, treating it as UTF-16-BE when it is really LE will result in a lot of invalid code points, so it should be obvious that something has gone wrong.—Nicholas

This is about what we do now - we catch UnicodeError and then add a BOM to the file, and read it again. We know our files are UTF-16BE if they don’t have a BOM, as the files are written by code which observes the spec. We can’t use UTF-16BE all the time, because sometimes they’re UTF-16LE, and in those cases the BOM is set.
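
The workaround amounts to something like this (a sketch that reads the file whole into memory; read_utf16_be_default is a hypothetical name, and the explicit BOM check stands in for the catch-UnicodeError-and-retry dance described above):

```python
import codecs

def read_utf16_be_default(path: str) -> str:
    """Read a UTF-16 file, treating a missing BOM as big-endian:
    prepend a BE BOM and let the plain utf-16 codec do the rest."""
    with open(path, "rb") as f:
        raw = f.read()
    if not raw.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        raw = codecs.BOM_UTF16_BE + raw  # BOM-less files are BE per the spec
    return raw.decode("utf-16")
```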

It would be nice if you could optionally specify that the codec should assume UTF-16-BE if no BOM is present, and not raise UnicodeError in that case. That would preserve the current behaviour as well as allow users to ask for behaviour which conforms to the standard.
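
The behaviour being asked for could be expressed as a thin wrapper over the existing codecs (a sketch; decode_utf16_spec is a hypothetical name, not a proposed API):

```python
import codecs

def decode_utf16_spec(raw: bytes) -> str:
    """Decode UTF-16 per the standard: honour a BOM if present,
    otherwise default to big-endian as 3.10 D42 prescribes."""
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[2:].decode("utf-16-be")
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[2:].decode("utf-16-le")
    return raw.decode("utf-16-be")  # no BOM: the spec says big-endian
```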

I’m not saying that you can’t work around the issue now; what I’m saying is that you shouldn’t have to. I think there is a reasonable expectation that the UTF-16 codec conforms to the spec, and those who want it to do something else are the ones who should be forced to come up with a workaround.—Nicholas


See above.

Thanks,—M.-A.