I recently rediscovered this strange behaviour in Python’s Unicode handling.—Evan

I’d suggest using the same mode of operation as we have in the UTF-16 codec: it removes the BOM mark on the first call to the StreamReader’s .decode() method and writes a BOM mark on the first call to .encode() on a StreamWriter.

Note that the UTF-16 codec is strict with respect to the presence of the BOM mark: you get a UnicodeError if a stream does not start with a BOM mark. For the UTF-8-SIG codec, this should probably be relaxed to not require the BOM.—M.-A.
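For reference, the UTF-16 behaviour M.-A. describes can be seen with a short sketch against the codecs StreamWriter/StreamReader API (the exact bytes depend on the platform’s byte order):

    import codecs
    import io

    buf = io.BytesIO()
    writer = codecs.getwriter("utf-16")(buf)
    writer.write("ab")     # the first write emits the BOM plus the data
    writer.write("cd")     # later writes emit data only
    print(buf.getvalue())  # e.g. b'\xff\xfea\x00b\x00c\x00d\x00' on a little-endian build

    buf.seek(0)
    reader = codecs.getreader("utf-16")(buf)
    print(reader.read())   # 'abcd' -- the BOM is consumed, not returned to the caller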

OK, here is the patch: <python.org>

The stateful decoder has a little problem: at least three bytes have to be available from the stream before the StreamReader can decide whether those bytes are a BOM that has to be skipped. This means that if the file only contains "ab", the user will never see these two characters.

A solution for this would be to add an argument named final to the decode and read methods that tells the decoder that the stream has ended and the remaining buffered bytes have to be handled now.

Bye,—Walter
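To make Walter’s proposal concrete, here is a rough sketch of such a decode(..., final=...) interface, written against today’s codecs module; the class name and internals are hypothetical, not taken from the patch:

    import codecs

    class Utf8SigDecoder:
        # Hypothetical sketch: bytes are held back until we can tell whether
        # they start with the UTF-8 signature, and final=True flushes whatever
        # is still pending at the end of the stream.
        def __init__(self):
            self._pending = b""
            self._bom_checked = False
            self._decoder = codecs.getincrementaldecoder("utf-8")()

        def decode(self, data, final=False):
            if not self._bom_checked:
                self._pending += data
                if len(self._pending) < 3 and not final:
                    return ""    # cannot rule the BOM in or out yet
                if self._pending.startswith(codecs.BOM_UTF8):
                    self._pending = self._pending[len(codecs.BOM_UTF8):]
                data, self._pending = self._pending, b""
                self._bom_checked = True
            return self._decoder.decode(data, final)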

Shouldn’t the decoder be capable of doing a partial match and quitting early? After all, "ab" is encoded in UTF-8 as <61> <62>, but the BOM is <ef> <bb> <bf>. If it did this kind of partial matching, the issue would be avoided except in rare situations.—Evan
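In code, the partial match Evan has in mind amounts to a prefix test (the helper name is made up for illustration):

    import codecs

    def might_be_bom(prefix):
        # If the bytes seen so far are not a prefix of the UTF-8 signature,
        # there is no point in waiting for more input.
        return codecs.BOM_UTF8.startswith(prefix)

    might_be_bom(b"ab")        # False: "ab" can be returned immediately
    might_be_bom(b"\xef")      # True:  keep waiting
    might_be_bom(b"\xef\xbb")  # True:  keep waiting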

This can be improved, of course: if the first byte is "a", it most definitely is not a UTF-8 signature.

So we only need a second byte for the characters between U+F000 and U+FFFF, and a third byte only for the characters U+FEC0...U+FEFF. But with the first byte being \xef, we need three bytes anyway, so we can always decide from the first byte alone whether we need to wait for three bytes.—Martin
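Those ranges are easy to verify with a quick sanity check (not part of the thread, just a restatement of Martin’s argument):

    # Every character from U+F000 through U+FFFF has a three-byte UTF-8 encoding
    # whose first byte is 0xEF, and every character from U+FEC0 through U+FEFF
    # has an encoding that starts with 0xEF 0xBB.
    assert all(len(chr(c).encode("utf-8")) == 3 and chr(c).encode("utf-8")[0] == 0xEF
               for c in range(0xF000, 0x10000))
    assert all(chr(c).encode("utf-8")[:2] == b"\xef\xbb"
               for c in range(0xFEC0, 0xFF00))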

There are situations where the byte stream might be temporarily exhausted, e.g. an XML parser that tries to support the IncrementalParser interface, or when you want to decode data piecewise because you want to give a progress report.

Bye,—Walter
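For illustration, current Python’s codecs module provides incremental decoders with exactly this kind of final flag; decoding piecewise with a progress report then looks like this (shown with the "utf-8-sig" codec):

    import codecs

    dec = codecs.getincrementaldecoder("utf-8-sig")()
    chunks = [b"\xef", b"\xbb", b"\xbfab", b"cd"]   # the stream arrives in pieces
    text = ""
    for i, chunk in enumerate(chunks, 1):
        text += dec.decode(chunk, final=(i == len(chunks)))
        print("processed %d of %d chunks" % (i, len(chunks)))
    print(text)   # 'abcd'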

Yes, but these are not file-like objects. In the IncrementalParser, it is not the case that a read operation returns an empty string. Instead, the application repeatedly feeds data explicitly. For a file-like object, returning "" indicates EOF.

Regards,—Martin

I’ve actually been confused about this point for quite some time now, but never had a chance to bring it up. I do not understand why UnicodeError should be raised if there is no BOM. I know that PEP-100 says:

’utf-16’: 16-bit variable length encoding (little/big endian)

and:

Note: ’utf-16’ should be implemented by using and requiring byte order marks (BOM) for file input/output.

But this appears to be in error, at least with respect to the current Unicode standard. ’utf-16’, as defined by the Unicode standard, is big-endian in the absence of a BOM:

Section 3.10, D42 (UTF-16 encoding scheme): "... The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian."

The current implementation of the utf-16 codec makes for some irritating gymnastics: if a file contains no BOM, you have to write one into the data yourself before you can read it, which seems quite like a bug in the codec. I allow for the possibility that this was ambiguous in the standard when the PEP was written, but it is certainly not ambiguous now.—Nicholas
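The gymnastics look roughly like this; the file name is made up, and the point is having to splice a BOM into data that never contained one:

    import codecs

    with open("dump-from-other-tool.txt", "rb") as f:   # hypothetical BOM-less UTF-16-BE file
        raw = f.read()
    if not raw.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        raw = codecs.BOM_UTF16_BE + raw    # pretend the BOM was there all along
    text = raw.decode("utf-16")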

The problem is "in the absence of a higher level protocol": the codec doesn’t know anything about a protocol - it’s the application using the codec that knows which protocol gets used. It’s a lot safer to require the BOM for UTF-16 streams and raise an exception otherwise, letting the application decide whether to use UTF-16-BE or the by far more common UTF-16-LE.

Unlike for the UTF-8 codec, the BOM for UTF-16 is a configuration parameter, not merely a signature.

In terms of history, I don’t recall whether your quote was already in the standard at the time I wrote the PEP. You are the first to have reported a problem with the current implementation (which has been around since 2000), so I believe that application writers are more comfortable with the way the UTF-16 codec is currently implemented. Explicit is better than implicit :-)—M.-A.

Yes, see, I read a lot of UTF-16 that comes from other sources. It’s not a matter of writing with python and reading with python.—Nicholas

Ok, but I don’t really follow you here: you are suggesting to relax the current UTF-16 behavior and start defaulting to UTF-16-BE if no BOM is present - that’s most likely going to cause more problems than it solves: namely complete garbage if the data turns out to be UTF-16-LE encoded and, what’s worse, garbage that enters the application undetected.

If you do have UTF-16 without a BOM mark, it’s much better to let a short function analyze the text by reading the first few bytes of the file and then make an educated guess based on the findings. You can then process the file using one of the other codecs, UTF-16-LE or UTF-16-BE.—M.-A.
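A hedged sketch of such a helper; the heuristic and the function name are made up for illustration, not M.-A.’s code:

    import codecs

    def guess_utf16_codec(data):
        # Honour a BOM if one is present; otherwise make an educated guess from
        # the first two bytes: ASCII-heavy UTF-16-BE text has a zero high byte.
        if data.startswith(codecs.BOM_UTF16_LE):
            return "utf-16-le", data[2:]
        if data.startswith(codecs.BOM_UTF16_BE):
            return "utf-16-be", data[2:]
        if len(data) >= 2 and data[0] == 0 and data[1] != 0:
            return "utf-16-be", data
        return "utf-16-le", data

    codec, payload = guess_utf16_codec(b"\x00a\x00b")
    print(codec, payload.decode(codec))    # utf-16-be ab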

See above.

Thanks,—M.-A.