I recently rediscovered this strange behaviour in Python’s Unicode handling.—Evan

This is what we do now: we catch UnicodeError, add a BOM to the file, and read it again. We know our files are UTF-16BE if they don’t have a BOM, because they are written by code which observes the spec. We can’t use UTF-16BE all the time, because sometimes the files are UTF-16LE, and in those cases a BOM is present.
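
In rough outline, that workaround looks something like the sketch below (simplified: instead of adding a BOM and re-reading, it just re-reads with an explicit UTF-16BE codec; the helper name is made up for illustration):

    import codecs

    def read_utf16(path):
        try:
            # codecs.open() with 'utf-16' raises UnicodeError from its
            # StreamReader if the file does not start with a BOM.
            with codecs.open(path, encoding='utf-16') as f:
                return f.read()
        except UnicodeError:
            # No BOM: our writers observe the spec, so the data is UTF-16BE.
            with codecs.open(path, encoding='utf-16-be') as f:
                return f.read()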

It would be nice if you could optionally specify that the codec should assume UTF-16BE if no BOM is present, and not raise UnicodeError in that case; this would preserve the current behaviour by default while allowing users to ask for behaviour which conforms to the standard.

I’m not saying that you can’t work around the issue now; what I’m saying is that you shouldn’t have to. I think there is a reasonable expectation that the UTF-16 codec conforms to the spec, and if someone wants it to do something else, they are the ones who should have to come up with a workaround.—Nicholas

It should be feasible to implement your own codec for that based on Lib/encodings/utf_16.py. Simply replace the line in StreamReader.decode():

    raise UnicodeError,"UTF-16 stream does not start with BOM"

with:

    self.decode = codecs.utf_16_be_decode

and you should be done.

Bye,—Walter

Oops, this only works if you have a big-endian system. Otherwise you have to redecode the input with:

    codecs.utf_16_ex_decode(input, errors, 1, False)

Bye,—Walter
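
Putting both messages together, a self-contained sketch (in modern syntax; the class name is made up, the structure follows Lib/encodings/utf_16.py) could look like this:

    import codecs
    import encodings.utf_16

    class UTF16DefaultBEStreamReader(encodings.utf_16.StreamReader):
        # Behaves like the stock UTF-16 StreamReader, except that a
        # missing BOM means "assume big-endian" instead of UnicodeError.
        def decode(self, input, errors='strict'):
            (object, consumed, byteorder) = codecs.utf_16_ex_decode(
                input, errors, 0, False)
            if byteorder == -1:
                self.decode = codecs.utf_16_le_decode
            elif byteorder == 1:
                self.decode = codecs.utf_16_be_decode
            elif consumed >= 2:
                # No BOM: redecode this chunk explicitly as big-endian
                # (the call above used the machine's native order), and
                # stay big-endian for the rest of the stream.
                (object, consumed, byteorder) = codecs.utf_16_ex_decode(
                    input, errors, 1, False)
                self.decode = codecs.utf_16_be_decode
            return (object, consumed)

    # e.g. UTF16DefaultBEStreamReader(open('data.txt', 'rb')).read()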

Alternatively, the UTF-16BE codec could support the BOM, and do UTF-16LE if the "other" BOM is found.

This would also support your use case, and in a better way. The Unicode assertion that UTF-16 is BE by default is void these days: there is always a higher-layer protocol, and more often than not it specifies (perhaps not in English words, but only in the source code of the generator) that the default should be LE.

Regards,—Martin
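
A rough sketch of that alternative (this is not what any stdlib codec does today; the function name is made up): a decoder that honours either BOM and otherwise falls back to big-endian:

    import codecs

    def decode_utf16_default_be(data, errors='strict'):
        if data.startswith(codecs.BOM_UTF16_LE):
            # The "other" BOM: switch to little-endian and strip the BOM.
            return data[2:].decode('utf-16-le', errors)
        if data.startswith(codecs.BOM_UTF16_BE):
            return data[2:].decode('utf-16-be', errors)
        # No BOM: keep the codec's nominal byte order, big-endian.
        return data.decode('utf-16-be', errors)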

I’ve checked the various versions of the Unicode standard docs: it seems that the quote you have was silently introduced between 3.0 and 4.0.

Python currently uses version 3.2.0 of the standard, and I don’t think enough people are aware of the change in the standard to make a case for dropping the exception raised when the UTF-16 codec finds a stream without a BOM.

Once we switch to 4.1 or later, we can make the change in the native UTF-16 codec as you requested.

Personally, I think the Unicode Consortium should not have introduced a default byte order for the UTF-16 encoding. Using big-endian as the default in a world where most Unicode data is created on little-endian machines is not very realistic either.

Note that the UTF-16 codec starts reading data in the machine’s native byte order and then learns a possibly different byte order by looking for BOMs.

Implementing a codec with the 4.0 behavior is easy, though.—M.-A.
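
A small demonstration of that native-byte-order fallback when decoding a buffer directly (output noted for a little-endian machine; the stream reader, by contrast, raises as discussed above):

    import codecs

    data = u"AB".encode("utf-16-be")     # b'\x00A\x00B', no BOM
    print(data.decode("utf-16"))         # little-endian machine: U+4100 U+4200, not "AB"
    print(data.decode("utf-16-be"))      # explicitly big-endian: "AB"

    # With a BOM, the byte order is taken from the BOM on any machine:
    data = codecs.BOM_UTF16_BE + u"AB".encode("utf-16-be")
    print(data.decode("utf-16"))         # "AB"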


That is _not_ a protocol. A protocol is a published specification, not merely a frequent accident of implementation. Anyway, both ISO 10646 and the Unicode standard consider that "internal use", and there is no requirement at all placed on such data. And such generators typically take great advantage of that freedom: have you looked in a .doc file recently? Have you noticed how many different options (previous implementations) of .doc are offered in the Import menu?—Stephen