I recently rediscovered this strange behaviour in Python’s Unicode handling.—Evan

I’ve checked the various versions of the Unicode standard docs: it seems that the quote you have was silently introduced between 3.0 and 4.0.

Python currently uses version 3.2.0 of the standard, and I don’t think enough people are aware of the change to make a case for dropping the exception that the UTF-16 codec raises when it finds a stream without a BOM.

Once we switch to 4.1 or later, we can make the change in the native UTF-16 codec as you requested.

Personally, I think that the Unicode consortium should not have introduced a default for the UTF-16 encoding byte order. Using big-endian as the default in a world where most Unicode data is created on little-endian machines is not very realistic either.

Note that the UTF-16 codec starts reading data in the machine’s native byte order and then learns a possibly different byte order by looking for BOMs.
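The BOM-driven part of that behaviour can be observed directly. The following is a minimal sketch, assuming a current CPython and its built-in `utf-16` codec; the variable names are only illustrative. It deliberately avoids the BOM-less case, since that result depends on the Python version and the host byte order.

```python
import codecs

payload = "hi"

# The same text serialized in each byte order, with the matching BOM in front.
with_be_bom = codecs.BOM_UTF16_BE + payload.encode("utf-16-be")
with_le_bom = codecs.BOM_UTF16_LE + payload.encode("utf-16-le")

# The generic "utf-16" codec reads the BOM, picks the byte order it announces,
# and strips the BOM from the decoded text, whatever the host byte order is.
print(with_be_bom.decode("utf-16"))
print(with_le_bom.decode("utf-16"))
```

Both inputs decode to the same string because the BOM, not the machine, decides the byte order.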

Implementing a codec which implements the 4.0 behavior is easy, though.—M.-A.
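As a rough illustration of how little is involved, here is a sketch of a decoder that falls back to big-endian when no BOM is present, which is how the 4.0 default is described in this thread. The function name `decode_utf16_default_be` is hypothetical, and this is a plain function rather than a codec registered with the `codecs` module.

```python
import codecs

def decode_utf16_default_be(data: bytes) -> str:
    """Decode UTF-16 bytes, assuming big-endian when no BOM is present."""
    if data.startswith(codecs.BOM_UTF16_BE):
        # Explicit big-endian BOM: skip it and decode the rest as BE.
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        # Explicit little-endian BOM: skip it and decode the rest as LE.
        return data[2:].decode("utf-16-le")
    # No BOM: fall back to big-endian instead of raising or using native order.
    return data.decode("utf-16-be")

print(decode_utf16_default_be(b"\x00h\x00i"))          # no BOM
print(decode_utf16_default_be(b"\xff\xfeh\x00i\x00"))  # little-endian BOM
```

A full codec would also need an encoder and incremental or stream variants; those are left out of this sketch.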

Probably because ISO 10646 was _always_ BE until the standards were unified. But note that ISO 10646 standardizes only use as a communications medium. Neither ISO 10646 nor Unicode makes any specification about internal usage. Conformance in internal processing is a matter of the programmer’s convenience in producing conforming output.—Stephen

While in principle I sympathize with Nick, pragmatically Microsoft is unlikely to conform. They will take the position that files created by Windows are "internal" to the Windows environment, except where explicitly intended for exchange with arbitrary platforms, and only then will they conform. As Martin points out, that is what really matters for these defaults. I think you should look to see what Microsoft does.—Stephen

It’s not a default for the UTF-16 encoding byte order. It’s a default for the UTF-16 encoding byte order _when UTF-16 is a communications medium_. Given that the generic network byte order is big-endian, I think it would be insane to specify little-endian as Unicode’s default.

With Unicode’s default byte order the same as network byte order, you represent UTF-16 strings internally as an array of uint16_t, and when you put them on the wire (including saving them to a file that might be put on the wire as octet-stream) you apply htons(3) to each element. On reading, you apply ntohs(3) to each element. The source code is portable, the file is portable. How can you beat that?—Stephen
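The same idea can be sketched in Python, with the `struct` module's `!` (network byte order) format standing in for the per-element htons/ntohs calls. The helper names `to_wire` and `from_wire` are hypothetical. The example builds the code units with `ord`, which assumes the text contains only BMP characters (no surrogate pairs).

```python
import struct

def to_wire(units):
    """Pack 16-bit code units in network (big-endian) byte order."""
    return struct.pack(f"!{len(units)}H", *units)

def from_wire(data):
    """Unpack network-order bytes back into a list of 16-bit code units."""
    return list(struct.unpack(f"!{len(data) // 2}H", data))

text = "hi\u20ac"
# Assumption: BMP-only text, so each character is exactly one UTF-16 code unit.
units = [ord(ch) for ch in text]

wire = to_wire(units)
# The wire form matches what the explicit big-endian codec produces.
print(wire == text.encode("utf-16-be"))
# Reading it back recovers the original code units on any host.
print(from_wire(wire) == units)
```

The code never inspects the host byte order, which is the portability being argued for.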