I recently rediscovered this strange behaviour in Python’s Unicode handling.—Evan

With the UTF-8-SIG codec, it would apply to all operation modes of the codec, whether stream-based or from strings. Whether or not to use the codec would be the application’s choice.—Martin

I’d suggest using the same mode of operation as in the UTF-16 codec: it removes the BOM on the first call to the StreamReader .decode() method and writes a BOM on the first call to .encode() on a StreamWriter.

Note that the UTF-16 codec is strict with respect to the presence of the BOM mark: you get a UnicodeError if a stream does not start with a BOM mark. For the UTF-8-SIG codec, this should probably be relaxed to not require the BOM.—M.-A.
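
For readers following along in current Python, the behaviour M.-A. describes is easy to observe with an in-memory stream (io.BytesIO stands in for a real file; the byte order of the BOM is the platform’s native one):

    import codecs
    import io

    buf = io.BytesIO()
    writer = codecs.getwriter("utf-16")(buf)
    writer.write("ab")    # the first .write() emits a BOM
    writer.write("cd")    # subsequent writes do not

    reader = codecs.getreader("utf-16")(io.BytesIO(buf.getvalue()))
    print(reader.read())  # 'abcd' -- the BOM is consumed on the first read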

OK, here is the patch: <python.org>

The stateful decoder has a little problem: at least three bytes have to be available from the stream before the StreamReader can decide whether those bytes are a BOM that has to be skipped. This means that if the file only contains "ab", the user will never see these two characters.

A solution for this would be to add an argument named final to the decode and read methods that tells the decoder that the stream has ended and the remaining buffered bytes have to be handled now.

Bye,—Walter
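
This is in fact the shape the solution took: Python’s incremental decoders now accept exactly such a final argument, and the utf-8-sig decoder buffers a potential signature until it can decide. A minimal sketch (EF BB BF is the UTF-8 signature):

    import codecs

    dec = codecs.getincrementaldecoder("utf-8-sig")()
    # Fed one byte at a time, a partial signature cannot be told apart
    # from the start of an ordinary multi-byte character, so nothing is
    # emitted yet.
    assert dec.decode(b"\xef") == ""
    assert dec.decode(b"\xbb") == ""
    assert dec.decode(b"\xbf") == ""    # complete BOM: recognized and dropped
    # final=True tells the decoder the stream has ended and any buffered
    # bytes must be handled now.
    assert dec.decode(b"ab", final=True) == "ab"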

I’ve actually been confused about this point for quite some time now, but never had a chance to bring it up. I do not understand why UnicodeError should be raised if there is no BOM. I know that PEP 100 says:

'utf-16': 16-bit variable length encoding (little/big endian)

and:

Note: 'utf-16' should be implemented by using and requiring byte order marks (BOM) for file input/output.

But this appears to be in error, at least in the current Unicode standard. 'utf-16', as defined by the Unicode standard, is big-endian in the absence of a BOM:

    3.10 D42: UTF-16 encoding scheme: ... The UTF-16 encoding scheme may
    or may not begin with a BOM. However, when there is no BOM, and in the
    absence of a higher-level protocol, the byte order of the UTF-16
    encoding scheme is big-endian.

The current implementation of the utf-16 codec forces some irritating gymnastics: if a file contains no BOM, you must splice one into the data before you can decode it, which seems very much like a bug in the codec. I allow for the possibility that this was ambiguous in the standard when the PEP was written, but it is certainly not ambiguous now.—Nicholas
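
The gymnastics in question, sketched with current Python (codecs.BOM_UTF16_BE is the byte pair FE FF; the data is reduced to a bytes literal for brevity):

    import codecs

    data = b"\x00h\x00i"   # BOM-less UTF-16, big-endian per the standard

    # Workaround with the plain utf-16 codec: splice in a BOM before decoding.
    print((codecs.BOM_UTF16_BE + data).decode("utf-16"))   # 'hi'

    # The explicit-endian codec needs no such tricks.
    print(data.decode("utf-16-be"))                        # 'hi'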

Hmm. A string is not a stream, but it could be the contents of a stream.

A typical application of codecs goes like this:

    data = stream.read()
    [analyze data, e.g. by checking whether there is encoding= in <?xml ...]
    data = data.decode(<encoding analyzed>)

So people do use the "decode-it-all" mode, where no sequential access is necessary - yet the beginning of the string is still the beginning of what once was a stream. This case must be supported.—Martin
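
Fleshed out, the pattern might look like this; the decode_xml helper and its regular expression are illustrative only, not a robust XML encoding sniffer:

    import re

    def decode_xml(data):
        # Sniff encoding= from the XML declaration, which is ASCII-compatible
        # in the encodings considered here; fall back to UTF-8 otherwise.
        m = re.match(rb"<\?xml[^>]*encoding=['\"]([A-Za-z0-9._-]+)['\"]", data)
        encoding = m.group(1).decode("ascii") if m else "utf-8"
        return data.decode(encoding)

    print(decode_xml(b"<?xml version='1.0' encoding='iso-8859-1'?><a>\xe9</a>"))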

Of course it must be supported. My point is that many strings (in my applications, all but those strings that result from slurping in a file or process output in one go -- example, not a statistically valid sample!) are not the beginning of "what once was a stream". It is error-prone (not to mention unaesthetic) to not make that distinction.

"Explicit is better than implicit."—Stephen

I can’t put these two paragraphs together. If you think that explicit is better than implicit, why do you not want to make different calls for the first chunk of a stream, and the subsequent chunks?—Martin

Because the signature/BOM is not a chunk, it’s a header. Handling the signature/BOM is part of stream initialization, not translation, to my mind.

The point is that explicitly using a stream shows that initialization (and finalization) matter. The default can be BOM or not, as a pragmatic matter. But then the stream data itself can be treated homogeneously, as implied by the notion of stream.

I think it probably also would solve Walter’s conundrum about buffering the signature/BOM if responsibility for that were moved out of the codecs and into the objects where signatures make sense.

I don’t know whether that’s really feasible in the short run -- I suspect there may be a lot of stream-like modules that would need to be updated -- but it would be saner in the long run.—Stephen
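
Nothing of the sort exists in the standard library; a toy version of what Stephen is proposing might look like this (SignatureReader is a made-up name). Note that Walter’s two-byte file comes through intact, because the signature check happens once, at initialization:

    import codecs
    import io

    class SignatureReader:
        # Toy sketch: the optional UTF-8 signature is handled as part of
        # stream initialization; afterwards the data is decoded
        # homogeneously by a plain utf-8 reader.
        def __init__(self, raw):
            head = raw.read(3)
            if head != codecs.BOM_UTF8:
                # Not a signature: push the bytes back.  (Simplistic -- this
                # slurps the rest of the stream to do so.)
                raw = io.BytesIO(head + raw.read())
            self.reader = codecs.getreader("utf-8")(raw)

        def read(self, size=-1):
            return self.reader.read(size)

    print(SignatureReader(io.BytesIO(codecs.BOM_UTF8 + b"ab")).read())  # 'ab'
    print(SignatureReader(io.BytesIO(b"ab")).read())                    # 'ab'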

What I think should be provided is a stateful object encapsulating the codec. I.e., to avoid the need to write

    out = chunk[0].encode("utf-8-sig") + chunk[1].encode("utf-8")

—Stephen

No. People who want streaming should use cStringIO, i.e.

    >>> s = cStringIO.StringIO()
    >>> s1 = codecs.getwriter("utf-8")(s)
    >>> s1.write(u"Hallo")
    >>> s.getvalue()
    'Hallo'

Regards, Martin—"Martin

Yes! Exactly (except in reverse, we want to _read_ from the slurped stream-as-string, not write to one)! ... and there’s no need for a utf-8-sig codec for strings, since you can support the usage in exactly this way.—Stephen
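
In the reading direction the same trick looks like this in current Python (io.BytesIO in place of cStringIO, since the slurped data is bytes):

    import codecs
    import io

    raw = b"\xef\xbb\xbfHallo"    # signature + data, as slurped from a file
    reader = codecs.getreader("utf-8-sig")(io.BytesIO(raw))
    print(reader.read())          # 'Hallo' -- the signature is gone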

However, if there is a utf-8-sig codec for streams, there is currently no way of preventing this codec from also being available for strings. The very same code is used for streams and for strings, and automatically so.

Regards, Martin—"Martin

And of course it should be. But if it’s not possible to move the -sig facility out of the codecs into the streams, that would be a shame. I think we should encourage people to use streams where initialization or finalization semantics are non-trivial, as they are with signatures.

But as long as both utf-8-we-dont-need-no-steenkin-sigs-in-strings and utf-8-sig are available, I can program as I want to (and refer those whose strings get cratered by stray BOMs to you<wink>).—Stephen