I recently rediscovered this strange behaviour in Python’s Unicode handling.—Evan

-1; there’s no standard for UTF-8 BOMs—adding it to the codecs module was probably a mistake to begin with.—M.-A.

There is a standard for UTF-8 signatures, however.—Stephen

With the UTF-8-SIG codec, it would apply to all operation modes of the codec, whether stream-based or from strings.—Martin
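As a small illustration of the "all operation modes" point, the sketch below decodes the same signed bytes once as a string and once through a stream. It assumes a current Python 3 interpreter, where the `utf-8-sig` codec ships in the standard library; the variable names are only for illustration, and this shows present-day behaviour rather than the exact proposal under discussion.

```python
import codecs
import io

# UTF-8 signature followed by the payload "abc".
raw = codecs.BOM_UTF8 + "abc".encode("utf-8")

# String (one-shot) mode.
plain = raw.decode("utf-8")          # plain UTF-8 keeps U+FEFF as the first character
stripped = raw.decode("utf-8-sig")   # the -sig variant drops the signature

# Stream mode: the same codec applied through a text wrapper.
with io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8-sig") as f:
    from_stream = f.read()
```

Both `stripped` and `from_stream` come out without the signature, while `plain` still starts with U+FEFF.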

I had in mind the ability to treat a string as a stream.—Stephen

Hmm. A string is not a stream, but it could be the contents of a stream.—Martin

Of course it must be supported.—Stephen

I can’t put these two paragraphs together.—Martin

Because the signature/BOM is not a chunk, it’s a header. Handling the signature/BOM is part of stream initialization, not translation, to my mind.

The point is that explicitly using a stream shows that initialization (and finalization) matter. The default can be BOM or not, as a pragmatic matter. But then the stream data itself can be treated homogeneously, as implied by the notion of stream.

I think it probably also would solve Walter’s conundrum about buffering the signature/BOM if responsibility for that were moved out of the codecs and into the objects where signatures make sense.

I don’t know whether that’s really feasible in the short run---I suspect there may be a lot of stream-like modules that would need to be updated---but it would be saner in the long run.—Stephen
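One way to read "handling the signature is part of stream initialization" is sketched below in present-day Python 3. The helper `open_text` and its `strip_signature` parameter are hypothetical names invented for this illustration, not an existing or proposed API; the sketch also assumes a seekable binary stream. The signature is examined once when the stream is set up, and everything after that goes through the plain `utf-8` codec, which needs no knowledge of stream position.

```python
import codecs
import io

def open_text(binary_stream, strip_signature=True):
    """Hypothetical helper: deal with a UTF-8 signature once, at stream setup."""
    start = binary_stream.tell()
    head = binary_stream.read(len(codecs.BOM_UTF8))
    if not (strip_signature and head == codecs.BOM_UTF8):
        # No signature (or the caller wants it kept): rewind so the bytes
        # just read are treated as ordinary data.
        binary_stream.seek(start)
    # From here on the data is homogeneous and the plain codec suffices.
    return io.TextIOWrapper(binary_stream, encoding="utf-8")

text = open_text(io.BytesIO(codecs.BOM_UTF8 + b"abc")).read()
```

Whether stripping is the default is the pragmatic choice the message mentions; here it is simply a keyword argument.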

Not really. In every encoding where a sequence of more than one byte maps to one Unicode character, you will always need some kind of buffering. If we remove the handling of initial BOMs from the codecs (except for UTF-16 where it is required), this wouldn’t change any buffering requirements.—Walter
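The buffering requirement can be seen with the standard library's incremental decoder in current Python 3. This is a minimal sketch with illustrative variable names: a three-byte UTF-8 sequence is split across two chunks, and no BOM is involved at all.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()
euro = "\u20ac".encode("utf-8")  # EURO SIGN, three bytes in UTF-8

# The first chunk ends in the middle of the character, so the decoder
# has to hold on to the bytes and cannot emit anything yet.
first = decoder.decode(euro[:2])

# The final byte completes the sequence and the character is emitted.
second = decoder.decode(euro[2:])
```

After the first call the decoder has returned an empty string and is holding two bytes internally, which is the kind of state every multi-byte codec needs regardless of how signatures are handled.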

Sure. My point is that codecs should be stateful only to the extent needed to assemble semantically meaningful units (i.e., multioctet coded characters). In particular, they should not need to know about location at the beginning, middle, or end of some stream---because in the context of operating on a string they _can’t_.—Stephen

I’m not exactly sure what you’re proposing here. That all codecs (even UTF-16) pass the BOM through and some other infrastructure is responsible for dropping it?

Bye,—Walter

Not exactly. I think that at the lowest level codecs should not implement complex mode-switching internally, but rather explicitly abdicate responsibility to a more appropriate codec.

For example, autodetecting UTF-16 on input would be implemented by a Python program that does something like

    data = stream.read()
    for detector in ["utf-16-signature", "utf-16-statistical"]:
        # for the UTF-16 detectors, OUT will always be u"" or None
        out, data, codec = data.decode(detector)
        if codec:
            break
    while codec:
        more_out, data, codec = data.decode(codec)
        out = out + more_out
    if data:
        # a real program would complain about it
        pass
    process(out)

where decode("utf-16-signature") would be implemented

    def utf_16_signature_internal(data):
        if data[0:2] == "\xfe\xff":
            return (u"", data[2:], "utf-16-be")
        elif data[0:2] == "\xff\xfe":
            return (u"", data[2:], "utf-16-le")
        else:
            # note: data is undisturbed if the detector fails
            return (None, data, None)

The main point is that the detector is just a codec that stops when it figures out what the next codec should be, touches only data that would be incorrect to pass to the next codec, and leaves the data alone if detection fails. utf-16-signature only handles the BOM (if present), and does not handle arbitrary "chunks" of data. Instead, it passes on the rest of the data (including the first chunk) to be handled by the appropriate utf-16-?e codec.
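For readers who want to try the idea, here is a rough Python 3 adaptation that runs today. It is a sketch under stated assumptions: detectors are plain functions rather than registered codecs, `utf_16_signature` and `decode_with_detectors` are hypothetical names, and the hand-off happens only once, to a real codec, instead of looping through a chain of codecs as in the pseudocode above.

```python
import codecs

def utf_16_signature(data):
    """Hypothetical detector: consume only the BOM and name the next codec."""
    if data[:2] == codecs.BOM_UTF16_BE:
        return "", data[2:], "utf-16-be"
    if data[:2] == codecs.BOM_UTF16_LE:
        return "", data[2:], "utf-16-le"
    # Detection failed: the data is handed back untouched.
    return None, data, None

def decode_with_detectors(data, detectors):
    """Try each detector in turn, then pass the rest to the codec it names."""
    for detector in detectors:
        out, rest, codec = detector(data)
        if codec is not None:
            # Simplification: a single hand-off to a standard codec.
            return out + rest.decode(codec)
    raise ValueError("no detector recognised the input")

big_endian = decode_with_detectors(b"\xfe\xff\x00a", [utf_16_signature])
little_endian = decode_with_detectors(b"\xff\xfea\x00", [utf_16_signature])
```

The detector never decodes payload bytes itself; it strips only what the next codec must not see, which is the property the paragraph above insists on.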

I think that the temptation to encapsulate this logic in a utf-16 codec that "simplifies" things by calling the appropriate utf-16-?e codec itself should be deprecated, but YMMV. What I would really like is for the above style to be easier to achieve than it currently is.

BTW, I appreciate your patience in exploring this; after Martin’s remark about different mental models I have to suspect this approach is just somehow un-Pythonic, but fleshing it out this way I can see how it will be useful in the context of a different project.—Stephen

I’m sorry, but I’m losing track as to what precisely you are trying to say. You seem to be using a mental model that is entirely different from mine.—Martin

But what follows from that point? So it shows some kind of matter... what does that mean for actual changes to Python API?—Martin

What is "that" which might be really feasible? To "solve Walter’s conundrum"? That "signatures make sense"?

So I can’t really respond to your message in a meaningful way; I just let it rest...

Regards, Martin—Martin