I recently rediscovered this strange behaviour in Python’s Unicode handling.—Evan

OK, here is the patch: <python.org>

The stateful decoder has a little problem: at least three bytes have to be available from the stream before the StreamReader can decide whether these bytes are a BOM that has to be skipped. This means that if the file only contains "ab", the user will never see these two characters.

A solution for this would be to add an argument named final to the decode and read methods that tells the decoder that the stream has ended and the remaining buffered bytes have to be handled now.

Bye,—Walter
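To make the problem and the proposed fix concrete, here is a minimal sketch. `NaiveSigDecoder` is a hypothetical class written for illustration, not the code from the patch; it assumes the decoder simply buffers input until three bytes are available, and it takes the `final` flag suggested above.

```python
import codecs

BOM = codecs.BOM_UTF8  # b"\xef\xbb\xbf"


class NaiveSigDecoder:
    """Hypothetical decoder that waits for three bytes before checking for a BOM."""

    def __init__(self):
        self.buffer = b""
        self.checked = False  # True once the BOM decision has been made
        self.utf8 = codecs.getincrementaldecoder("utf-8")()

    def decode(self, data, final=False):
        if not self.checked:
            self.buffer += data
            if len(self.buffer) < 3 and not final:
                return ""  # still undecided: nothing is returned yet
            self.checked = True
            data, self.buffer = self.buffer, b""
            if data.startswith(BOM):
                data = data[3:]  # skip the signature
        return self.utf8.decode(data, final)


stuck = NaiveSigDecoder()
without_final = stuck.decode(b"ab")  # the two bytes stay buffered

flushed = NaiveSigDecoder()
with_final = flushed.decode(b"ab", final=True)  # end of stream: decode them now
```

Without `final`, the caller gets an empty string back for "ab"; with `final=True`, the pending bytes go through the ordinary UTF-8 decoder, so a truncated signature such as `b"\xef\xbb"` is reported through normal error handling.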

Shouldn’t the decoder be capable of doing a partial match and quitting early? After all, "ab" is encoded in UTF-8 as <61> <62> but the BOM is <ef> <bb> <bf>. If it did this type of partial matching, this issue would be avoided except in rare situations.—Evan
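The partial matching suggested here could look roughly like the following. `bom_status` is a hypothetical helper name used only for this sketch; the idea is to compare whatever bytes have arrived against the corresponding prefix of the BOM instead of waiting for all three.

```python
import codecs

BOM = codecs.BOM_UTF8  # b"\xef\xbb\xbf"


def bom_status(prefix):
    """Classify the first bytes of a stream: 'bom', 'undecided' or 'no-bom'."""
    if prefix.startswith(BOM):
        return "bom"
    if BOM.startswith(prefix):
        # Everything seen so far (possibly nothing) matches the start of the BOM.
        return "undecided"
    return "no-bom"


status_ab = bom_status(b"ab")               # first byte already rules out a BOM
status_partial = bom_status(b"\xef\xbb")    # could still become a BOM
status_full = bom_status(b"\xef\xbb\xbfab")
```

With this check, "ab" is classified as having no BOM after its first byte, so the decoder only has to keep waiting when the stream really starts with `<ef>` or `<ef> <bb>`.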

Theoretically the name is unimportant, but read(..., final=True) or flush() or close() should subject the pending bytes to normal error handling and must return the result of decoding these pending bytes just like the other methods do. This would mean that we would have to implement a decodeclose(), a readclose() and a readlineclose(). IMHO it would be best to add this argument to decode, read and readline directly. But I’m not sure what this would mean for iterating through a StreamReader.

Bye,—Walter

This can be improved, of course: If the first byte is "a", it most definitely is not a UTF-8 signature.

So we only need a second byte for the characters between U+F000 and U+FFFF, and a third byte only for the characters U+FEC0...U+FEFF. But with the first byte being \xef, we need three bytes anyway, so we can always decide with the first byte only whether we need to wait for three bytes.—Martin
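The code point ranges quoted above can be checked by encoding them and looking at the leading bytes. This is only a verification sketch; `utf8_prefixes` is a helper name made up for it.

```python
def utf8_prefixes(lo, hi, n):
    """Return the set of n-byte UTF-8 prefixes of the code points lo..hi."""
    return {chr(cp).encode("utf-8")[:n] for cp in range(lo, hi + 1)}


# Every code point in U+F000..U+FFFF starts with the same lead byte.
lead_f000_ffff = utf8_prefixes(0xF000, 0xFFFF, 1)

# Every code point in U+FEC0..U+FEFF shares the first two bytes of the BOM.
lead_fec0_feff = utf8_prefixes(0xFEC0, 0xFEFF, 2)

# Neighbours just outside each range differ in the byte in question.
below_f000 = chr(0xEFFF).encode("utf-8")
below_fec0 = chr(0xFEBF).encode("utf-8")
above_feff = chr(0xFF00).encode("utf-8")
```

The first set contains only `b"\xef"` and the second only `b"\xef\xbb"`, which matches the claim that the first byte alone tells the decoder whether it has to wait for more input.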

There are situations where the byte stream might be temporarily exhausted, e.g. an XML parser that tries to support the IncrementalParser interface, or when you want to decode encoded data piecewise, because you want to give a progress report.

Bye,—Walter

Yes, but these are not file-like objects. In the IncrementalParser, it is not the case that a read operation returns an empty string. Instead, the application repeatedly feeds data explicitly. For a file-like object, returning "" indicates EOF.

Regards, Martin—Martin
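The difference between the two styles discussed above can be sketched with the incremental decoder interface in the standard `codecs` module of current Python, whose `decode()` takes a `final` flag. The sample data and variable names are chosen for illustration, and in Python 3 a binary stream signals EOF with `b""` rather than `""`.

```python
import codecs
import io

data = "é!".encode("utf-8")  # b"\xc3\xa9!"

# Push style: the application feeds chunks explicitly, so an empty chunk
# only means "nothing available right now", not end of input.
pushed = codecs.getincrementaldecoder("utf-8")()
parts = [pushed.decode(chunk) for chunk in (b"\xc3", b"", b"\xa9!")]
parts.append(pushed.decode(b"", final=True))  # the caller announces the end

# Pull style: read() on a file-like object returns b"" only at EOF,
# so the empty result itself can be used as the "final" signal.
stream = io.BytesIO(data)
pulled = codecs.getincrementaldecoder("utf-8")()
text = ""
while True:
    chunk = stream.read(1)
    text += pulled.decode(chunk, final=not chunk)
    if not chunk:
        break
```

In the push case the decoder holds back the incomplete `\xc3` until the rest of the character arrives; in the pull case no extra argument is needed from the user, because the empty read already marks the end of the stream.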