
Re: [bug-libunistring] roundtrippable encoding support


From: David Kastrup
Subject: Re: [bug-libunistring] roundtrippable encoding support
Date: Wed, 15 Oct 2014 09:45:56 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.4.50 (gnu/linux)

Daiki Ueno <address@hidden> writes:

> Ben Pfaff <address@hidden> writes:
>
>> On Thu, Oct 09, 2014 at 06:04:02PM +0200, David Kastrup wrote:
>>> What I am actually more interested in is in having libunistring offer
>>> "roundtrippable" encodings as a fallback for decoding errors.
>>> Basically, I want an option for decoding where libunistring announces
>>> "what you have here is not valid utf-8 but I know how to deal with it".
>>> Including reencoding.  And delivering unique "character codes" and
>>> string length calculations.  The application would either keep track of
>>> having received "dirty utf-8" and would reencode when putting out utf-8
>>> (where reencoding "internal utf-8" to "external utf-8" means replacing
>>> the 2-byte sequences representing a wild byte by their original byte),
>>> or it would reencode into "external" utf-8 when writing anyway which
>>> would not change anything for originally valid utf-8.
>>
>> It sounds like a reasonable philosophy to me.  I don't think I'd want
>> this to become the only option for libunistring, but if there's a
>> practical way to add alternate interfaces, etc., then I think that would
>> be valuable.
>
> I don't have anything to add.  I think it would be nice if Guile had
> transparent support for "raw-bytes" and UTF-8 sequences[1], but I don't
> think it is a good idea to expose internal "character codes" or
> "internal utf-8" representation from the library interface.
>
> [1] for example, the results of decoding external byte sequences
>     "\xC2\xA0" and "\xA0" should report the same character code in the
>     REPL, but they are internally distinguished and converted to the
>     original
>     but they are internally distinguished and converted to the original
>     bytes when writing, like Emacs does.
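For illustration, a minimal sketch of one well-known variant of this
idea is Python's "surrogateescape" error handler (PEP 383).  Unlike the
Emacs-style scheme described above, the escape code points are visibly
distinct rather than hidden, but the round-trip property is the same:
undecodable bytes map to reserved code points and re-encode to the
original bytes, while valid UTF-8 passes through unchanged.

```python
# Round-trippable decoding via Python's "surrogateescape" handler
# (PEP 383).  Bytes that are not valid UTF-8 are mapped to the lone
# surrogates U+DC80..U+DCFF and re-encode back to the original bytes.

valid = b"\xc2\xa0"   # well-formed UTF-8 for U+00A0 (NO-BREAK SPACE)
wild = b"\xa0"        # a stray continuation byte: invalid UTF-8

s_valid = valid.decode("utf-8", errors="surrogateescape")
s_wild = wild.decode("utf-8", errors="surrogateescape")

# The escape code point is distinguishable from the real character ...
assert s_valid == "\u00a0" and s_wild == "\udca0"

# ... and both strings re-encode to their original byte sequences.
assert s_valid.encode("utf-8", errors="surrogateescape") == valid
assert s_wild.encode("utf-8", errors="surrogateescape") == wild
```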

In this respect I beg to differ.  Carrying "invisible" information is a
recipe for security problems and/or inscrutable behavior.  It would also
mean that in some use cases that produce and consume strings, this
invisible information would simply disappear.

Since a "raw byte" is not the same as a character, I see no particular
point in selling it as something belonging to the Unicode codepoint
space: the purpose of this proposal is not to make codepoint-based
processing on raw bytes particularly convenient; if that were the goal,
one would not have decoded the byte stream in the first place.  And
since a fair number of random byte combinations _do_ decode under utf-8
into proper Unicode characters, this representation would not really be
much use in that respect anyway.
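To make the last point concrete, here is a hypothetical sketch (again
using Python's surrogateescape handler as a stand-in for such a
scheme): part of a binary blob happens to be well-formed UTF-8 and
therefore decodes into an ordinary character, so the decoded form alone
cannot reliably tell raw bytes from real text.

```python
# A binary blob whose first two bytes accidentally form valid UTF-8.
blob = b"\xc3\xa9\x00\xff"

s = blob.decode("utf-8", errors="surrogateescape")

# b"\xc3\xa9" decodes to the ordinary character U+00E9 ("é");
# only the genuinely invalid \xff byte gets an escape code point.
assert s == "\u00e9\x00\udcff"

# The original bytes are still recoverable on re-encoding.
assert s.encode("utf-8", errors="surrogateescape") == blob
```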

Rather, the point is not to lose information by default, and to be free
to choose one's own fallback strategies, including transparent
treatment of binary passages in mixed-mode files or streams, without
significant performance degradation.

Character points outside Unicode proper seem like they would usually
work well with things like positive and negative character ranges in
regular expressions, not matching where they are not supposed to match.
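As a rough illustration (again with Python's surrogate escapes standing
in for such out-of-Unicode points), these code points fall outside the
usual character classes, so patterns behave conservatively around them:

```python
import re

# "caf" followed by the escape code point for the wild byte 0xA0.
s = b"caf\xa0".decode("utf-8", errors="surrogateescape")

# \w matches Unicode word characters; the surrogate escape is not one,
# so a word-character match stops before it.
assert re.match(r"\w+", s).group() == "caf"

# A negated character class, on the other hand, does match it.
assert re.fullmatch(r"caf[^a-z]", s) is not None
```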

-- 
David Kastrup


