bug-gnu-utils

Re: uuencode: multi-bytes char in remote file name contains bytes >0x80


From: Eric Blake
Subject: Re: uuencode: multi-bytes char in remote file name contains bytes >0x80
Date: Wed, 06 Jul 2011 15:15:54 -0600
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.17) Gecko/20110428 Fedora/3.1.10-1.fc14 Lightning/1.0b3pre Mnenhy/0.8.3 Thunderbird/3.1.10

On 07/06/2011 02:21 PM, Bruce Korb wrote:
> On 07/06/11 12:55, Bruno Haible wrote:
>> Bruce Korb wrote:
>>> I think the arguments are sufficient to make the changes.
>>> The change will include uudecode changes so it can detect
>>> and handle the encoded file names, and uudecode will get
>>> an "encode-filename" ("-e") option.
>>
>> Where and how will the charset conversion of the filenames be handled?
> 
> Yes, it will be.

The only sane approach is to assume that the current locale of the user
running uuencode normally sees sane filenames, and transliterate from
the user's locale into UTF-8.  Either the filename is a character string
in the user's current locale (and therefore, every character can be
transliterated into UTF-8; perhaps trivially if the user's locale is
already UTF-8), or the filename is already random bytes that the user
cannot see as characters in their current locale.  In the latter case,
you can still do a 1:1 mapping, where each invalid byte is mapped to the
2nd half of a surrogate pair (a lone low surrogate code point).
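
As a rough sketch of this sender-side step: Python's "surrogateescape"
error handler (PEP 383) implements this kind of mapping, turning each
undecodable byte 0xNN into the lone low surrogate U+DCNN.  The function
name and the sample filenames below are made up for illustration; this is
not what uuencode itself does.

```python
def filename_bytes_to_text(raw: bytes, locale_charset: str) -> str:
    """Decode filename bytes from the sender's charset into Unicode text.

    Bytes that are not valid in locale_charset are kept, each one mapped
    1:1 to a lone low surrogate (U+DC80..U+DCFF) by "surrogateescape".
    """
    return raw.decode(locale_charset, errors="surrogateescape")


# A filename that is a proper character string in a Latin-1 locale.
name = filename_bytes_to_text("jörg".encode("latin-1"), "latin-1")

# The same bytes seen from a UTF-8 locale: 0xF6 is not valid UTF-8 here,
# so it survives as the lone surrogate U+DCF6.
stray = filename_bytes_to_text(b"j\xf6rg", "utf-8")
```

Because the mapping is 1:1, encoding `stray` again with the same error
handler gives back the original bytes.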

Then, take that UTF-8 multibyte sequence (including 2nd-half surrogate
pair mappings for all invalid bytes that were not characters), and
flatten it into something that is just ascii.
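
One possible way to do the flattening, sketched under assumptions: encode
the text as UTF-8 (letting lone surrogates through with Python's
"surrogatepass" handler) and then write every byte outside a small
portable set as =XX, wrapped in the =?utf-8?Q?...?= form.  The helper
name and the choice of "safe" bytes are hypothetical, not a proposed
uuencode format.

```python
import string

# Assumption: only these bytes are treated as portable and left as-is.
SAFE = frozenset((string.ascii_letters + string.digits + ".-").encode("ascii"))


def flatten_to_ascii(name: str) -> str:
    """Return an ASCII-only form of name, escaping non-portable bytes."""
    # "surrogatepass" encodes a lone surrogate as a 3-byte sequence
    # instead of raising, so escaped raw bytes survive this step.
    data = name.encode("utf-8", errors="surrogatepass")
    if all(b in SAFE for b in data):
        return name  # already portable, nothing to encode
    body = "".join(chr(b) if b in SAFE else f"={b:02X}" for b in data)
    return f"=?utf-8?Q?{body}?="


encoded = flatten_to_ascii("jörg")
escaped = flatten_to_ascii("j\udcf6rg")  # contains a lone surrogate
plain = flatten_to_ascii("readme.txt")
```

Only names that need it are changed; a fully portable name passes through
untouched.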

On the uudecode side, take the ascii and convert it back to UTF-8, then
transliterate into the user's current locale.  Here, the transliteration
might be lossy (if the user's charset doesn't support all the characters
that were in the input); I'm not sure whether best practice is to
transliterate from the unrepresentable character to '?' or to leave the
unrepresentable character as raw Unicode bytes (the latter is what leads
to mojibake).  But if the receiver's current locale is UTF-8, lossy
transliteration is not an issue.  Meanwhile, if the encoded string
contained any unmatched 2nd halves of surrogate pairs, you can
unambiguously recover the raw byte that was not a character, and use
that byte as-is.
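
A minimal sketch of the receiver-side conversion, assuming the ASCII form
has already been turned back into a Unicode string.  It picks the '?'
option for unrepresentable characters (the other option would be to emit
the raw bytes), and it assumes a stateless receiver charset, since it
encodes one character at a time.  The function name is hypothetical.

```python
def text_to_filename_bytes(name: str, locale_charset: str) -> bytes:
    """Convert decoded filename text into bytes for the receiver's charset."""
    out = bytearray()
    for ch in name:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # Lone low surrogate: recover the raw byte it stands for.
            out.append(cp - 0xDC00)
            continue
        try:
            out += ch.encode(locale_charset)
        except UnicodeEncodeError:
            # Character does not exist in the receiver's charset.
            out += b"?"
    return bytes(out)


same_charset = text_to_filename_bytes("jörg", "latin-1")
lossy = text_to_filename_bytes("jörg", "ascii")
raw_byte = text_to_filename_bytes("j\udcf6rg", "utf-8")
```

With a UTF-8 receiver, only the escaped raw bytes are special; every real
character is representable, so nothing is replaced by '?'.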

The nice part about this algorithm is that if both sender and receiver
only use a subset of characters that exist in both charsets, then they
both see the same filename, even if the two locations are using
different charsets.  If the receiver is using UTF-8 (which is more and
more common these days), they will see whatever name the sender saw
regardless of the sender's charset.  The only place where mojibake still
happens is if the sender uses characters that are not in the receiver's
charset - and that's not entirely a real loss, since it was already the
case that the sender is doing non-portable things by sending
non-portable filename characters in the first place.

>>       or
>>          =?utf-8?Q?j=C3=B6rg?=

You want some sort of utf-8 encoding, and preferably one that encodes
only the non-portable characters.  This type of naming looks best to me.
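
The quoted form has the same shape as an RFC 2047 encoded word, so
existing mail tooling can already read it.  As an illustration only (the
helper name is made up, and nothing here says uudecode would use this
code path), Python's standard `email.header.decode_header` decodes it:

```python
from email.header import decode_header


def decode_encoded_word(word: str) -> str:
    """Decode an RFC 2047 style name; plain names are returned unchanged."""
    parts = []
    for chunk, charset in decode_header(word):
        if isinstance(chunk, bytes):
            # Encoded chunks come back as bytes plus their charset name.
            parts.append(chunk.decode(charset or "ascii"))
        else:
            parts.append(chunk)
    return "".join(parts)


decoded = decode_encoded_word("=?utf-8?Q?j=C3=B6rg?=")
untouched = decode_encoded_word("plain.txt")
```

Names made only of portable characters stay readable as-is, which is the
property being asked for above.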

-- 
Eric Blake   address@hidden    +1-801-349-2682
Libvirt virtualization library http://libvirt.org


