Wed, 04 Feb 2009 00:23:14 +0000
Thanks for explaining...
Mike Gran <address@hidden> writes:
> Right now, the internal coding of strings is an unspecified 8-bit
> encoding, and is assumed to be compatible with the locale in which it
> is being run.
> So if I have a guile string with some 8-bit character that is between
> 128 and 255, it just gets passed through. If I request the contents
> of that string from C with scm_to_locale_string, it just returns the
> buffer of the scheme string.
> But, in future, scm_to_locale_string or scm_to_locale_stringbuf should
> actually do the proper conversion to the current locale so that wide
> characters are printed properly.
> So, if we move the internal representation of strings away from
> unspecified 8-bit data and toward something concrete, like ISO-8859-1
> or UCS-4, and if a program is running in an environment whose locale
> has a multibyte encoding like UTF-8, then the created locale
> string could have multi-byte characters.
> Consider a scheme string that is internally the single character
> "LATIN SMALL LETTER A WITH ACUTE", which is U+00E1. If the locale
> were some sort of UTF-8, like en_US.utf-8, this letter should become
> the two bytes 0xC3 and 0xA1 when converted to the locale.
Right. I'm happy with all this.
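To double-check the arithmetic in that example, here is a minimal sketch in C of the UTF-8 encoding rule for code points below U+0800. The helper name utf8_encode_small is made up for illustration and is not part of libguile.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a libguile function): encode one code point
   below U+0800 as UTF-8.  Returns the number of bytes written to OUT,
   which is 1 for ASCII and 2 otherwise.  */
static size_t
utf8_encode_small (uint32_t cp, unsigned char out[2])
{
  assert (cp < 0x800);          /* only the 1- and 2-byte forms are handled */
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  out[0] = (unsigned char) (0xC0 | (cp >> 6));    /* lead byte: 110xxxxx */
  out[1] = (unsigned char) (0x80 | (cp & 0x3F));  /* continuation: 10xxxxxx */
  return 2;
}
```

For U+00E1 the top bits give 0xC0 | 0x03 = 0xC3 and the low six bits give 0x80 | 0x21 = 0xA1, matching the two bytes quoted above.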
> So what should happen in this case if I call scm_to_locale_stringbuf
> (str, buf, 1)? Note that here BUF can only contain 1 byte.
I think the key thing is that scm_to_locale_stringbuf () will return
2. This tells the caller that BUF wasn't big enough. Beyond that, we
shouldn't do something obviously misleading, but I don't think it
matters very much what we choose to do.
> the one byte 0xC3 be copied into it, which creates an illegal
No. I agree that that would feel "obviously misleading".
> Or, should nothing be copied into it?
That - in other words no change to BUF at all - sounds good to me.
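To make that concrete, here is a rough sketch in C of the behaviour I have in mind. It is not the libguile implementation: latin1_to_utf8_buf is a hypothetical stand-in that assumes the internal string is ISO-8859-1 and the locale is UTF-8, and that keeps the existing convention of returning the number of bytes needed for the whole string.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for scm_to_locale_stringbuf, assuming a Latin-1
   internal string and a UTF-8 locale.  Only whole characters are written
   to BUF.  The return value is the number of bytes the full conversion
   needs, whether or not BUF was big enough.  */
static size_t
latin1_to_utf8_buf (const unsigned char *str, size_t len,
                    char *buf, size_t max_len)
{
  size_t needed = 0;    /* bytes required for all of STR */
  size_t used = 0;      /* bytes actually stored in BUF */
  int full = 0;         /* set once a character did not fit */

  assert (buf != NULL || max_len == 0);
  for (size_t i = 0; i < len; i++)
    {
      size_t n = str[i] < 0x80 ? 1 : 2;
      needed += n;
      if (!full && used + n <= max_len)
        {
          if (n == 1)
            buf[used] = (char) str[i];
          else
            {
              buf[used] = (char) (0xC0 | (str[i] >> 6));
              buf[used + 1] = (char) (0x80 | (str[i] & 0x3F));
            }
          used += n;
        }
      else
        full = 1;       /* never store half of a multibyte character */
    }
  return needed;
}
```

With a one-character string holding U+00E1 and max_len of 1, this returns 2 and leaves BUF untouched, so the caller learns from the return value alone that BUF was too small.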
> In either case, there should be some mechanism in the API to
> provide information that an incomplete last character has occurred,
> because outputting just the one byte 0xC3 would cause problems
> somewhere down the road.
I don't follow your "in either case" - because in the second case we
haven't output 0xC3.
You may still be right that we need some mechanism to say that some
bytes at the end of BUF were not used, but the case for this isn't
obvious to me yet.
> So what I was saying was that in this case maybe the best thing to do
> would be to pad the output buffer with '\0' instead of putting in half
> of a multibyte character,
Padding feels wrong to me. We wouldn't pad if the caller supplied a
BUF of length 10 and a string that needed only 3 bytes.
> Sorry for the book-length explanation,
No problem. I think the key question remains: why is the existing API
(i.e. the existing return value) not good enough?
I guess there could be a scenario where the caller has a fixed size
buffer, and just wants to copy in as much of an arbitrary string as
will fit, and then use that possibly truncated string somehow.
Depending on the API that the string is being passed on to, any of the
following could be most useful:
- padding the unused bytes of BUF with \0 (or some other value)
- adding a single \0 (or other value) in the first unused byte
- returning a pointer (or offset in bytes) to the first unused byte
- returning the number of characters written.
Returning both <number of chars written> and <number of bytes used>
would allow the caller to do any of those efficiently, so perhaps we
should do that?
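As a strawman only, the shape of such an interface might look like the C sketch below. Every name in it (struct copy_result, latin1_to_utf8_buf_ex, pad_unused) is invented for illustration, and it again assumes a Latin-1 internal string and a UTF-8 locale; nothing like this exists in libguile today.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type: what the caller would need in order to pad,
   terminate, or locate the first unused byte of BUF.  */
struct copy_result
{
  size_t needed;         /* bytes required for the whole string */
  size_t bytes_used;     /* bytes actually stored in BUF */
  size_t chars_written;  /* whole characters stored in BUF */
};

/* Hypothetical variant that writes whole characters only and reports
   both the byte count and the character count.  */
static struct copy_result
latin1_to_utf8_buf_ex (const unsigned char *str, size_t len,
                       char *buf, size_t max_len)
{
  struct copy_result r = { 0, 0, 0 };
  int full = 0;

  assert (buf != NULL || max_len == 0);
  for (size_t i = 0; i < len; i++)
    {
      size_t n = str[i] < 0x80 ? 1 : 2;
      r.needed += n;
      if (full || r.bytes_used + n > max_len)
        {
          full = 1;      /* stop at the first character that does not fit */
          continue;
        }
      if (n == 1)
        buf[r.bytes_used] = (char) str[i];
      else
        {
          buf[r.bytes_used] = (char) (0xC0 | (str[i] >> 6));
          buf[r.bytes_used + 1] = (char) (0x80 | (str[i] & 0x3F));
        }
      r.bytes_used += n;
      r.chars_written++;
    }
  return r;
}

/* Caller-side choice: zero-fill whatever part of BUF was not used.  */
static void
pad_unused (char *buf, size_t max_len, struct copy_result r)
{
  memset (buf + r.bytes_used, 0, max_len - r.bytes_used);
}
```

A caller that wants a single terminator instead of full padding can store '\0' at buf[r.bytes_used] when r.bytes_used < max_len, and buf + r.bytes_used is the pointer to the first unused byte, so all four options in the list above stay in the caller's hands.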