Re: i18n? unicode?
From: Simon Josefsson
Subject: Re: i18n? unicode?
Date: Wed, 13 Feb 2002 10:26:54 +0100 (CET)
On 13 Feb 2002, Alex Shinn wrote:
> >>>>> "Simon" == Simon Josefsson <address@hidden> writes:
>
> Simon> Is anyone working on Unicode and/or support for various other
> Simon> encodings for guile strings?
>
> Simon> I guess this would be one major issue that needs to be done
> Simon> before a guile emacs can happen.
>
> Some work has been done off and on, but it's not a simple problem.
From what little I have understood, I have understood this much. :-) That's
why I'd like to see it designed cleanly, or at least documented, instead of
the (to me) confusing Emacs MULE stuff.
> One of the big catches is that Guile wants to both replace Emacs-Lisp
> and extend well with C. For efficient multi-byte strings, Emacs-Lisp
> has its own string-representation, and the obvious idea would be to do
> likewise (probably using unicode instead of mule), but then you don't
> play well with C libraries and have to do conversions everywhere.
Are "automatic" conversions really necessary? The "automatic" (guessing)
logic of Emacs MULE seems to cause unexpected behaviour at times.
> Another annoyance is that R5RS pretty clearly treats strings as
> character arrays, but many multibyte encodings are not arrays, so
> procedures like string-ref and string-set! become slow.
This is bad. Isn't there any standardisation work going on to fix this?
> The only Scheme I know of that has decent multibyte support is Gauche,
> and that is at the expense of performance on string-ref and the like.
> To make up for this it provides string pointers to loop through strings.
> A C API for extensions would presumably need to do explicit conversions.
Seems like a hack...
> Bigloo has limited ucs2 support, but not really unified - you have to
> know what strings you're working with.
Internally, this approach seems best to me -- if you don't know what
strings you're working with, you can't expect things to work. Of course,
users can't be expected to know these things, but I don't see why users
would need to concern themselves with the low-level interface.
> Kawa is implemented in Java, so has as good unicode support as Java.
> But then you're tied to Java.
Yup, its Unicode support is great. But support for most other encodings,
and unification between them, isn't great as far as I understand.
I think basing this on the character set stuff available in GNU libc and
iconv would make it behave like "other" applications, which is a good
thing:
http://www.gnu.org/manual/glibc-2.2.3/html_chapter/libc_6.html
I'm not sure if the support is sufficient, but maybe it can be extended if
not.
> There are some preliminary charset conversion routines at
>
> http://synthcode.com/gumm/packages/a/ams/guile-charset-0.01.tar.gz
Thanks, I'll have a look at it.
> which only does uninteresting 8-bit conversions at the moment. One
> potential idea of this, though, is to implement multi-byte string
> handling entirely in Scheme, and redefine basic string/port procedures
> using generic methods to handle different string types. Kind of a hack
> (btw, this is how Perl5 does it) but could get people started writing
> multi-byte string apps and the upgrade (internal support for different
> strings in string procedures) means they won't have to change their
> code.
Yes... it would be kind of a hack. I'll look into the GNU libc path for
now.