Re: [Chicken-users] multilingual fowl

From:	Alex Shinn
Subject:	Re: [Chicken-users] multilingual fowl
Date:	Wed, 29 Sep 2004 03:42:34 -0500
User-agent:	Wanderlust/2.10.1 (Watching The Wheels) SEMI/1.14.6 (Maruoka) FLIM/1.14.6 (Marutamachi) APEL/10.6 Emacs/21.3 (i386-pc-linux-gnu) MULE/5.0 (SAKAKI)

At Wed, 29 Sep 2004 07:51:50 +0200, Felix Winkelmann wrote:
> 
> > However it's more convenient to just
> > program with normal string procedures all the time and decide when
> > you need to (require 'utf8) and when your app doesn't need i18n.
> 
> Yes, that would be possible. But one has to keep in mind that the
> compiler will replace calls to string primitives with non-unicode
> aware inline C calls when compiling with -O2 or higher, or with
> -usual-integrations. This would require something like
> 
> (declare (not standard-bindings string-ref ...))
> 
> If the set of specially handled primitives is small enough, we
> could of course fix the inline routines accordingly.

The following 9 R5RS procedures would be changed:

  make-string string string-ref string-set! string-length
  substring string->list list->string string-fill!

I'm guessing these don't get inlined, but in unit extras we would have
3 I/O procedures:

  read-string write-string read-token

and 7 string procedures:

  string-chop string-translate string-translate*
  substring=? substring-ci=? substring-index substring-index-ci

Also most of regex, SRFI-13 and SRFI-14 would be replaced.

> [...] My idea about unicode
> was actually to keep this separate (like bigloo's ucs2-... routines),
> even though a "native" handling of unicode is cleaner from the user's
> point of view than a separate, distinct datatype.

Bigloo's ucs2- routines are completely disjoint from strings.

  (string-append "abc" #u"def")  =>  error

This is similar to having bignums as a disjoint type (as Chicken does
with the GMP egg and Bigloo does partially with llongs).  I've used
disjoint bignums before to implement an RNG algorithm.  The
intermediate values were bignums, but the result was always modulo
some fixnum so as a whole the bignum use was encapsulated.  That's the
ideal scenario.  Alternately you may be writing a scientific
application or some such and if you want bignums you write the entire
thing using the bignum procedures.  For the most part the values you
work with are encapsulated and you probably won't find yourself
accidentally passing a bignum value as the index to vector-ref, so
again this isn't too bad.  Where you really lose out is in libraries.
If you want a library for some set of numerical operations such as
linear algebra or calculus or symbolic computation, then you need to
have a complete duplicate version of the library for bignums.

Strings, on the other hand, are sluts.  They go from port to port like
a randy sailor.  They come from the command line and environment
variables in unknown locales, they come from config files and
databases and are embedded in URLs and MIME headers.  You're forever
comparing them and searching them and passing them to foreign and
library procedures.  Consider a library like SSAX - should it return
normal strings or utf-8 strings?  If I want to search for a record in
a file whose <name> tag matches a user's name, I'm going to want utf-8
strings, but the string "name" itself I'm likely to think of as a
normal string.  In particular with a tool for this kind of search like
SXPATH you need this to be consistent, but it's not clear which
strings should be which type.  You end up with three options: 1)
assume all strings passed to and from external libraries are normal
strings and translate manually at the borders, 2) translate
automatically at borders, or 3) provide two versions of each library,
one for normal strings and one for utf-8 strings.  3 becomes silly
because every app and library that uses any string library at all
(basically all of them) will be doubled.  2 is pointless because you
lose any efficiency gain you might have had from making the types
disjoint.  1 is a huge amount of programmer work and is just asking
for subtle bugs, and with frequent conversions you may again be losing
your performance gain, especially if we compute the length for the
boxed utf-8 strings.

Now, if we keep the string types unified and just make the procedures
used to access them disjoint (normal vs. utf-8) then the problems go
away.  Look at any of the string based eggs we have currently: ssax,
htmlprag, csv, url, etc.  None of them would see the overrides we are
providing, but because they do parsing based on ASCII delimiters it
doesn't matter - they will parse and return utf-8 strings correctly,
and no type conversions in or out are needed.  The only time we have
to modify a library to be utf-8 aware is when it is concerned with
string lengths or indexes.  Skimming the current eggs, it doesn't look
like any are affected.
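
To make the point above concrete, here is a minimal sketch in plain
R5RS Scheme.  It assumes strings are raw byte sequences, and the names
utf8-length and split-on are hypothetical, made up for this
illustration rather than taken from the proposed utf8 code.  Splitting
on an ASCII delimiter is safe because every byte of a multi-byte utf-8
sequence is #x80 or above, so a delimiter byte can never appear inside
one.  Only the length differs between the byte view and the utf-8 view.

```scheme
;; Assumption: a string is a sequence of bytes, one byte per "char".
;; utf8-length and split-on are hypothetical helper names.

;; A utf-8 continuation byte has the form 10xxxxxx, i.e. #x80..#xBF.
(define (continuation-byte? c)
  (let ((b (char->integer c)))
    (and (>= b #x80) (<= b #xBF))))

;; Count code points by counting every byte that is not a continuation byte.
(define (utf8-length s)
  (let loop ((i 0) (n 0))
    (cond ((= i (string-length s)) n)
          ((continuation-byte? (string-ref s i)) (loop (+ i 1) n))
          (else (loop (+ i 1) (+ n 1))))))

;; Split on an ASCII delimiter, byte by byte, with no utf-8 awareness.
(define (split-on s delim)
  (let loop ((i 0) (start 0) (acc '()))
    (cond ((= i (string-length s))
           (reverse (cons (substring s start i) acc)))
          ((char=? (string-ref s i) delim)
           (loop (+ i 1) (+ i 1) (cons (substring s start i) acc)))
          (else (loop (+ i 1) start acc)))))

;; "cafe" with an acute accent on the e, spelled out as utf-8 bytes
;; (#xC3 #xA9) so the example does not depend on the source encoding.
(define cafe
  (string #\c #\a #\f (integer->char #xC3) (integer->char #xA9)))

(define record (string-append "name," cafe ",x"))
(define fields (split-on record #\,))

(string-length cafe)   ; => 5  (bytes)
(utf8-length cafe)     ; => 4  (code points)
(cadr fields)          ; byte-for-byte equal to cafe
```

The delimiter-based split never needed to know about utf-8, while a
caller that cares about the number of characters has to pick
utf8-length over string-length.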

> I just hope that one day I don't have to debug klingon source code... ;-)

Ah... sorry about the poor readability of the code, I just threw that
together in a couple hours to see if there was any interest in it.
I'll clean it up :)

-- 
Alex


