Re: about strings, symbols and chars.

From: Jim Blandy
Subject: Re: about strings, symbols and chars.
Date: 19 Dec 2000 17:21:36 -0500

Dirk Herrmann <address@hidden> writes:
> On 29 Nov 2000, Jim Blandy wrote:
> > In August 1999 I set up a plan to add multilingual text support to
> > Guile.  The plan was intended to allow Guile to arbitrarily mix text
> > from different languages in strings.
> For the curious:  the proposal can be found under:
>   guile-doc, ref/mbapi.texi
> BTW:  I don't quite see the problems about using several different
> fixed-width encodings instead of a multibyte encoding.  It is claimed that
> "with n different fixed-string encodings, users would have to write n
> versions of any code that manipulates strings directly".  I don't
> understand this claim, or, stated differently, I don't see why it
> shouldn't be possible to have a generic API working on different
> fixed-width encodings?  In the following example I assume that the
> encoding width is stored in the most significant bits of the string
> object's type cell and that only widths up to 4 are possible.  The strings
> in the example have a maximum length of 2^22 characters.  This is just an
> example.  A different string object layout can be chosen, where there are
> no such restrictions.
> #define SCM_STRING_LENGTH(s) ((SCM_CELL_WORD_0 (s) >> 8) & 0x3fffff)
> #define SCM_STRING_ENCODING_WIDTH(s) (((SCM_CELL_WORD_0 (s) >> 30) & 3) + 1)
> #define SCM_STRING_BASE(s) ((unsigned char *) SCM_CELL_WORD_1 (s))
> #define SCM_CHAR_GET(p, w) \
>   (w == 1 ? (unsigned long int) (* (unsigned char *) p) \
>           : (w == 2 ? (unsigned long int) (* (unsigned short *) p) \
>                     : (* (unsigned long int *) p)))
> SCM compare (SCM str1, SCM str2)
> {
>     unsigned long int length1 = SCM_STRING_LENGTH (str1);
>     unsigned long int length2 = SCM_STRING_LENGTH (str2);
>     unsigned char *base1 = SCM_STRING_BASE (str1);
>     unsigned char *base2 = SCM_STRING_BASE (str2);
>     unsigned int width1 = SCM_STRING_ENCODING_WIDTH (str1);
>     unsigned int width2 = SCM_STRING_ENCODING_WIDTH (str2);
>     unsigned long int i;
>     if (length1 != length2) return SCM_BOOL_F;
>     for (i = 0; i != length1; ++i)
>       {
>         scm_char_t c1 = SCM_CHAR_GET (base1, width1);
>         scm_char_t c2 = SCM_CHAR_GET (base2, width2);
>         if (c1 != c2) return SCM_BOOL_F;
>         base1 += width1;
>         base2 += width2;
>       }
>     return SCM_BOOL_T;
> }

There are two issues I'm concerned with.

- I don't think something like SCM_CHAR_GET will be fast enough.

  The whole point of mbapi.texi is to give the programmer enough
  information to directly hack on the bytes themselves.  People will
  be writing Boyer-Moore searches, regexp engines, conversions to
  other encodings, etc.  When you've got loops that touch every
  character you process, you've got to haul the conditionals up out of
  the central loop, so SCM_CHAR_GET isn't good enough.

  Certainly, scm_mb_get will be as slow as, or slower than,
  SCM_CHAR_GET.  It's there for the cases where people need simplicity
  over performance.

- One of the hardest parts about multilingualization is actually
  getting people to do it.

  There is an enormous temptation for C programmers to simply declare,
  "My code only works on 8-bit strings" and get back to work on
  whatever they really wanted to code.  All it takes is one module
  along a datapath to be negligent in this regard to make the whole
  system effectively unilingual.  Programmers are likely to ignore a
  mixed-representation system.  Guile originally had such a string
  type, and, in fact, people ignored it.  Guile was full of functions
  that only supported the 8-bit variant.  It was useless.

  It is my feeling that a single representation which allows direct
  access to the bytes and can almost always be handled exactly like
  ASCII data is more likely to be an acceptable burden.
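
To make the first point concrete, here is a minimal sketch of what
"hauling the conditionals up out of the central loop" could look like
for fixed-width strings.  Nothing in it is Guile API: struct fw_string,
fw_get, count_slow, count_fast and the DEFINE_COUNT macro are all
hypothetical names invented for this illustration, and widths of 1, 2
and 4 bytes are assumed.  count_slow tests the width for every
character, much as SCM_CHAR_GET does; count_fast tests it once and then
runs a loop specialized for that width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-width string, for illustration only.
   `width` is the number of bytes per character: 1, 2 or 4. */
struct fw_string {
    const void *base;   /* points to uint8_t, uint16_t or uint32_t data */
    size_t length;      /* length in characters, not bytes */
    unsigned width;
};

/* Per-character dispatch in the style of SCM_CHAR_GET:
   the width is examined on every single call. */
static uint32_t fw_get(const struct fw_string *s, size_t i)
{
    switch (s->width) {
    case 1:  return ((const uint8_t *) s->base)[i];
    case 2:  return ((const uint16_t *) s->base)[i];
    default: return ((const uint32_t *) s->base)[i];
    }
}

/* Count occurrences of c; the width test sits inside the loop. */
size_t count_slow(const struct fw_string *s, uint32_t c)
{
    size_t n = 0;
    for (size_t i = 0; i < s->length; i++)
        if (fw_get(s, i) == c)
            n++;
    return n;
}

/* Stamp out one tight loop per element type. */
#define DEFINE_COUNT(name, type)                              \
    static size_t name(const type *p, size_t len, uint32_t c) \
    {                                                         \
        size_t n = 0;                                         \
        for (size_t i = 0; i < len; i++)                      \
            if (p[i] == c)                                    \
                n++;                                          \
        return n;                                             \
    }

DEFINE_COUNT(count8, uint8_t)
DEFINE_COUNT(count16, uint16_t)
DEFINE_COUNT(count32, uint32_t)

/* Same result as count_slow, but the width is tested once,
   outside the loop that touches every character. */
size_t count_fast(const struct fw_string *s, uint32_t c)
{
    switch (s->width) {
    case 1: return count8(s->base, s->length, c);
    case 2: return count16(s->base, s->length, c);
    case 4: return count32(s->base, s->length, c);
    default:
        assert(!"unsupported width in this sketch");
        return 0;
    }
}
```

Note that the hoisted form needs one copy of the inner loop per width.
Here a macro hides the duplication, but for something like a
Boyer-Moore search or a regexp engine each copy is a substantial piece
of code, which is one way to read the "n versions" claim quoted above.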
