Re: about strings, symbols and chars.
From: Jim Blandy
Subject: Re: about strings, symbols and chars.
Date: 19 Dec 2000 17:21:36 -0500
Dirk Herrmann <address@hidden> writes:
> On 29 Nov 2000, Jim Blandy wrote:
> > In August 1999 I set up a plan to add multilingual text support to
> > Guile. The plan was intended to allow Guile to arbitrarily mix text
> > from different languages in strings.
>
> For the curious: the proposal can be found under:
> guile-doc, ref/mbapi.texi
>
> BTW: I don't quite see the problems about using several different
> fixed-width encodings instead of a multibyte encoding. It is claimed that
> "with n different fixed-string encodings, users would have to write n
> versions of any code that manipulates strings directly". I don't
> understand this claim, or, stated differently, I don't see why it
> shouldn't be possible to have a generic API working on different
> fixed-width encodings? In the following example I assume that the
> encoding width is stored in the most significant bits of the string
> object's type cell and that only widths up to 4 are possible. The strings
> in the example have a maximum length of 2^22 characters. This is just an
> example. A different string object layout can be chosen, where there are
> no such restrictions.
>
> #define SCM_STRING_LENGTH(s) ((SCM_CELL_WORD_0 (s) >> 8) & 0x3fffff)
> #define SCM_STRING_ENCODING_WIDTH(s) (((SCM_CELL_WORD_0 (s) >> 30) & 3) + 1)
> #define SCM_STRING_BASE(s) ((unsigned char *) SCM_CELL_WORD_1 (s))
> #define SCM_CHAR_GET(p, w) \
>   ((w) == 1 ? (unsigned long int) (* (unsigned char *) (p)) \
>    : ((w) == 2 ? (unsigned long int) (* (unsigned short *) (p)) \
>       : (* (unsigned long int *) (p))))
>
> SCM
> compare (SCM str1, SCM str2)
> {
>   unsigned long int length1 = SCM_STRING_LENGTH (str1);
>   unsigned long int length2 = SCM_STRING_LENGTH (str2);
>   unsigned char *base1 = SCM_STRING_BASE (str1);
>   unsigned char *base2 = SCM_STRING_BASE (str2);
>   unsigned int width1 = SCM_STRING_ENCODING_WIDTH (str1);
>   unsigned int width2 = SCM_STRING_ENCODING_WIDTH (str2);
>   unsigned long int i;
>
>   if (length1 != length2) return SCM_BOOL_F;
>   for (i = 0; i != length1; ++i)
>     {
>       /* Each string is decoded with its own width. */
>       scm_char_t c1 = SCM_CHAR_GET (base1, width1);
>       scm_char_t c2 = SCM_CHAR_GET (base2, width2);
>       if (c1 != c2) return SCM_BOOL_F;
>       base1 += width1;
>       base2 += width2;
>     }
>   return SCM_BOOL_T;
> }
There are two issues I'm concerned with.

- I don't think something like SCM_CHAR_GET will be fast enough.
  The whole point of mbapi.texi is to give the programmer enough
  information to directly hack on the bytes themselves. People will
  be writing Boyer-Moore searches, regexp engines, conversions to
  other encodings, etc. When you've got loops that touch every
  character you process, you've got to haul the conditionals up out of
  the central loop, so SCM_CHAR_GET isn't good enough.

  Certainly, scm_mb_get will be as slow as, or slower than,
  SCM_CHAR_GET. It's there for the cases where people need simplicity
  over performance.

- One of the hardest parts about multilingualization is actually
  getting people to do it.

  There is an enormous temptation for C programmers to simply declare,
  "My code only works on 8-bit strings" and get back to work on
  whatever they really wanted to code. All it takes is one module
  along a datapath to be negligent in this regard to make the whole
  system effectively unilingual. Programmers are likely to ignore a
  mixed-representation system. Guile originally had such a string
  type, and, in fact, people ignored it. Guile was full of functions
  that only supported the 8-bit variant. It was useless.

It is my feeling that a single representation which allows direct
access to the bytes and can almost always be handled exactly like
ASCII data is more likely to be an acceptable burden.
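
To make the first concern concrete, here is a minimal sketch of what hauling the width test out of the central loop might look like for a mixed fixed-width scheme. It does not use Guile's types or macros: fw_string, fw_get, fw_equal_naive, fw_equal_hoisted and the eq_* helpers are hypothetical names, and the layout (a raw buffer plus a width of 1, 2, or 4 bytes) is an assumption made only for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-width string; NOT Guile's actual object layout. */
typedef struct
{
  const void *base;  /* character data, suitably aligned for `width' */
  size_t length;     /* length in characters */
  unsigned width;    /* bytes per character: 1, 2, or 4 (assumed) */
} fw_string;

/* Per-character dispatch, in the spirit of SCM_CHAR_GET. */
static uint32_t
fw_get (const fw_string *s, size_t i)
{
  switch (s->width)
    {
    case 1:  return ((const uint8_t *) s->base)[i];
    case 2:  return ((const uint16_t *) s->base)[i];
    default: return ((const uint32_t *) s->base)[i];
    }
}

/* Naive version: the width test runs for every character. */
static int
fw_equal_naive (const fw_string *a, const fw_string *b)
{
  size_t i;

  if (a->length != b->length)
    return 0;
  for (i = 0; i != a->length; ++i)
    if (fw_get (a, i) != fw_get (b, i))
      return 0;
  return 1;
}

/* One specialized inner loop per pair of element types. */
#define DEFINE_EQ(name, ta, tb)                               \
  static int                                                  \
  name (const void *pa, const void *pb, size_t n)             \
  {                                                           \
    const ta *a = (const ta *) pa;                            \
    const tb *b = (const tb *) pb;                            \
    size_t i;                                                 \
    for (i = 0; i != n; ++i)                                  \
      if ((uint32_t) a[i] != (uint32_t) b[i])                 \
        return 0;                                             \
    return 1;                                                 \
  }

DEFINE_EQ (eq_1_1, uint8_t, uint8_t)
DEFINE_EQ (eq_1_2, uint8_t, uint16_t)
DEFINE_EQ (eq_1_4, uint8_t, uint32_t)
DEFINE_EQ (eq_2_1, uint16_t, uint8_t)
DEFINE_EQ (eq_2_2, uint16_t, uint16_t)
DEFINE_EQ (eq_2_4, uint16_t, uint32_t)
DEFINE_EQ (eq_4_1, uint32_t, uint8_t)
DEFINE_EQ (eq_4_2, uint32_t, uint16_t)
DEFINE_EQ (eq_4_4, uint32_t, uint32_t)

typedef int (*eq_fn) (const void *, const void *, size_t);

static const eq_fn eq_table[3][3] = {
  { eq_1_1, eq_1_2, eq_1_4 },
  { eq_2_1, eq_2_2, eq_2_4 },
  { eq_4_1, eq_4_2, eq_4_4 }
};

/* Map a width of 1, 2, or 4 to a table index of 0, 1, or 2. */
static unsigned
width_index (unsigned width)
{
  return width == 1 ? 0 : (width == 2 ? 1 : 2);
}

/* Hoisted version: choose the loop once, then run it without any
   per-character width test. */
static int
fw_equal_hoisted (const fw_string *a, const fw_string *b)
{
  if (a->length != b->length)
    return 0;
  return eq_table[width_index (a->width)][width_index (b->width)]
    (a->base, b->base, a->length);
}
```

Under these assumptions, equality alone needs nine inner loops for three widths, and a Boyer-Moore search or a regexp engine would presumably need the same treatment. That duplication is one way to read the "n versions" remark quoted above.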