Re: Wide string strategies

From: Mike Gran
Subject: Re: Wide string strategies
Date: Fri, 10 Apr 2009 10:14:00 -0700 (PDT)

> From: Ludovic Courtès <address@hidden>
> Mike Gran writes:
> > On Thu, 2009-04-09 at 22:25 +0200, Ludovic Courtès wrote: 

> Actually, for the file system interface, for instance, it's even
> trickier: the encoding of file names usually isn't specified, but some
> apps/libraries have their opinion on that, e.g., Glib.
> We should probably follow their lead here, but that's a secondary
> problem anyway.

True.  The one real standard that I do know of is that NTFS requires UTF-16.

> > Also, the interaction between strings and sockets needs more thought.
> > If sendto and recvfrom are used for datagram transmission, as it
> > suggests in their docstrings, then locale string conversion could be a
> > bad idea.  (And, these functions should also operate on u8vectors, but
> > that's another issue.)
> Agreed.
> > To be more general, I know some apps depend on 8-bit strings and use
> > them as storage of non-string binary data.
> Yes, notably because of `sendto' et al. that take a string.
> > I think SND falls into this
> > category.  I wonder if ultimately wide strings would have to be a
> > run-time option that is off by default.  But I am (choose your English
> > idiom here) getting ahead of myself, or jumping the gun, or putting the
> > cart before the horse.
> I don't have any idea of how we could usefully handle that.
> Eventually, it may be a good idea to deprecate `(sendto "foobar")' in
> favor of a variant that takes a bytevector or some such.

Maybe it's best to leave them unchanged w.r.t. strings.  Any char values
between 128 and 255 would just be interpreted as if they were UCS-4
characters 128 to 255 and get put in the strings directly.
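That widening is trivial precisely because Unicode's first 256 codepoints
coincide with ISO-8859-1.  A minimal sketch (the function name is
hypothetical, not part of Guile's API):

```c
#include <stdint.h>
#include <stddef.h>

/* Widen an 8-bit string into UCS-4 codepoints.  Bytes 128-255 map
   directly to codepoints U+0080-U+00FF; no table lookup is needed,
   since Unicode's first 256 codepoints match ISO-8859-1.  */
static void
widen_latin1 (const unsigned char *src, uint32_t *dst, size_t len)
{
  for (size_t i = 0; i < len; i++)
    dst[i] = (uint32_t) src[i];
}
```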

In the short term, socket functions could also be modified
to accept both strings and u8vectors.  Then, if someone were actually
pushing UTF-8 strings over the network, they could use
"utf8-encoded-u8vector->string" or some such to do the conversion.

And, in the long run, sockets can become a type of port, and those
ports can have attached transcoding.

> >> > +SCM_INTERNAL int scm_i_string_ref_eq_int (SCM str, size_t x, int c);
> >> 
> >> Does it assume sizeof (int) >= 32 ?
> >
> > I suppose it does.  But, I only used it to compare to the output of
> > scm_getc which also returns an int.
> I meant, is the intent that C contains a codepoint?

Yes.  And when wide strings are implemented, the gnulib convention is
that a wide character is represented in C as a uint32_t.

> >> > +SCM_INTERNAL char *scm_i_string_to_write_sz (SCM str);
> >> > +SCM_INTERNAL scm_t_uint8 *scm_i_string_to_u8sz (SCM str);
> >> > +SCM_INTERNAL SCM scm_i_string_from_u8sz (const scm_t_uint8 *str);
> >> > +SCM_INTERNAL const char *scm_i_string_to_failsafe_ascii_sz (SCM str);
> >> > +SCM_INTERNAL const char *scm_i_symbol_to_failsafe_ascii_sz (SCM str);

> How about:
>   SCM scm_i_from_ascii_string (const scm_t_uint8 *str);
> and similar?


> >> 
> >> > +/* For ASCII strings, SUB can be used to represent an invalid
> >> > +  character.  */
> >> > +#define SCM_SUB ('\x1A')
> >> 
> >> Why SUB?  How about `SCM_I_SUB_CHAR', `SCM_I_INVALID_ASCII_CHAR' or
> >> similar?
> >
> > If you're asking why SUB is set to 0x1A, the standard ECMA-48 says 0x1A
> > should be used to indicate an invalid ASCII character.
> I suspected that.  Then `SCM_I_SUB_CHAR' may be a good name, perhaps
> with a comment saying that this is the "official SUB character".



