bug-bash

From: John Kearney
Subject: Can somebody explain to me what u32tochar in /lib/sh/unicode.c is trying to do?
Date: Sun, 19 Feb 2012 23:07:45 +0100
User-agent: Mozilla/5.0 (X11; Linux i686; rv:10.0) Gecko/20120129 Thunderbird/10.0


Can somebody explain to me what u32tochar is trying to do?

It seems like dangerous code?

From the context, I'm guessing it is making a Hail Mary pass at
converting UTF-32 to a multibyte (mb) encoding other than UTF-8:


int
u32tochar (x, s)
     unsigned long x;
     char *s;
{
  int l;

  l = (x <= UCHAR_MAX) ? 1 : ((x <= USHORT_MAX) ? 2 : 4);

  if (x <= UCHAR_MAX)
    s[0] = x & 0xFF;
  else if (x <= USHORT_MAX)     /* assume unsigned short = 16 bits */
    {
      s[0] = (x >> 8) & 0xFF;
      s[1] = x & 0xFF;
    }
  else
    {
      s[0] = (x >> 24) & 0xFF;
      s[1] = (x >> 16) & 0xFF;
      s[2] = (x >> 8) & 0xFF;
      s[3] = x & 0xFF;
    }
  /* s[l] = '\0';  Overwrite Buffer?*/
  return l;
}

There are a couple of problems with that, though.
Firstly, UTF-32 doesn't map directly onto non-UTF multibyte locales, so
you need a translation mechanism.
Secondly, typical CJK encodings are state-based, so multibyte
sequences need to be escaped. Extended Unix Code (EUC) would need
encoding somewhat like UTF-8; in fact, any variable-length multibyte
encoding is going to need some context to recover the information. As
written, the output is unparsable.
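
The translation mechanism mentioned above could be sketched with POSIX
iconv. This is only an illustration under assumptions, not what bash
does: utf32_to_charset is a hypothetical helper name, and it assumes
the platform's iconv knows the "UTF-32BE" charset name.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper (not part of bash): convert one UTF-32 code point
   to the charset named by tocode.  Returns the number of bytes written
   to out, or -1 if the charset is unknown, the code point cannot be
   represented, or out is too small. */
int
utf32_to_charset (unsigned long cp, const char *tocode,
                  char *out, size_t outlen)
{
  char in[4];
  char *inp = in, *outp = out;
  size_t inleft = sizeof in, outleft = outlen;
  iconv_t cd;
  int ok;

  /* Serialize the code point as UTF-32 big endian for iconv. */
  in[0] = (char) ((cp >> 24) & 0xFF);
  in[1] = (char) ((cp >> 16) & 0xFF);
  in[2] = (char) ((cp >> 8) & 0xFF);
  in[3] = (char) (cp & 0xFF);

  cd = iconv_open (tocode, "UTF-32BE");
  if (cd == (iconv_t) -1)
    return -1;

  /* Convert, then flush so stateful encodings emit any sequence needed
     to return to the initial shift state. */
  ok = iconv (cd, &inp, &inleft, &outp, &outleft) != (size_t) -1
       && iconv (cd, NULL, NULL, &outp, &outleft) != (size_t) -1;

  iconv_close (cd);
  return ok ? (int) (outlen - outleft) : -1;
}
```

In a shell the target name would presumably come from the locale, e.g.
nl_langinfo (CODESET) after setlocale (LC_CTYPE, ""), rather than from
the size of the value.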

What it is actually doing is taking a UTF-32 value and, depending on
its size, encoding it as UTF-32 big-endian, UTF-16 big-endian, UTF-8, or
an American extended-ASCII code page (values 0x80 - 0xff). Choosing
between these, however, should depend on LC_CTYPE, not on some
arbitrary size check.
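
To make the size-based behaviour concrete, here is a small sketch that
restates the quoted logic with an ANSI prototype so it can be compiled
and poked at on its own. The name u32tochar_bytes is made up for this
example, and USHRT_MAX from <limits.h> stands in for the USHORT_MAX
macro used in the quoted code.

```c
#include <limits.h>

/* Hypothetical restatement of the quoted logic, for experimenting only.
   Writes 1, 2 or 4 raw big-endian bytes depending purely on the size of
   x, and returns how many bytes were written. */
int
u32tochar_bytes (unsigned long x, char *s)
{
  if (x <= UCHAR_MAX)
    {
      s[0] = (char) (x & 0xFF);
      return 1;
    }
  if (x <= USHRT_MAX)           /* assumes unsigned short = 16 bits */
    {
      s[0] = (char) ((x >> 8) & 0xFF);
      s[1] = (char) (x & 0xFF);
      return 2;
    }
  s[0] = (char) ((x >> 24) & 0xFF);
  s[1] = (char) ((x >> 16) & 0xFF);
  s[2] = (char) ((x >> 8) & 0xFF);
  s[3] = (char) (x & 0xFF);
  return 4;
}

/* Bytes produced for a few code points:
     U+0041  -> 41            (plain ASCII)
     U+00E9  -> E9            (single byte, meaning depends on code page)
     U+20AC  -> 20 AC         (the UTF-16 big-endian code unit)
     U+1F600 -> 00 01 F6 00   (UTF-32 big endian, leading NUL byte)  */
```

None of these outputs depends on the locale, and the four-byte case
starts with a NUL byte, which is awkward for anything that treats the
result as a C string.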

So this function just seems plain crazy?
I think that all it can safely do is this:
int
utf32tomb (x, s)
     unsigned long x;
     char *s;
{
  if (x <= 0x7f)        /* x >= 0x80 is locale-specific */
    {
      s[0] = x & 0xFF;
      return 1;
    }
  else
    return 0;
}



Regarding the naming convention: u32 reads as "unsigned 32 bit", so it
might be a good idea to rename all the u32 functions to utf32. I think
that would save a lot of confusion in the code as to what is going on.



