bug-gnu-libiconv



From: Hadrien Dussuel
Subject: [bug-gnu-libiconv] Need help on UTF-8 to asian conversion (cp92, 936, 949, 950)
Date: Wed, 25 Feb 2015 14:35:46 +0100

Hi!

I'm a developer of a mod for Civilization IV. We have used the iconv tables to open the game to new languages: the game now reads UTF-8 from XML files and converts it to Windows code pages, which lets it run in every single-byte language.

Now we are implementing support for Asian languages, but I'm quite lost with the conversion function. Asian characters are often encoded on 3 bytes in UTF-8, and the game reads each byte as a separate char. I'm trying to adapt the iconv function to gather the three chars (to make a wchar) and work out the Asian character.

Let's take an example with a Korean string: 아브라함 (UTF-8).
The game reads ì•„ë¸Œë ¼í•¨.
The UTF-8 byte sequences are:
아 : EC 95 84
브 : EB B8 8C
라 : EB 9D BC
함 : ED 95 A8

The string is read byte by byte as follows: ì•„ë¸Œë ¼í•¨
ì : EC
• : 95
„ : 84
ë : EB
¸ : B8
Œ : 8C
ë : EB
9D (unprintable)
¼ : BC
í : ED
• : 95
¨ : A8
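
For reference, the three bytes of each character above combine into a single Unicode code point following the standard UTF-8 three-byte layout (1110xxxx 10xxxxxx 10xxxxxx). Below is a minimal sketch in C. decode_utf8_3 is a hypothetical helper written for this illustration, not a libiconv function, and it only handles the three-byte case:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of libiconv): decode one 3-byte UTF-8
   sequence starting at s into a Unicode code point.
   Returns 3 on success, 0 if fewer than 3 bytes are available,
   -1 if the bytes are not a valid 3-byte sequence. */
int decode_utf8_3(const unsigned char *s, size_t n, uint32_t *cp)
{
    uint32_t v;

    if (n < 3)
        return 0;
    /* Lead byte must be 1110xxxx, both trail bytes must be 10xxxxxx. */
    if ((s[0] & 0xF0) != 0xE0 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
        return -1;
    /* Keep 4 + 6 + 6 payload bits and concatenate them. */
    v = ((uint32_t)(s[0] & 0x0F) << 12)
      | ((uint32_t)(s[1] & 0x3F) << 6)
      |  (uint32_t)(s[2] & 0x3F);
    /* Reject overlong encodings and UTF-16 surrogates. */
    if (v < 0x800 || (v >= 0xD800 && v <= 0xDFFF))
        return -1;
    *cp = v;
    return 3;
}
```

With the bytes listed above, EC 95 84 decodes to U+C544 (아) and ED 95 A8 to U+D568 (함).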

Now, here is the original iconv function:
static int
cp949_mbtowc (conv_t conv, ucs4_t *pwc, const unsigned char *s, int n)
{
  unsigned char c = *s;
  /* Code set 0 (ASCII) */
  if (c < 0x80)
    return ascii_mbtowc(conv,pwc,s,n);
  /* UHC part 1 */
  if (c >= 0x81 && c <= 0xa0)
    return uhc_1_mbtowc(conv,pwc,s,n);
  if (c >= 0xa1 && c < 0xff) {
    if (n < 2)
      return RET_TOOFEW(0);
    {
      unsigned char c2 = s[1];
      if (c2 < 0xa1)
        /* UHC part 2 */
        return uhc_2_mbtowc(conv,pwc,s,n);
      else if (c2 < 0xff && !(c == 0xa2 && c2 == 0xe8)) {
        /* Code set 1 (KS C 5601-1992, now KS X 1001:1998) */
        unsigned char buf[2];
        int ret;
        buf[0] = c-0x80; buf[1] = c2-0x80;
        ret = ksc5601_mbtowc(conv,pwc,buf,2);
        if (ret != RET_ILSEQ)
          return ret;
        /* User-defined characters */
        if (c == 0xc9) {
          *pwc = 0xe000 + (c2 - 0xa1);
          return 2;
        }
        if (c == 0xfe) {
          *pwc = 0xe05e + (c2 - 0xa1);
          return 2;
        }
      }
    }
  }
  return RET_ILSEQ;
}

It only expects 2 chars/bytes and we have 3, so I don't understand what I should do to turn the multibyte sequence into a wchar. For example, how should EC 95 84 be converted? The first byte still matters, since EB 95 84 and ED 95 84 are also characters...
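
One possible direction, sketched under assumptions: in libiconv the *_mbtowc functions decode bytes of the named encoding into a Unicode code point, so cp949_mbtowc expects CP949 input rather than UTF-8. Going from UTF-8 to CP949 can instead be done through the public iconv() API. utf8_to_cp949 below is a hypothetical wrapper written for this illustration, and it assumes the installed iconv recognizes the encoding name "CP949":

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical wrapper around the public iconv API.
   Converts in_len bytes of UTF-8 from `in` into CP949 bytes in `out`.
   Returns the number of bytes written, -1 if no UTF-8 -> CP949
   converter is available, -2 if the conversion itself fails
   (invalid input, unmappable character, or output buffer too small). */
int utf8_to_cp949(const char *in, size_t in_len, char *out, size_t out_cap)
{
    iconv_t cd;
    char *src = (char *)in;   /* iconv() takes char **, input is not modified */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_cap;
    size_t rc;

    cd = iconv_open("CP949", "UTF-8");   /* arguments: tocode, fromcode */
    if (cd == (iconv_t)-1)
        return -1;

    rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return -2;
    return (int)(out_cap - dst_left);
}
```

The game would pass the whole UTF-8 string from the XML file, not single bytes, and receive the CP949 bytes in the output buffer; each Hangul syllable takes two bytes in CP949.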

Thank you for your help,
Hadrien.
