
[bug-gnu-libiconv] Possible CP932 conversions bug


From: Maxim Kouznetsov
Subject: [bug-gnu-libiconv] Possible CP932 conversions bug
Date: Mon, 12 Dec 2016 23:44:36 +0000

Hello,

 

While testing codepage conversions, I came across the following discrepancy: when converting from CP932 to UTF-16, certain characters are converted to different Unicode code points on Linux (using iconv) and on Mac (using libiconv). Looking at some CP932-to-Unicode tables online, it appears that the Linux conversions are consistent with those tables, while libiconv produces visually similar characters whose code points differ from the ones in the aforementioned tables.

 

As far as I can tell, the issue happens with the following characters:

 

CP932  0x8160 -> Output [0x301C], Expected [0xFF5E]  // Wavy dash
CP932  0x8161 -> Output [0x2016], Expected [0x2225]  // Vertical double line
CP932  0x817C -> Output [0x2212], Expected [0xFF0D]  // A dash
CP932  0x8191 -> Output [0x00A2], Expected [0xFFE0]  // Cent sign
CP932  0x8192 -> Output [0x00A3], Expected [0xFFE1]  // Pound sign
CP932  0x81CA -> Output [0x00AC], Expected [0xFFE2]  // Logical "not" sign

 

To make the problem easy to reproduce for individual characters, I used the following test program
(exactly the same .cpp file, compiled and run on Ubuntu 16.10 and on El Capitan, produces two different outputs):

 

#include <stdio.h>
#include <iconv.h>
#include <errno.h>

void printerror()
{
    switch (errno)
    {
    case EILSEQ:
        printf("Invalid multibyte sequence in input\n");
        break;
    case EINVAL:
        printf("Incomplete multibyte sequence in input\n");
        break;
    case E2BIG:
        printf("Output buffer is out of room\n");
        break;
    default:
        printf("Generic Error\n");
        break;
    }
}

//
// Testing conversion of a CP932 character (0x8160) to UTF16 (Little Endian)
//
int main()
{
    const int SRCBYTES = 3;
    const int OUTBYTES = 4;

    iconv_t conv = iconv_open("UTF-16LE", "CP932");

    char a_source[SRCBYTES] = {};
    char a_output[OUTBYTES] = {};

    size_t i_sourcelen = SRCBYTES;
    size_t i_outputlen = OUTBYTES;

    // CP932 character 0x8160 - FULLWIDTH TILDE
    a_source[0] = 0x81;
    a_source[1] = 0x60;
    a_source[2] = 0;

    char* p_source = &a_source[0];
    char* p_output = &a_output[0];

    int* p_codepoint = (int*)p_output;
    // iconv() returns (size_t)-1 on error, so keep the result as size_t
    size_t ret = iconv(conv, &p_source, &i_sourcelen, &p_output, &i_outputlen);

    if (ret != (size_t)-1)
    {
        printf("srcbytes [%zu]\noutbytes [%zu]\n", i_sourcelen, i_outputlen);

        // Final output should be 0xFF5E according to various tables, such as
        // http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
        printf("UTF16 code point [0x%X]\n", *p_codepoint);
    }
    else
    {
        printerror();
    }

    iconv_close(conv);

    return 0;
}

 

Please let me know if this is expected behavior due to some factors I’m not aware of.

Thank you for your consideration.

 

Maxim Kouznetsov

Computer Scientist | Simba Technologies Inc. | A Magnitude Software Company
address@hidden

 

