
From: Arun Giridhar
Subject: [Octave-bug-tracker] [bug #63139] About string support for Chinese, Japanese and Korean characters
Date: Fri, 30 Sep 2022 08:57:52 -0400 (EDT)

Follow-up Comment #1, bug #63139 (project octave):

This is not specific to CJK text; it comes from how Octave represents Unicode
text internally. AFAIK Octave uses UTF-8, meaning that a string of Unicode
characters becomes a byte stream and each char element holds one byte rather
than one character. Each of your two Chinese characters is represented in
24 bits (3 bytes) of UTF-8, of which 16 bits are content and 8 bits are preset
values, as described here: https://en.wikipedia.org/wiki/UTF-8#Encoding
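
To make the 16 + 8 bit split concrete, here is a minimal sketch that decodes one
3-byte sequence by hand. The byte values are the ones shown for '你' in the
transcript below; the variable names b and cp are only illustrative, and the
code assumes a well-formed 3-byte sequence (no general validation).

```octave
% Sketch: decode a single 3-byte UTF-8 sequence into its code point.
% Bytes for '你' (E4 BD A0), written out so the example does not
% depend on the encoding of the source file.
b = [228 189 160];

% A 3-byte sequence has the layout 1110xxxx 10xxxxxx 10xxxxxx.
% The fixed prefix bits (4 + 2 + 2 = 8) are the "preset values".
assert (bitand (b(1), 240) == 224);          % lead byte starts with 1110
assert (all (bitand (b(2:3), 192) == 128));  % continuation bytes start with 10

% The remaining 4 + 6 + 6 = 16 bits are the content.
cp = bitand (b(1), 15) * 4096 + bitand (b(2), 63) * 64 + bitand (b(3), 63);
printf ("U+%04X\n", cp);                     % prints U+4F60
```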

You can verify the encoding is correct UTF-8 with these commands:

>> foo = dec2bin ('你' + 0, 8)'(:)'
foo = 111001001011110110100000
>> foo = dec2bin ('好' + 0, 8)'(:)'
foo = 111001011010010110111101
>> foo = dec2bin ('你好' + 0, 8)'(:)'
foo = 111001001011110110100000111001011010010110111101

>> foo = '你好'
>> whos foo
Variables visible from the current scope:

variables in scope: top scope

  Attr   Name        Size                     Bytes  Class
  ====   ====        ====                     =====  ===== 
         foo         1x6                          6  char
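
The 1x6 size above counts bytes, not characters. As a rough sketch, assuming the
string holds well-formed UTF-8, the number of characters can be recovered by
counting the bytes that are not continuation bytes (those of the form
10xxxxxx). The names s, is_cont, n_bytes and n_chars are only illustrative.

```octave
% Sketch: count bytes vs. UTF-8 characters in a char array.
% Bytes of '你好' (E4 BD A0 E5 A5 BD), written out explicitly so the
% example does not depend on the encoding of the source file.
s = char ([228 189 160 229 165 189]);

n_bytes = numel (s);                        % what size/whos report

% Continuation bytes have the top two bits set to 10.
is_cont = bitand (double (s), 192) == 128;
n_chars = sum (~is_cont);                   % one lead byte per character

printf ("%d bytes, %d characters\n", n_bytes, n_chars);
% prints: 6 bytes, 2 characters
```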

What does Matlab return for the same commands? That will tell you what encoding
it uses internally.

As far as I can see, this is not a bug, unless you get wrong results as a
consequence of different Unicode representations.

