Re: Wide and UTF-8 international characters
From: Thomas Dickey
Subject: Re: Wide and UTF-8 international characters
Date: Sat, 17 May 2003 19:10:23 -0400
User-agent: Mutt/1.2.5i
On Sat, May 17, 2003 at 04:25:21PM -0600, D. Stimits wrote:
> So it sounds like the 8th bit is no longer used as a flag...is that
> correct? But also that 1 or more bytes are then added with each
> character cell to provide attribute data...is that correct?
yes.
> I assume that the actual character then is always converted to a wide
> character, even if it is just common text not requiring a wide character
> (because it is easier to deal with uniform wide characters than
> varying-width multibyte representations with escape sequences to mark
> character set changes). How many bytes does the current ncurses use to
> store non-attribute character data? I would guess two 8-bit bytes
> internally per cell.
For wide characters, more than that: each cell has to allow for combining
characters (more than one ;-). The attributes are stored separately, in their
own field:
    #define CCHARW_MAX 5
    typedef struct
    {
        attr_t  attr;
        wchar_t chars[CCHARW_MAX];
    }
    cchar_t;
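As an illustration only (this is not ncurses code), a cell holding a base character plus a combining character could be filled as sketched below. The struct is a local copy named cell_t so it does not clash with curses.h, attr_t is assumed to be a plain unsigned, and cell_fill and cell_length are hypothetical helpers invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

#define CCHARW_MAX 5

typedef unsigned attr_t;  /* assumption: stand-in for the curses attr_t */

/* Local copy of the layout shown above, renamed to avoid clashing with curses.h. */
typedef struct
{
    attr_t  attr;
    wchar_t chars[CCHARW_MAX];
}
cell_t;

/* Hypothetical helper: store n wide characters (a base character followed by
 * combining characters) in one cell; unused slots are set to L'\0'. */
static void cell_fill(cell_t *cell, attr_t attr, const wchar_t *text, size_t n)
{
    size_t i;

    assert(n >= 1 && n <= CCHARW_MAX);
    cell->attr = attr;
    for (i = 0; i < CCHARW_MAX; i++)
        cell->chars[i] = (i < n) ? text[i] : L'\0';
}

/* Hypothetical helper: count the wide characters in use in one cell. */
static size_t cell_length(const cell_t *cell)
{
    size_t n = 0;

    while (n < CCHARW_MAX && cell->chars[n] != L'\0')
        n++;
    return n;
}
```

For example, filling a cell from the two wide characters L'e' and 0x0301 (combining acute accent) leaves cell_length at 2, with the attribute kept in its own field rather than in spare bits of the character.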
> > that was up til mid-2001 - I didn't quite know where to begin at
> > rewriting,
> > but one of the contributors got it moving. ncurses 5.3 was good enough to
> > use - the current code probably has isolated bugs, but I don't see any
> > that are related to wide-characters. Not all functions are tested - so
> > I've been reviewing, adding test-programs for places that are noticeably
> > not covered.
>
> Currently on Linux, I could display a copyright symbol ('c' inside of a
> circle) by outputting 169 decimal cast as character (8 bits) to the
> terminal. I'm looking at the man page for echochar, and it appears that
> ncurses came up with its own version of something similar to html/xml
> character entities, but the ncurses version is not as complete as
> html/xml entities. If I were to use a printw function with a %c format,
> feeding it 169 decimal (or anything from 128 through 255), will ncurses
> ever represent the output appearance differently than had I fed that
> decimal number (cast as 8 bit character) directly to a standard linux
> console or xterm?
Yes and no: the actual value written to the terminal depends on the locale.
169 is the Latin-1 (ISO-8859-1) code for the copyright sign. If your locale is
one of the ones that uses 8-bit characters, there's no real difference. If it's
one that uses UTF-8, the ISO-8859-1 values are represented internally the same
way, but written differently depending on the locale. UTF-8 uses the byte
values 128-255 differently: they occur only as parts of multibyte sequences, so
the copyright sign (U+00A9) is encoded as the two bytes 0xC2 0xA9 rather than
as the single byte 169.
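To make the byte-level difference concrete, here is a minimal sketch (not ncurses code) of how a Latin-1 value such as 169 turns into a UTF-8 byte sequence. The helper name utf8_encode_small is made up for this illustration, and it only handles code points up to 0x7FF, which covers all of ISO-8859-1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encode code point cp (0..0x7FF) as UTF-8 into out
 * and return the number of bytes written (1 or 2). */
static size_t utf8_encode_small(unsigned cp, unsigned char out[2])
{
    assert(cp <= 0x7FF);
    if (cp < 0x80) {
        /* ASCII: the same single byte in Latin-1 and UTF-8 */
        out[0] = (unsigned char) cp;
        return 1;
    }
    /* Two-byte form: 110xxxxx 10xxxxxx */
    out[0] = (unsigned char) (0xC0 | (cp >> 6));
    out[1] = (unsigned char) (0x80 | (cp & 0x3F));
    return 2;
}
```

Calling utf8_encode_small(169, out) returns 2 and fills out with 0xC2 0xA9, whereas an 8-bit Latin-1 locale would send the single byte 0xA9 for the same character.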
--
Thomas E. Dickey <address@hidden>
http://invisible-island.net
ftp://invisible-island.net