Multibyte support (round 2)
From: Assaf Gordon
Subject: Multibyte support (round 2)
Date: Sat, 27 Aug 2016 01:05:05 -0400
Hello,
Attached is a second attempt at adding multibyte support for coreutils.
(continued from
http://lists.gnu.org/archive/html/coreutils/2016-07/msg00013.html).
Of course, this is just a rough draft and a basis for discussion, not final
in any way.
It includes four commits:
1. New module "multibyte": system-dependent definitions and multibyte
   detection code, based on code that was repeated in the current i18n patch.
   Also includes a "src/multibyte-test" program to test the detection code
   (of course not the final location for this executable).
2. New module "mbbuffer": provides a convenient interface for reading multibyte
   input data (from either fread(3) or read(2)) using a fixed-size buffer,
   calling mbrtowc and handling all cases. It also takes care of counting
   lines and column positions.
   Includes a test program, "src/mbbuffer-test", to test the buffering code.
3. The "unorm" program (as previously discussed) now uses the "mbbuffer" module,
   and the code is smaller and cleaner.
   It still assumes wchar_t == UCS4, but see the details below regarding
   why I think that's an acceptable assumption.
4. As a proof of concept, 'expand' with initial multibyte support,
   with the multibyte code being very similar to the single-byte code.
   Currently zero-width glyphs and combining chars are not handled.
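
To make the "fixed-size buffer, calling mbrtowc and handling all cases" idea
in item 2 concrete, here is a minimal sketch of such a read loop. It is not
the code from the attached patch: the names count_multibyte and
struct mb_counts are hypothetical, it only counts characters, invalid bytes,
and newlines, and it assumes a decoded NUL occupies one byte (true for UTF-8,
not necessarily for every encoding).

```c
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical result type, for illustration only.  */
struct mb_counts { size_t chars, invalid, lines; };

/* Decode FP in the current locale's encoding using a fixed-size buffer.
   Counts decoded characters, undecodable bytes, and newlines.  */
static void
count_multibyte (FILE *fp, struct mb_counts *out)
{
  char buf[4096];
  size_t have = 0;              /* bytes carried over from the last pass */
  mbstate_t st;

  memset (&st, 0, sizeof st);
  memset (out, 0, sizeof *out);

  for (;;)
    {
      size_t got = fread (buf + have, 1, sizeof buf - have, fp);
      int at_eof = (got == 0);  /* nothing more to read (EOF or error) */
      size_t pos = 0;

      have += got;
      while (pos < have)
        {
          wchar_t wc;
          mbstate_t saved = st;
          size_t n = mbrtowc (&wc, buf + pos, have - pos, &st);

          if (n == (size_t) -2)
            {
              /* Incomplete sequence at the end of the buffer.  */
              st = saved;
              if (!at_eof)
                break;          /* keep the bytes and refill the buffer */
              out->invalid += have - pos;   /* truncated input */
              pos = have;
            }
          else if (n == (size_t) -1)
            {
              /* Invalid sequence: skip one byte and reset the state.  */
              memset (&st, 0, sizeof st);
              out->invalid++;
              pos++;
            }
          else
            {
              if (n == 0)
                n = 1;          /* decoded NUL; assumed to be one byte */
              out->chars++;
              if (wc == L'\n')
                out->lines++;
              pos += n;
            }
        }

      if (at_eof)
        break;
      /* Move the unconsumed tail to the front before reading more.  */
      memmove (buf, buf + pos, have - pos);
      have -= pos;
    }
}
```

The point of the sketch is the three-way split on the mbrtowc return value:
(size_t) -2 means "need more input" unless we are at end of file, (size_t) -1
means "skip a byte and resynchronize", and anything else is a decoded
character. A caller would run setlocale (LC_ALL, "") first and then pass an
open stream, e.g. count_multibyte (stdin, &counts).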
====
Regarding wchar_t == UCS:
1. 'unorm' only uses the wchar_t value directly if unicode normalization
is requested (otherwise, it prints the multibyte octets as-is).
2. If normalization is requested, I think it's safe to assume the
locale is unicode-related (e.g. *unicode* normalization under
iso88591/shift-jis/Big5/eucJP locales is not meaningful).
3. For now, I'm assuming unicode-supporting locales are de-facto UTF-8,
but I suspect this can be relaxed if needed.
And so, the question becomes:
When the locale is "UTF-8", is the internal representation of 'wchar_t'
identical to UCS2 or UCS4 (i.e. Unicode code points)?
While the standard explicitly says this cannot be assumed,
I think in practice it is always the case.
It is so in glibc and musl-libc,
and in OpenBSD, FreeBSD, and NetBSD with "UTF-8" locales (but not in
non-UTF-8 locales).
In OpenSolaris with unicode locales, wchar_t is UTF-32
(https://docs.oracle.com/cd/E36784_01/html/E39536/gmwkm.html ).
For AIX, wchar_t is either UCS2 or UCS4 in unicode locales (for 32-bit/64-bit
binaries respectively; see
https://www.ibm.com/support/knowledgecenter/en/ssw_aix_53/com.ibm.aix.nls/doc/nlsgdrf/codeset_over.htm
).
I'd be very interested to learn about more systems, but I hope this
non-standardized behavior is prevalent enough to be relied upon.
The current implementation of 'unorm' first checks if 'wchar_t==UCS4', and only
allows unicode-normalization if it is.
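
One way such a runtime check could look is sketched below. This is an
illustration, not the check from the attached patch: the function name
locale_wchar_is_ucs4 and the choice of probe characters are my assumptions.
The idea is to decode a few known UTF-8 sequences with mbrtowc and verify
that the resulting wchar_t values equal the expected Unicode code points.

```c
#include <locale.h>
#include <stdbool.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: switch LC_CTYPE to LOCALE_NAME ("" means "from the
   environment") and report whether mbrtowc maps known UTF-8 sequences to
   their Unicode code points.  Returns false if the locale is unavailable,
   if wchar_t is too narrow for UCS4, or if any probe decodes differently.  */
static bool
locale_wchar_is_ucs4 (const char *locale_name)
{
  static const struct { const char *utf8; unsigned long cp; } probes[] = {
    { "A",                0x41 },     /* U+0041, 1 byte  */
    { "\xC3\xA9",         0xE9 },     /* U+00E9, 2 bytes */
    { "\xE2\x82\xAC",     0x20AC },   /* U+20AC, 3 bytes */
    { "\xF0\x9F\x98\x80", 0x1F600 },  /* U+1F600, 4 bytes, outside the BMP */
  };

  if (setlocale (LC_CTYPE, locale_name) == NULL)
    return false;
  if (sizeof (wchar_t) < 4)
    return false;               /* e.g. a 16-bit (UCS2-style) wchar_t */

  for (size_t i = 0; i < sizeof probes / sizeof probes[0]; i++)
    {
      mbstate_t st;
      wchar_t wc;
      size_t len = strlen (probes[i].utf8);

      memset (&st, 0, sizeof st);
      /* The whole sequence must be consumed and yield the code point.  */
      if (mbrtowc (&wc, probes[i].utf8, len, &st) != len
          || (unsigned long) wc != probes[i].cp)
        return false;
    }
  return true;
}
```

A program would call locale_wchar_is_ucs4 ("") once at startup and refuse
normalization options if it returns false. In the "C" locale the multi-byte
probes cannot decode as single characters, so the function returns false
there. ISO C also defines the __STDC_ISO_10646__ macro as a compile-time
hint, but a runtime probe like this also covers the "depends on the locale"
cases mentioned above for the BSDs.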
Comments are very welcome,
- assaf
Assaf Gordon (4):
build: multibyte: new module
build: mbbuffer: new module
unorm: a new program to fix and normalize multibyte files
expand: add multibyte support
AUTHORS | 1 +
README | 2 +-
bootstrap.conf | 7 +
build-aux/gen-lists-of-programs.sh | 1 +
doc/coreutils.texi | 20 +-
man/.gitignore | 1 +
man/local.mk | 1 +
man/unorm.x | 4 +
po/POTFILES.in | 1 +
scripts/git-hooks/commit-msg | 2 +-
src/.gitignore | 1 +
src/expand-common.c | 16 +-
src/expand-common.h | 5 +
src/expand.c | 144 ++++++++++-
src/local.mk | 19 +-
src/mbbuffer-test.c | 295 +++++++++++++++++++++
src/mbbuffer.c | 305 ++++++++++++++++++++++
src/mbbuffer.h | 176 +++++++++++++
src/multibyte-test.c | 92 +++++++
src/multibyte.c | 153 +++++++++++
src/multibyte.h | 101 ++++++++
src/unorm.c | 512 +++++++++++++++++++++++++++++++++++++
tests/local.mk | 2 +
tests/misc/expand-multibyte.pl | 106 ++++++++
tests/misc/expand.pl | 32 +++
tests/misc/unorm.pl | 178 +++++++++++++
26 files changed, 2171 insertions(+), 6 deletions(-)
create mode 100644 man/unorm.x
create mode 100644 src/mbbuffer-test.c
create mode 100644 src/mbbuffer.c
create mode 100644 src/mbbuffer.h
create mode 100644 src/multibyte-test.c
create mode 100644 src/multibyte.c
create mode 100644 src/multibyte.h
create mode 100644 src/unorm.c
create mode 100755 tests/misc/expand-multibyte.pl
create mode 100755 tests/misc/unorm.pl
multibyte-2016-08-27.patch.xz
Description: Binary data