Re: multibyte support (round 4)
From: Pádraig Brady
Subject: Re: multibyte support (round 4)
Date: Sat, 8 Apr 2017 17:44:26 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0

On 08/04/17 01:58, Assaf Gordon wrote:
> Hello,
>
> I think that we've handled the low-hanging fruit (e.g. expand/cut/fold) when
> it comes to multibyte support in coreutils.
> The remaining programs (e.g. sort,join,uniq,tr,od) present some challenges -
> both in terms of what the 'correct' (and useful) behavior is,
> and in terms of implementation.
>
> I also think a common thread is the combination of these three requirements:
> 1. Invalid sequences must be handled as single bytes
> 2. Can't rely on native wchar_t (e.g. on Cygwin) without extra work
> 3. Can't assume UTF-8 (or even Unicode).
>
> Each requirement by itself is not too problematic - but combined
> they make a portable and efficient implementation quite cumbersome.
>
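For reference, requirement 1 usually comes down to a decode loop along the
lines of the sketch below. It's only an illustration, not code from coreutils
or gnulib (the process_buffer() helper and the sample input are made up), and
note it leans entirely on the native mbrtowc()/wchar_t machinery:

/* Minimal sketch: scan BUF of LEN bytes in the current locale,
   treating any invalid or incomplete sequence as a single byte,
   per requirement 1 above.  */
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

static void
process_buffer (const char *buf, size_t len)
{
  mbstate_t st;
  memset (&st, 0, sizeof st);

  for (size_t i = 0; i < len; )
    {
      wchar_t wc;
      size_t n = mbrtowc (&wc, buf + i, len - i, &st);

      if (n == (size_t) -1 || n == (size_t) -2)
        {
          /* Invalid or truncated sequence: consume exactly one byte
             and reset the conversion state before continuing.  */
          printf ("raw byte 0x%02x\n", (unsigned char) buf[i]);
          memset (&st, 0, sizeof st);
          i += 1;
        }
      else
        {
          if (n == 0)           /* embedded NUL decodes to L'\0' */
            n = 1;
          printf ("character of %zu byte(s)\n", n);
          i += n;
        }
    }
}

int
main (void)
{
  setlocale (LC_ALL, "");
  /* In a UTF-8 locale: 'a', then U+00E9, then a lone invalid byte.  */
  const char input[] = "a\xC3\xA9\xFF";
  process_buffer (input, sizeof input - 1);
  return 0;
}

Which is exactly where requirements 2 and 3 start to bite: every tool would
need some variant of this, plus a fallback for systems where the native
routines can't be trusted.
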
> I'd like to ask a heretical question:
> what if we can relax these requirements?
> Specifically, what if we can agree that on systems where wchar_t
> is not sufficient, we only support UTF-8 (and thus use gnulib's internal fast
> implementations)?
> (I would love to suggest supporting only UTF-8 everywhere, but I'm sure this
> would not be accepted...)
Well, even UTF-8-only is better than the current upstream situation,
though that would probably preclude Red Hat and SUSE from just taking
the upstream code, so it's probably best to support other encodings.
Though UTF-8 definitely should have preferential treatment.
I think it's a totally valid compromise to not support other encodings
on systems without an adequate wchar_t.
> I will continue to work on multibyte support in any case,
> but I think it will make things much better if we are not tied down by these
> (legacy?) issues.
>
> With a bit of hand-waving, wouldn't it be reasonable to say that the largest
> portion of GNU coreutils users have systems that both have a usable wchar_t
> *and* work primarily in UTF-8?
The vast majority, yes. This is a totally fine compromise.
> At the risk of mixing apples and oranges, checking the encoding of websites
> shows that UTF-8 is clearly dominating over time:
> https://w3techs.com/technologies/details/en-utf8/all/all
> http://pinyin.info/news/2015/utf-8-unicode-vs-other-encodings-over-time/
> I know coreutils is not meant for the web, but I hope this does hint that
> UTF-8 is gaining popularity beyond websites.
>
> Looking at other implementations, some chose to switch to UTF-8 completely
> (e.g. OpenBSD-6, or Linux with musl-libc). Others have a usable wchar_t and
> have supported multibyte processing for a long time (e.g. FreeBSD, Mac OS X).
>
> I have skimmed through past mailing-list discussions, and Eric has been
> replying since about 2006 saying essentially "if someone comes up with an
> efficient implementation we'll add it" - but despite many attempts, we still
> don't have one.
>
> It won't be a regression for these few limited systems - because currently
> coreutils doesn't provide any multibyte support.
Another thing to consider is that each tool doesn't have to carry the full
stack of encoding considerations, i.e. you could/should have other tools
convert to/from legacy encodings. That also applies to error handling
and normalization, as previously discussed, i.e. have a single tool
cater for these, leaving other tools to focus on just their own functions.
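To make that concrete: the "single tool" for legacy encodings can be as small
as an iconv(3) filter in front of the pipeline (in practice iconv(1) already
does this job). The sketch below is only an illustration of the idea, not a
proposal for a new program; the ISO-8859-1 default and the buffer sizes are
arbitrary:

/* Sketch of a "convert at the edges" filter: recode stdin from a legacy
   encoding to UTF-8 so everything downstream only ever sees UTF-8.  */
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main (int argc, char **argv)
{
  /* Source encoding from argv[1]; ISO-8859-1 is just a placeholder default.  */
  const char *from = argc > 1 ? argv[1] : "ISO-8859-1";
  iconv_t cd = iconv_open ("UTF-8", from);
  if (cd == (iconv_t) -1)
    {
      perror ("iconv_open");
      return EXIT_FAILURE;
    }

  char inbuf[4096];
  char outbuf[4 * sizeof inbuf];  /* UTF-8 needs at most 4 bytes per character */
  size_t inleft = 0;

  for (;;)
    {
      size_t nread = fread (inbuf + inleft, 1, sizeof inbuf - inleft, stdin);
      inleft += nread;
      if (inleft == 0)
        break;                    /* EOF and nothing left to convert */

      char *inp = inbuf;
      char *outp = outbuf;
      size_t outleft = sizeof outbuf;

      if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1
          && errno != EINVAL)     /* EINVAL: sequence split across reads */
        {
          perror ("iconv");       /* e.g. EILSEQ on invalid input */
          return EXIT_FAILURE;
        }
      fwrite (outbuf, 1, sizeof outbuf - outleft, stdout);

      /* Carry any partial multibyte sequence over to the next read.  */
      memmove (inbuf, inp, inleft);
      if (nread == 0)
        break;                    /* EOF with an unconvertible tail */
    }

  iconv_close (cd);
  return EXIT_SUCCESS;
}

Run as, say, ./to-utf8 SHIFT_JIS < legacy.txt | sort (the name is made up),
the downstream tools only ever have to deal with UTF-8, plus whatever stray
bytes we agree to pass through as-is.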
> Lastly,
> I've arranged my notes into a web page.
> I hope these notes will save some time for others interested in
> catching up on the multibyte issue (except for the time it'll take to read my
> notes (-: ) :
> http://crashcourse.housegordon.org/coreutils-multibyte-support.html
Wow, lots of notes there.
Many thanks for sharing them.
cheers,
Pádraig