nmh-workers

Re: [Nmh-workers] Non-ASCII Characters in bodies and subjects


From: Ralph Corderoy
Subject: Re: [Nmh-workers] Non-ASCII Characters in bodies and subjects
Date: Tue, 17 Jun 2014 18:56:48 +0100

Hi Norm,

> So you are saying that "normal unix commands", such as grep, wc, tr
> etc, do or someday the GNU versions will, know about UTF-8, at least
> for file contents,

Yes, they do, today.  And have done for quite a while.  You need your
environment variables (LANG, LC_CTYPE, or LC_ALL) set up properly so
`locale' reports UTF-8 (or `utf8').  Then...

    $ grep -i roman chars
    Roman numerals Ⅰ Ⅱ Ⅲ Ⅳ Ⅴ Ⅵ Ⅶ Ⅷ Ⅸ Ⅹ Ⅺ Ⅻ Ⅼ Ⅽ Ⅾ Ⅿ
    $ grep £ chars
    Currency £ € cent-¢
    $ grep -i roman chars | sed -r 's/.*(.)/\1/'
    Ⅿ
    $ grep -i roman chars | sed -r 's/.*(.)/\1/' | hd
    00000000  e2 85 af 0a                                       |....|
    00000004
    $ 
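
A minimal sketch of how one might check the same thing in a script: ask
the locale which charmap it implies, then compare the byte count and
character count of the Ⅿ from the transcript above.  The variable names
(`charmap`, `sample`, `bytes`, `chars`) are illustrative only, and the
character is spelled as octal escapes so the script does not depend on
how the script file itself is encoded.

```shell
#!/bin/sh
# Ask the locale which character encoding it implies; fall back to
# "unknown" if the `locale` utility is not available.
charmap=$(locale charmap 2>/dev/null || echo unknown)
echo "charmap: $charmap"

# U+216F (ROMAN NUMERAL ONE THOUSAND) spelled as its three UTF-8 bytes,
# matching the e2 85 af shown by hd above.
sample=$(printf '\342\205\257')

# wc -c always counts bytes; wc -m counts characters as the current
# locale defines them.  tr strips the padding some wc versions add.
bytes=$(printf '%s' "$sample" | wc -c | tr -d ' ')
chars=$(printf '%s' "$sample" | wc -m | tr -d ' ')
echo "bytes=$bytes chars=$chars"
```

The byte count is 3 regardless of locale.  The character count should
come out as 1 when the charmap is UTF-8, and larger when the locale
treats every byte as its own character, which is a quick hint that the
environment is not set up as described.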

> if not for file names?

The Unix kernel stores each filename as a run of bytes that can contain
anything except `/' and NUL.  It places no interpretation on them
itself.  Userspace is able to do so, but two users might see different
names for the same file, just as they might `see' the same text file
differently if they think the bytes represent different encodings.

    $ >pound-£
    $ ls
    pound-£
    $ LC_ALL=C ls
    pound-??
    $ 
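
The two question marks are there because £ is two bytes in UTF-8 (c2
a3), and the C locale treats each as a separate unprintable character.
A rough sketch that makes the stored bytes visible, assuming `mktemp -d`
is available; the names `dir`, `name`, and `name_hex` are made up for
the example.

```shell
#!/bin/sh
# Work in a scratch directory so nothing else is touched.
dir=$(mktemp -d)

# "pound-" followed by the two UTF-8 bytes of U+00A3 (POUND SIGN).
name=$(printf 'pound-\302\243')
: > "$dir/$name"

# Read the name back with a shell glob rather than ls, so nothing
# rewrites unprintable bytes, then dump it as hex.
name_hex=$(for f in "$dir"/*; do printf '%s\n' "${f##*/}"; done |
           od -An -tx1 | tr -d ' \n')
echo "$name_hex"

rm -r "$dir"
```

The hex dump ends in c2 a3 0a: the two bytes of the pound sign plus the
newline added by printf.  The bytes on disk are the same whatever locale
the reader is in; only the interpretation changes.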

But really, these days, the whole world is UTF-8.  Unless it's Microsoft
with their backwards backwards-compatibility view of the world, and no
one cares about them.

Cheers, Ralph.


