multibyte support (round 4) - tr
From: Assaf Gordon
Subject: multibyte support (round 4) - tr
Date: Mon, 11 Dec 2017 02:14:11 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.4.0
Hello,
Some progress with the multibyte support: partial multibyte processing
in 'tr'. Currently only delete and squeeze work (and not efficiently).
Translation and the -C/-c differences are not implemented yet.
The patch is getting too big to attach, so it is available here:
https://files.housegordon.org/src/coreutils-multibyte-2017-12-11.patch.xz
(perhaps a non-master branch on the savannah git would be better?)
The patch includes all previous code, and the last four commits
are the 'tr' implementation. Below are commit messages with examples.
For those interested, past information is available here:
https://crashcourse.housegordon.org/coreutils-multibyte-support.html
Comments welcome,
- assaf
====
commit f6bb9c906eaf1644f18e66fedc211a8de91057d1
Author: Assaf Gordon <address@hidden>
Date: Fri Dec 8 21:39:45 2017 -0700
tr: add --debug option
Prints the content of the SET(s).
Future patches will also print multibyte-related information.
Example:
$ ./src/tr --debug -d 'A-Z[:digit:]\250'
./src/tr: hard_LC_COLLATE: yes
./src/tr: operating mode: delete (-d)
./src/tr: set: set1
./src/tr: logical length: 37
./src/tr: indefinite repeats: no
./src/tr: has_equiv_class: no
./src/tr: has_char_class: yes
./src/tr: has_restricted_char_class: yes
./src/tr: SpecList:
./src/tr: RANGE: 'A'-'Z' (0x41 - 0x5a)
./src/tr: CHAR_CLASS: [:digit:]
./src/tr: NORMAL_CHAR: '' (0xa8)
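The "logical length: 37" above appears to be the number of elements in the
expanded set: 26 for A-Z, 10 for [:digit:] and 1 for \250. A quick check in
Python (illustration only, not part of the patch; it assumes [:digit:]
expands to 0-9):

```python
import string

# Expand SET1 = 'A-Z[:digit:]\250' by hand.
uppercase_range = [chr(c) for c in range(ord("A"), ord("Z") + 1)]  # RANGE 'A'-'Z'
digit_class = list(string.digits)  # CHAR_CLASS [:digit:], assumed to be 0-9
normal_char = [0o250]              # NORMAL_CHAR, octal 250 == 0xa8

logical_length = len(uppercase_range) + len(digit_class) + len(normal_char)
print(logical_length, hex(normal_char[0]))
```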
commit c40e2aebe07f57b23614bb764959ece1f2156944
Author: Assaf Gordon <address@hidden>
Date: Fri Dec 8 23:37:36 2017 -0700
tr: support multibyte characters in SET parameters
The typical tr command line is 'tr SET1 SET2'
(or 'tr -d SET1', 'tr -ds SET1 SET2', etc.)
Previously there were only 5 types of elements in SETs:
  single character (=octet)
  range
  repeated character (=octet)
  character class (e.g. [:alpha:])
  equivalence class (e.g. [=e=])
This adds a sixth type: wide character.
These are stored only if:
1. The current locale supports multibyte characters
2. The multibyte sequence is valid
3. The sequence is indeed multibyte (single octets are stored
as before)
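
The three conditions above can be sketched in Python as follows. This is
only an illustration, not the code in the patch: classify_set_element is a
hypothetical name, and UTF-8 decoding stands in for whatever decoding the
locale would actually require.

```python
def classify_set_element(raw: bytes, locale_is_multibyte: bool):
    """Return ('wide', char) or ('octets', [byte values]) for one SET element.

    Hypothetical helper; assumes the multibyte locale uses UTF-8.
    """
    if locale_is_multibyte:                       # condition 1
        try:
            text = raw.decode("utf-8")            # condition 2: valid sequence
        except UnicodeDecodeError:
            text = None
        # condition 3: exactly one character, encoded with more than one octet
        if text is not None and len(text) == 1 and len(raw) > 1:
            return ("wide", text)
    # Everything else is stored as individual octets, as before.
    return ("octets", list(raw))


print(classify_set_element(b"\xce\xa8", True))   # valid multibyte sequence
print(classify_set_element(b"\xce\xa8", False))  # same bytes, single-byte locale
print(classify_set_element(b"\xce", True))       # truncated sequence
```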
Multibyte characters can only be specified by using new-style shell
escapes ($'...') in a multibyte locale, or by entering the character
directly:
LC_ALL=en_CA.UTF-8 tr -d $'\316\250'
LC_ALL=en_CA.UTF-8 tr -d 'Ψ'
Escape sequences (which are un-escaped by tr itself) are never treated
as multibyte characters. The following always deletes two octets
(\316 and \250) regardless of active locale:
tr -d '\316\250'
This is likely against POSIX, but it is discussed here:
https://lists.gnu.org/r/coreutils/2017-09/msg00028.html
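
To illustrate why the two forms differ, here is a small Python sketch
(illustration only; it assumes UTF-8 input and does not call tr). The
octets \316 \250 are the UTF-8 encoding of 'Ψ', and 'Π' is encoded as
\316 \240, so deleting the two octets individually also removes the lead
byte of 'Π' and leaves an invalid sequence behind:

```python
psi_octets = bytes([0o316, 0o250])        # what $'\316\250' expands to
data = "ΨΠ".encode("utf-8")               # CE A8 CE A0

# Octet-wise deletion, as with: tr -d '\316\250'
octet_result = bytes(b for b in data if b not in psi_octets)

# Character-wise deletion, as with: tr -d 'Ψ' in a UTF-8 locale
char_result = "".join(
    ch for ch in data.decode("utf-8") if ch != psi_octets.decode("utf-8")
).encode("utf-8")

print(octet_result.hex(), char_result.hex())
```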
commit 3083161add6a2f14a32718bf755cc2d3da2e8765
Author: Assaf Gordon <address@hidden>
Date: Sat Dec 9 00:58:05 2017 -0700
tr: optimize by skipping multibyte processing if possible
Under certain conditions it is safe to process the input as octets,
without the multibyte decoding and validation that would otherwise be
needed.
These conditions are discussed here (bottom of text):
https://lists.gnu.org/r/coreutils/2017-09/msg00028.html
An undocumented option (tr ---force-multibyte) disables the
optimization in order to exercise the multibyte (MB) code path.
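
The exact conditions are in the linked message and are not reproduced
here. As one example of the kind of condition involved (an assumption, not
necessarily what the patch checks): in UTF-8, octets below 0x80 never occur
inside a multibyte sequence, so deleting a set of plain ASCII characters
gives the same result octet-wise and character-wise. A Python sketch with
hypothetical helper names:

```python
def can_process_as_octets(set_chars: str) -> bool:
    """Assumed condition: every SET element is a plain ASCII character."""
    return all(ord(c) < 0x80 for c in set_chars)


def delete_octets(data: bytes, set_chars: str) -> bytes:
    """Fast path: delete byte by byte, no decoding."""
    return data.translate(None, set_chars.encode("ascii"))


def delete_chars(data: bytes, set_chars: str) -> bytes:
    """Slow path: decode UTF-8, delete whole characters, re-encode."""
    kept = (c for c in data.decode("utf-8") if c not in set_chars)
    return "".join(kept).encode("utf-8")


data = "aЩb1Σ2".encode("utf-8")
fast = delete_octets(data, "ab12")
slow = delete_chars(data, "ab12")
print(fast == slow, can_process_as_octets("Σ"))
```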
commit c5f812bab3602613e1140bd5d9e92d14097bc8dd
Author: Assaf Gordon <address@hidden>
Date: Sat Dec 9 02:22:11 2017 -0700
tr: implement multibyte delete/squeeze
The following examples work:
Delete character class:
$ echo "aAЩщΣ123ΠπfĚg" | ./src/tr -d '[:alpha:]'
123
$ echo "aAЩщΣ123ΠπfĚg" | ./src/tr -d '[:lower:]'
AЩΣ123ΠĚ
Delete + complement:
$ echo "aAЩщΣ123ΠπfĚg" | ./src/tr -dc '[:lower:][:cntrl:]'
aщπfg
$ echo "aAЩщΣ123ΠπfĚg" | ./src/tr -dc '[:upper:][:cntrl:]'
AЩΣΠĚ
$ echo "aAЩщΣ123ΠπfĚg" | ./src/tr -dc 'Σ'
Σ
Squeeze repeated characters:
$ echo "ЩЩЩщщщщ" | ./src/tr -s 'щ'
ЩЩЩщ
$ echo "aaaAAAAЩЩЩЩщщщщΠΠΠΠππππfĚg" \
| ./src/tr -s '[:lower:]'
aAAAAЩЩЩЩщΠΠΠΠπfĚg
Squeeze + complement:
$ echo "aaaAAAAЩЩЩЩщщщщΠΠΠΠππππfĚg" \
| ./src/tr -c -s '[:lower:]'
aaaAЩщщщщΠππππfĚg
Delete + Squeeze:
$ echo "aaaAAAAЩЩЩЩщщщщΠΠΠΠππππfĚg" \
| ./src/tr -d -s '[:upper:]' '[:lower:]'
aщπfg
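
For reference, the delete and squeeze semantics shown in the examples above
can be modelled on decoded characters with a short Python sketch. This is
not the C implementation in the patch: the function names are hypothetical,
and Python's Unicode predicates (str.isalpha, str.islower, str.isupper)
stand in for the locale's character classes.

```python
from typing import Callable

Predicate = Callable[[str], bool]


def membership(pred: Predicate, complement: bool) -> Predicate:
    """Return the set-membership test, inverted when -c is given."""
    return (lambda ch: not pred(ch)) if complement else pred


def delete(text: str, pred: Predicate, complement: bool = False) -> str:
    """Model of tr -d: drop every character that is in the set."""
    member = membership(pred, complement)
    return "".join(ch for ch in text if not member(ch))


def squeeze(text: str, pred: Predicate, complement: bool = False) -> str:
    """Model of tr -s: collapse runs of the same in-set character."""
    member = membership(pred, complement)
    out = []
    for ch in text:
        if out and out[-1] == ch and member(ch):
            continue  # repeated in-set character: skip it
        out.append(ch)
    return "".join(out)


s = "aAЩщΣ123ΠπfĚg"
t = "aaaAAAAЩЩЩЩщщщщΠΠΠΠππππfĚg"
print(delete(s, str.isalpha))                       # tr -d '[:alpha:]'
print(squeeze(t, str.islower, complement=True))     # tr -c -s '[:lower:]'
print(squeeze(delete(t, str.isupper), str.islower)) # tr -d -s upper lower
```

The input strings here omit the trailing newline that echo adds, which is
why the [:cntrl:] class from the complement examples is not modelled.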