Re: seq feature: print letters
From: Assaf Gordon
Subject: Re: seq feature: print letters
Date: Tue, 08 Jul 2014 23:01:34 -0400
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0

Hello,
On 06/30/2014 06:23 AM, address@hidden wrote:
> I'd like to suggest a patch to allow seq to generate letter sequences.
Attached is an improved implementation for the same functionality:
( http://lists.gnu.org/archive/html/coreutils/2014-06/msg00090.html )
With this patch, 'seq' can print letters of alphabets in the current locale
(or user-specified language). Examples:
# print all letters in the current alphabet
seq --alphabet
seq -a
# print the first 10 letters in the current alphabet
seq -a 10
# print the letters of the Russian alphabet
# (assuming the locale is installed)
LC_ALL=ru_RU.utf-8 seq -a
# print the letters of the Hebrew alphabet
# (assuming the current locale supports UTF-8 or
# another encoding supported by gnulib/libunistring)
seq --alphabet=he
The new data takes ~5100 bytes (instead of the previous >15KB).
It requires a one-time encoding of a textual 'database' file (included) using a
perl script (included).
This is conceptually similar to the Unicode tables: it only needs to be redone when an
alphabet is updated.
The alphabets are encoded in 'src/alphabets_data.h'.
The decoder is in 'src/alphabets.{c,h}' .
The added functionality is in a few new functions in 'src/seq.c' .
===
If you think that this is an acceptable feature (at least conceptually), then
I'd be happy to discuss further details,
such as which languages to include, and implementation suggestions (for
example, should this be moved to gnulib?).
Are there any important encoding issues I might have missed? The code tries to
be as portable as possible: it stores UCS values internally, converts them to
UTF-8 with 'u8_uctomb()', then prints them with 'u8_strconv_to_locale()', so it
makes no assumption about the active encoding.
Should there be an interface for multi-letter output (e.g. "aa" after "z")?
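
To make the conversion chain concrete, here is a rough POSIX sh sketch of the first step (UCS value to UTF-8 bytes). It is only an illustration, not the C code from the patch: the function name 'ucs_to_utf8' is made up for this example, and it does no validation (surrogates and out-of-range values are not rejected).

```shell
# Hypothetical helper (not part of the patch): print the UTF-8 bytes
# for one code point given as a decimal or 0x-prefixed hex number.
ucs_to_utf8() {
    cp=$(($1))
    if [ "$cp" -lt 128 ]; then
        # 1 byte: 0xxxxxxx
        set -- "$cp"
    elif [ "$cp" -lt 2048 ]; then
        # 2 bytes: 110xxxxx 10xxxxxx
        set -- $((0xC0 | (cp >> 6))) $((0x80 | (cp & 0x3F)))
    elif [ "$cp" -lt 65536 ]; then
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        set -- $((0xE0 | (cp >> 12))) $((0x80 | ((cp >> 6) & 0x3F))) \
               $((0x80 | (cp & 0x3F)))
    else
        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        set -- $((0xF0 | (cp >> 18))) $((0x80 | ((cp >> 12) & 0x3F))) \
               $((0x80 | ((cp >> 6) & 0x3F))) $((0x80 | (cp & 0x3F)))
    fi
    # Emit each byte through a \0ddd octal escape, which %b understands.
    for b in "$@"; do
        printf '%b' "\\0$(printf '%o' "$b")"
    done
}
```

The second step (UTF-8 to the locale's encoding) could then be approximated with something like 'ucs_to_utf8 0x5d0 | iconv -f UTF-8 -t "$(locale charmap)"', assuming iconv knows the charmap name that locale reports.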
===
Regarding Bernhard's comment:
On 07/03/2014 02:18 AM, Bernhard Voelker wrote:
> The user could let the shell produce the input:
>   $ printf "%c" {a..z} | seq -s ' ' --alpha=- 2 2 6
>   b d f
> thus picking the Nth character from the input. ;-)
I don't think this example is portable, as "{a..z}" is not in POSIX sh, so
can't be used in scripting.
However, more generally, it's easy to generate ranges of Unicode symbols if
their values are known:
# Arabic letters (unicode block 0x627 - 0x64a)
seq $((0x627)) $((0x64a)) | xargs env printf '\\\\u%04x\\\\n' | xargs env printf
# Cyrillic letters (unicode block 0x410 - 0x42f)
seq $((0x410)) $((0x42f)) | xargs env printf '\\\\u%04x\\\\n' | xargs env printf
But the problem is that the official alphabet letters of each language are very
irregular:
For example, a few symbols in the Arabic range aren't official ordinal letters
(they are valid alphabet symbols, used for a letter only under certain
conditions).
Also, in some languages, a letter is actually two Unicode symbols (e.g. in Czech, "Ch" is a single
letter, in addition to the "C" and "H" letters).
In non-English Latin-based languages, besides the simple ASCII letters A-Z,
there are additional symbols which do not have sequential Unicode values.
Whether this feature is desired or not in coreutils is one question. But if it is (for
more languages than English), then I think simple "ranges" will not suffice.
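
As a small illustration of why a code point range is not enough, here is a sketch that keeps an explicit, ordered list with one letter per line and picks the Nth entry from it. The function names are made up for this example and the list is only the start of the Czech alphabet; it says nothing about how the patch stores its data.

```shell
# Hypothetical helper: the first 15 letters of the Czech alphabet,
# one letter per line. Note that "ch" is one letter made of two symbols.
czech_letters() {
    printf '%s\n' a á b c č d ď e é ě f g h ch i
}

# Print the Nth line (letter) of standard input.
nth_letter() {
    sed -n "${1}p"
}

# The 14th Czech letter; no range of code points could produce it.
czech_letters | nth_letter 14
```

The last command prints "ch", which sits between "h" and "i" in the alphabet even though it has no code point of its own.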
Comments are welcome,
-gordon
seq_alphabet.2014-07-08.patch.xz
Description: application/xz