seq feature: print letters
From: assafgordon
Subject: seq feature: print letters
Date: Mon, 30 Jun 2014 10:23:58 +0000
User-agent: Heirloom mailx 12.5 6/20/10

Hello,
I'd like to suggest a patch to allow seq to generate letter sequences.
With this patch, 'seq' can print the letters of the alphabet of the current
locale (or of a user-specified language). Examples:
# print all letters in the current alphabet
seq --alphabet
seq -a
# print the first 10 letters in the current alphabet
seq -a 10
# print the fifth to tenth letters of the current alphabet
seq -a 5 10
# print the letters of the Russian alphabet
# (assuming the locale is installed)
LC_ALL=ru_RU.utf-8 seq -a
# print the letters of the Hebrew alphabet
# (assuming the current locale supports UTF-8 or
# other encoding supported by gnulib/libunistring)
seq --alphabet=he
More details follow:
This has been suggested before, and there were several hurdles:
1. How to handle non-C locales (with letters beyond 7-bit ASCII)
2. How to handle EBCDIC (or other encodings where the letters are not
sequential in their ordinal values)
3. How to handle input letters (e.g. seq from "à" to "ö").
I believe the following patch can address these issues.
1. In alphabet mode, 'seq' will deal with the defined sequence of letters in
each language:
Instead of dealing with numeric codes of letters (e.g. from 65=A to 69=E),
it deals with the first letter in language EN to the fifth letter in language
EN (or any other language).
2. Unicode/CLDR already maintains a list of official letters of the alphabet
for each language. Note that the list is not the same as "isalpha()": it only
contains the list of letters in the official alphabet.
Example:
In English/EN, the list contains A-Z, as expected.
In French/FR, the list still contains just A-Z - those are the official letters
in the French alphabet, while letters with an acute accent, grave accent or
circumflex (é, è, à, â, etc.) are only considered diacritics, not stand-alone
letters.
In Swedish/SV, the list contains A-Z plus å ä ö, while à é are considered
diacritics.
Similar lists are maintained for each language in the Unicode/CLDR database.
3. Internally, seq will store the list of letters for each language as UTF-8 -
this will avoid ambiguity, and gnulib's functions will provide conversion to
the current encoding.
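
Points 1-3 above can be illustrated with a small sketch. This is not the code
from the patch (which is C, in 'src/alphabet.c'); it is a Python illustration
under stated assumptions: the ALPHABETS table, the function names, and the use
of lowercase letters are all hypothetical, and Python's str.encode() merely
stands in for the gnulib conversion step.

```python
# Hypothetical per-language letter tables. In the patch these would be
# generated from Unicode/CLDR; only two hand-written entries are shown here.
ALPHABETS = {
    "en": "abcdefghijklmnopqrstuvwxyz",
    "sv": "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6",  # plus å ä ö
}


def seq_letters(lang, first=1, last=None):
    """Return letters first..last (1-based, inclusive) of lang's alphabet.

    Positions index into the language's own letter list, so the result
    does not depend on the numeric code points of the letters.
    """
    letters = ALPHABETS[lang]
    if last is None:
        last = len(letters)
    if first < 1 or last > len(letters):
        raise ValueError("position outside the alphabet")
    return list(letters[first - 1:last])


def encode_for_output(letter, encoding):
    """Convert one letter to the output encoding.

    Stand-in for the conversion to the current locale's encoding;
    raises UnicodeEncodeError if the encoding cannot represent the letter.
    """
    return letter.encode(encoding)
```

Under these assumptions, seq_letters("en", 5, 10) corresponds to 'seq -a 5 10'
and seq_letters("sv") corresponds to 'seq --alphabet=sv'.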
If this approach is acceptable, then we can plan for further features, such as:
4. Allow multi-character output, e.g. with English, after "z" wrap to "aa",
"ab", etc.
5. Allow specifying start/end with letters instead of numbers (e.g. "seq --abc
é z"), and apply collating rules to find which character in the alphabet to
start from.
6. The language database (in './src/alphabet.c') is not perfect. It was
automatically generated by extracting information from the Unicode/CLDR XML
files. For some languages there are obvious errors (such as characters
incorrectly converted from designations such as "\u093C"). For other
languages, the code used by Unicode is not necessarily compatible with the
locale name. But for most languages I believe the information is valid, and
for the few incorrect definitions, I think they could be easily fixed by manual
inspection.
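
The possible extensions in points 4 and 5 could look roughly like the sketch
below. It is only an illustration, not part of the patch: the function names
are hypothetical, point 4 is read as bijective base-N numbering over the
alphabet, and point 5 is approximated by stripping diacritics via Unicode
normalization rather than by real locale collation rules.

```python
import unicodedata

EN = "abcdefghijklmnopqrstuvwxyz"
SV = EN + "\u00e5\u00e4\u00f6"  # Swedish adds å ä ö as stand-alone letters


def letter_label(n, letters=EN):
    """Map 1-based n to a, b, ..., z, aa, ab, ... (point 4)."""
    if n < 1:
        raise ValueError("n must be >= 1")
    base = len(letters)
    out = []
    while n > 0:
        # Subtract 1 first: there is no "zero" letter in this numbering.
        n, rem = divmod(n - 1, base)
        out.append(letters[rem])
    return "".join(reversed(out))


def letter_position(ch, letters):
    """Return the 1-based position of ch in letters (point 5).

    An exact match wins, so "å" is its own letter in Swedish. Otherwise
    the base character left after decomposition is tried, so "é" maps
    to the position of "e".
    """
    ch = unicodedata.normalize("NFC", ch)
    if ch in letters:
        return letters.index(ch) + 1
    base = unicodedata.normalize("NFD", ch)[0]
    if base in letters:
        return letters.index(base) + 1
    raise ValueError("not a letter of this alphabet")
```

With these definitions, letter_label(27) gives "aa", and letter_position("é",
EN) gives 5, which is where a command like 'seq --abc é z' would start under
this reading.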
Comments are welcome,
- Gordon
P.S.
Adding a few gnulib modules to 'bootstrap.conf' uncovered possible minor issues:
1. string_hash() in 'localename.c' requires _GL_ATTRIBUTE_PURE
2. gl_locale_name_default() in 'localename.c' requires _GL_ATTRIBUTE_CONST
3. striconveh.c triggers the following warning (fatal here because of -Werror):
CC lib/striconveh.o
In file included from lib/striconveh.c:34:0:
lib/c-strcaseeq.h: In function 'strcaseeq7.constprop.10':
lib/c-strcaseeq.h:47:23: error: array subscript is above array bounds [-Werror=array-bounds]
cc1: all warnings being treated as errors
seq_letters_2014_06_30.patch.xz
Description: Binary data