coreutils

seq feature: print letters


From: assafgordon
Subject: seq feature: print letters
Date: Mon, 30 Jun 2014 10:23:58 +0000
User-agent: Heirloom mailx 12.5 6/20/10

Hello,

I'd like to suggest a patch to allow seq to generate letter sequences.

With this patch, 'seq' can print letters of alphabets in the current locale
(or user-specified language). Examples:

    # print all letters in the current alphabet
    seq --alphabet
    seq -a
    # print the first 10 letters in the current alphabet
    seq -a 10
    # print the fifth to tenth letters of the current alphabet
    seq -a 5 10
    # print the letters of the Russian alphabet
    # (assuming the locale is installed)
    LC_ALL=ru_RU.utf-8 seq -a
    # print the letters of the hebrew alphabet
    # (assuming the current locale supports UTF-8 or
    #  other encoding supported by gnulib/libunistring)
    seq --alphabet=he


More details follow:

This has been suggested before, and there were several hurdles:
1. How to handle non-C locales (with letters beyond 7-bit ASCII).
2. How to handle EBCDIC (or other encodings where the letters are not
sequential in their ordinal values).
3. How to handle input letters (e.g. seq from "à" to "ö").
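
To make hurdle 2 concrete, here is a small Python sketch (not part of the
patch) that lists where consecutive letters stop having consecutive code
values in a given encoding. The helper name code_gaps is hypothetical, and
cp037 is used only as one readily available EBCDIC code page:

```python
def code_gaps(encoding, letters="abcdefghijklmnopqrstuvwxyz"):
    """Return pairs of adjacent letters whose byte values are not adjacent.

    Assumes a single-byte encoding, so only the first byte is compared.
    """
    codes = [ch.encode(encoding)[0] for ch in letters]
    return [
        (letters[i], letters[i + 1])
        for i in range(len(codes) - 1)
        if codes[i + 1] - codes[i] != 1
    ]


if __name__ == "__main__":
    print("ascii:", code_gaps("ascii"))
    print("cp037:", code_gaps("cp037"))
```

In ASCII the list is empty; in cp037 the letters break into three separate
runs, so "start code + N" does not reliably give the Nth letter.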

I believe the following patch can address these issues.

1. 'seq' in alphabet mode will deal with defined sequences of letters in each
language:
Instead of dealing with numeric codes of letters (e.g. from 65=A to 69=E),
it deals with positions in an alphabet: the first letter in language EN
through the fifth letter in language EN (or in any other language).

2. Unicode/CLDR already maintains a list of official letters of the alphabet 
for each language. Note that the list is not the same as "isalpha()": it only 
contains the list of letters in the official alphabet.

Example:
In English/EN, the list contains A-Z, as expected.
In French/FR, the list still contains just A-Z - those are the official letters
in the French alphabet, while letters with an acute accent, grave accent or
circumflex (é è à â etc.) are treated as letters with diacritics, not as
stand-alone letters.
In Swedish/SV, the list contains A-Z plus å ä ö, while à é are treated as
letters with diacritics.
Similar lists are maintained for each language in the Unicode database.

3. Internally, seq will store the list of letters for each language as UTF-8 -
this will avoid ambiguity, and gnulib's functions will provide conversion to
the current encoding.
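
Points 1-3 above can be sketched roughly as follows. This is Python for
illustration only, not the C code in the patch; the ALPHABETS table, the
function names, and the choice of lowercase letters are all assumptions made
for this example (the real data lives in './src/alphabet.c'):

```python
# Hypothetical per-language tables, stored as Unicode text and indexed by
# position rather than by code value. Only two languages are shown.
ALPHABETS = {
    "en": "abcdefghijklmnopqrstuvwxyz",
    "sv": "abcdefghijklmnopqrstuvwxyz" + "\u00e5\u00e4\u00f6",  # å ä ö
}


def letters(lang, first=1, last=None):
    """Return letters number first..last (1-based, inclusive) of an alphabet."""
    alphabet = ALPHABETS[lang]
    if last is None:
        last = len(alphabet)
    if first < 1 or last > len(alphabet):
        raise ValueError("position outside the alphabet")
    return list(alphabet[first - 1:last])


def encode_letters(seq, encoding):
    """Convert each letter to the output encoding (the patch uses gnulib here)."""
    return [ch.encode(encoding) for ch in seq]


if __name__ == "__main__":
    print(" ".join(letters("en", 5, 10)))
    print(encode_letters(letters("sv", 27), "iso-8859-1"))
```

Because selection happens by position in the table, the same request works
whether the output encoding is UTF-8, ISO-8859-1, or something else; only
the final conversion step differs.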

If this approach is acceptable, then we can plan for further features, such as:

4. Allow multi-character output, e.g., with English, after "z" wrap to "aa",
"ab", etc.

5. Allow specifying start/end with letters instead of numbers (e.g. "seq --abc
é z"), and apply collating rules to find which character in the alphabet to
start from.

6. The language database (in './src/alphabet.c') is not perfect. It was
automatically generated by extracting information from the Unicode/CLDR XML
files. For some languages there are obvious errors (such as characters
incorrectly converted from designations such as "\u093C"). For other
languages, the code used by Unicode is not necessarily compatible with the
locale name. But for most languages I believe the information is valid, and
for the few incorrect definitions, I think they could be easily fixed by
manual inspection.
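
For items 4 and 5, one possible approach is sketched below in Python; it is
an illustration, not something the patch implements. nth_label uses bijective
base-N numbering to wrap from "z" to "aa". position_of replaces real locale
collation with a much simpler stand-in: it strips combining marks via Unicode
NFD decomposition and looks up the base letter. Both function names are
hypothetical.

```python
import unicodedata

EN = "abcdefghijklmnopqrstuvwxyz"
SV = EN + "\u00e5\u00e4\u00f6"  # å ä ö are stand-alone letters in Swedish


def nth_label(n, alphabet=EN):
    """Return the nth label (1-based): a..z, then aa, ab, and so on."""
    if n < 1:
        raise ValueError("n must be >= 1")
    base = len(alphabet)
    out = []
    while n > 0:
        # Subtracting 1 first makes the numbering bijective (no zero digit).
        n, r = divmod(n - 1, base)
        out.append(alphabet[r])
    return "".join(reversed(out))


def position_of(letter, alphabet):
    """Map an input letter to a 1-based position in the alphabet.

    Simplification: accented letters that are not in the alphabet fall back
    to their base letter. Real collation rules are more involved.
    """
    letter = unicodedata.normalize("NFC", letter)
    if letter in alphabet:
        return alphabet.index(letter) + 1
    decomposed = unicodedata.normalize("NFD", letter)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    if len(stripped) == 1 and stripped in alphabet:
        return alphabet.index(stripped) + 1
    raise ValueError("letter not in alphabet")


if __name__ == "__main__":
    print([nth_label(i) for i in (1, 26, 27, 28)])
    print(position_of("\u00e9", EN), position_of("\u00f6", SV))
```

With these tables, "é" resolves to the position of "e" in English, while "ö"
resolves to its own position in Swedish.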

Comments are welcome,
  - Gordon

P.S.
Adding a few gnulib modules to 'bootstrap.conf' uncovered possible minor issues:
1. string_hash() in 'localename.c' requires _GL_ATTRIBUTE_PURE
2. gl_locale_name_default() in 'localename.c' requires _GL_ATTRIBUTE_CONST
3. striconveh.c triggers the following warning:
    CC     lib/striconveh.o
    In file included from lib/striconveh.c:34:0:
    lib/c-strcaseeq.h: In function 'strcaseeq7.constprop.10':
    lib/c-strcaseeq.h:47:23: error: array subscript is above array bounds [-Werror=array-bounds]
    cc1: all warnings being treated as errors

Attachment: seq_letters_2014_06_30.patch.xz
Description: Binary data

