coreutils

Re: seq feature: print letters


From: Pádraig Brady
Subject: Re: seq feature: print letters
Date: Mon, 30 Jun 2014 12:24:33 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2

On 06/30/2014 11:23 AM, address@hidden wrote:
> Hello,
> 
> I'd like to suggest a patch to allow seq to generate letter sequences.
> 
> With this patch, 'seq' can print letters of alphabets in the current locale
> (or user-specified language). Examples:
> 
>     # print all letters in the current alphabet
>     seq --alphabet
>     seq -a
>     # print the first 10 letters in the current alphabet
>     seq -a 10
>     # print the fifth to tenth letters of the current alphabet
>     seq -a 5 10
>     # print the letters of the Russian alphabet
>     # (assuming the locale is installed)
>     LC_ALL=ru_RU.utf-8 seq -a
>     # print the letters of the hebrew alphabet
>     # (assuming the current locale supports UTF-8 or
>     #  other encoding supported by gnulib/libunistring)
>     seq --alphabet=he
> 
> 
> More details follow:
> 
> This has been suggested before, and there were several hurdles:
> 1. How to handle non C locales (with letters beyond the 7-bit ASCII)
> 2. How to handle EBCDIC (or other standards where the letters are not 
> sequential in their ordinal values)
> 3. How to handle input letters (eg seq from "à" to "ö").
> 
> I believe the following patch can address these issues.
> 
> 1. "Seq" in alphabet mode will deal with defined sequences of letters in each 
> language:
> Instead of dealing with numeric codes of letters (e.g. from 65=A to 69=E),
> it deals with the first letter in language EN to the fifth letter in
> language EN (or any other language).
> 
> 2. Unicode/CLDR already maintains a list of official letters of the alphabet 
> for each language. Note that the list is not the same as "isalpha()": it only 
> contains the list of letters in the official alphabet.
> 
> Example:
> In English/EN, the list contains A-Z, as expected.
> In French/FR, the list still contains just A-Z - those are the official 
> letters in the French alphabet, while acute accent, grave accent and 
> circumflex letters ( é è à â etc) are only considered diacritics, not 
> stand-alone letters.
> In Swedish/SV, the list contains A-Z plus å ä ö, while à é are considered 
> diacritics.
> Similar lists are maintained for each language in the Unicode database.
> 
> 3. Internally, seq will store the list of letters for each language as UTF-8 
> - this will avoid ambiguity, and gnulib's functions will provide conversion
> to the current encoding.
> 
> If this approach is acceptable, then we can plan for further features, such 
> as:
> 
> 4. Allow multi-character output, eg, with English, after "z" wrap to "aa", 
> "ab", etc.
> 
> 5.  Allow specifying start/end with letters instead of numbers ( eg "seq 
> --abc é z"), and apply collating rules to find which character in the 
> alphabet to start from.
> 
> 6. The language database (in './src/alphabet.c') is not perfect. It was 
> automatically generated by extracting information from the Unicode/CLDR XML 
> files. For some languages there are obvious errors (such as characters 
> incorrectly converted from designations such as "\u093C"). For other
> languages, the code used by Unicode is not necessarily compatible with the
> locale name. But for most languages I believe the information is valid, and
> for the few incorrect definitions, I think they could be easily fixed by
> manual inspection.
> 
> Comments are welcomed,
>   - Gordon
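
For readers following the thread, the position-based selection described in
point 1 of the quoted proposal can be sketched roughly as below. This is only
an illustration in Python, not the patch itself: the `ALPHABETS` table and the
`letter_seq` function are hypothetical names, and the two alphabets are
hard-coded from the examples in the message rather than generated from CLDR.

```python
# Hypothetical per-language tables (assumption: the real patch keeps its
# generated data in src/alphabet.c; these two entries follow the examples
# given in the proposal).
ALPHABETS = {
    "en": "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "sv": "ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ",
}


def letter_seq(lang, first=None, last=None):
    """Return letters by 1-based position, mirroring seq's argument forms."""
    letters = ALPHABETS[lang]
    if first is None and last is None:
        # Like "seq -a": the whole alphabet.
        first, last = 1, len(letters)
    elif last is None:
        # Like "seq -a N": the first N letters.
        first, last = 1, first
    if first < 1 or last > len(letters):
        raise ValueError("position outside the alphabet")
    # Positions, not character codes, select the letters, so the result
    # does not depend on how the encoding orders them.
    return list(letters[first - 1:last])


print(" ".join(letter_seq("en", 5, 10)))
```

Because the lookup goes through positions in a per-language list, the same
call works whether or not the letters are contiguous in the underlying
character set.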

I like it!
The interface is concise and fits seq well.
I see the jot util has similar functionality, confirming the usefulness.
I notice about 45 copies of the A-Z alphabet; would it be worth introducing
aliases to avoid the copies?
What about case? The current code only has upper case. Case is a can of
worms, I know, with not necessarily a 1:1 mapping etc.
The data being leveraged is well defined, and at present is reasonable to
include directly in the seq binary (about 12K I'm guessing),
though have you looked at whether libunistring contains the appropriate
data/logic for this?
This might be more significant if case or more characters were considered,
for example.
I had a quick look at the CLDR. Are you only considering the "Index exemplar"
chars here?
  http://www.unicode.org/cldr/charts/25/by_type/core_data.alphabetic_information.index.html
Maybe it would be better to default to the "standard exemplars"?
  http://cldr.unicode.org/translation/characters#TOC-Exemplar-Characters
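
To illustrate the alias idea, here is a rough sketch in Python rather than the
C of the patch. The names `ALPHABET_DATA`, `ALIASES` and `alphabet_for`, and
the "latin-az" key, are made up for the example; the only facts taken from the
proposal are that EN and FR share plain A-Z and that SV adds three letters.

```python
LATIN_AZ = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Each distinct alphabet is stored exactly once (hypothetical layout).
ALPHABET_DATA = {
    "latin-az": LATIN_AZ,
    "sv": LATIN_AZ + "ÅÄÖ",
}

# Languages whose alphabet is identical to an existing entry only need a
# small alias record instead of a full copy of the letters.
ALIASES = {
    "en": "latin-az",
    "fr": "latin-az",
}


def alphabet_for(lang):
    """Resolve an optional alias, then return the stored letter list."""
    return ALPHABET_DATA[ALIASES.get(lang, lang)]


print(alphabet_for("fr"))
print(alphabet_for("sv")[-3:])
```

With a layout along these lines the many A-Z languages would cost one short
alias entry each rather than a repeated 26-letter table.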

thanks!
Pádraig



