coreutils

Re: seq feature: print letters


From: Assaf Gordon
Subject: Re: seq feature: print letters
Date: Mon, 30 Jun 2014 13:39:52 -0600

On Jun 30, 2014, at 5:24, Pádraig Brady <address@hidden> wrote:

> On 06/30/2014 11:23 AM, address@hidden wrote:
>> I'd like to suggest a patch to allow seq to generate letter sequences.
> I notice about 45 copies of the A-Z alphabet, would it be worth introducing 
> aliases to avoid copies?

Yes, we can consolidate them.

> What about case. The current code only has upper case. case is a can of worms 
> I know, with not necessarily 1:1 mapping etc.

Once we leave the realm of Latin-script languages, upper/lower case indeed 
becomes very complicated, or even meaningless. I thought that 
'tr [:upper:] [:lower:]' would handle it better (but I now realize tr doesn't 
support UTF-8 well, if I understand correctly).

I think that for the first step, we should not deal with upper/lower case 
issues.

> The data being leveraged is well defined at present reasonable to include 
> directly in the seq binary (about 12K I'm guessing),
> though have you looked at whether libunistring contains the appropriate 
> data/logic for this?
> This might be more significant if case or more characters were considered for 
> example.

This first draft stores a UTF-8 string (with a terminating NUL) for each 
character.  I saw that the libunistring code stores bit-fields for some of its 
functions, though I haven't studied it yet.
I will try to improve the storage method in following patches.
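
To get a rough feel for what this storage costs, here is a small shell sketch 
that adds up the bytes needed when every letter is kept as its own 
NUL-terminated UTF-8 string. The helper name table_bytes is made up for this 
illustration and is not part of the patch, and the letter lists are typed in 
by hand.

```shell
#!/bin/sh
# Hypothetical helper (not part of the seq patch): total bytes needed to
# store each argument as a separate NUL-terminated UTF-8 string.
table_bytes() {
    total=0
    for ch in "$@"; do
        # wc -c counts bytes (not characters), independent of the locale.
        n=$(( $(printf '%s' "$ch" | wc -c) ))
        total=$((total + n + 1))    # +1 for the terminating NUL
    done
    echo "$total"
}

latin_bytes=$(table_bytes A B C D E F G H I J K L M N O P Q R S T U V W X Y Z)
hebrew_bytes=$(table_bytes א ב ג ד ה ו ז ח ט י כ ל מ נ ס ע פ צ ק ר ש ת)

echo "Latin A-Z: $latin_bytes bytes"
echo "Hebrew:    $hebrew_bytes bytes"
```

Letters that need two bytes in UTF-8 cost three bytes each this way, so the 
per-string NUL is a noticeable share of the table.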

> I had a quick look at the CLDR. Are you only considering the "Index exemplar" 
> chars here?
> http://www.unicode.org/cldr/charts/25/by_type/core_data.alphabetic_information.index.html

Exactly.

> Maybe it would be better to default to the "standard exemplars"?
> http://cldr.unicode.org/translation/characters#TOC-Exemplar-Characters

The reason I liked the "index" list is that it most directly answers the 
question "what is the alphabet in language X?" (as in, what are the letters 
that would be taught in schools as "the alphabet", or that a person on the 
street would list if asked for the alphabet).
It also lends itself to one-liners such as:
    # How many letters are in the Arabic alphabet:
    seq --alphabet=ar | wc -l
    # What is the eleventh letter in the Russian alphabet:
    seq --alphabet=ru | awk 'NR==11'

Technically, the functionality of "is_alpha()" does not correspond 1:1 to "the 
alphabet", which is part of the problem... In English, there are no 
complications, but in many other languages it becomes complicated.

Using other Unicode categories (e.g. the 'main' letters or even 'auxiliary' 
letters) answers a slightly different question, more akin to "what symbols are 
acceptable in language X?" - not a bad question, just different from the 
previous one.

For example, in Hebrew the "index" list contains 22 letters (which agrees with 
the answer to "how many letters are in the Hebrew alphabet"), but the 
"main/standard" list has 5 more symbols: 5 Hebrew letters have a specific 
"final" form (used when the letter appears at the end of a word).
So using the "main" list would list 5 letters twice. I believe other languages 
such as Arabic would present similar issues.
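
A quick way to see the difference in counts, as a sketch only: the two lists 
below are typed in by hand rather than produced by seq, and the final forms 
are simply appended instead of being placed in their usual order.

```shell
#!/bin/sh
# Letter lists typed in by hand (an assumption of this sketch; the patch
# would carry its own tables).
index="א ב ג ד ה ו ז ח ט י כ ל מ נ ס ע פ צ ק ר ש ת"
finals="ך ם ן ף ץ"

# Unquoted expansion splits on spaces, so printf emits one letter per line.
index_count=$(( $(printf '%s\n' $index | wc -l) ))
main_count=$(( $(printf '%s\n' $index $finals | wc -l) ))

echo "index list:         $index_count letters"
echo "main/standard list: $main_count letters"
```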

From a technical point of view, it's easy to include both "index" and 
"standard" letters (with different command-line options); it's just a matter of 
adding more lists.

What do you think?

Thanks,
 -Gordon



