bug#24603: [RFC 08/18] Support casing characters which map into multiple code points
Thu, 06 Oct 2016 23:40:11 +0200
On Tue, Oct 04 2016, Eli Zaretskii wrote:
>> From: Michal Nazarewicz <address@hidden>
>> Date: Tue, 4 Oct 2016 03:10:31 +0200
>> * src/make-special-casing.py: New script to generate special-casing.h
>> file from the SpecialCasing.txt data file.
> Please do this without Python, either in Emacs Lisp and/or the tools
> already used in admin/unidata, including awk. Python is still not
> available as widely as the other tools.
>> +special-casing.h: make-special-casing.py ../admin/unidata/SpecialCasing.txt
>> + $(AM_V_GEN)
>> + python $^ $@
> Don't use a literal name of a program, so users could specify their
> name and/or absolute file name at build time. See what we do with
> awk, for example.
>> +#include "special-casing.h"
> Why not a shorter 'casing.h'?
It includes data from SpecialCasing.txt only so I figured
‘special-casing.h’ would be a more descriptive name. I can change it to
‘casing.h’ if you prefer.
> Once again, this stores the casing rules in C, whereas I'd prefer to
> have them in tables accessible from Lisp.
There are a few reasons to hard-code the special casing rules in C.
Some of them have conditions (those are implemented in later patches)
which are non-trivial to encode in Lisp. Some look backwards
(e.g. After_Soft_Dotted) and some look forward (e.g. Not_Before_Dot) and
not necessarily only one character forward (e.g. More_Above).
By hard-coding the implementation, each of the predicates can be handled
in a custom way such that the code only ever looks at the current
character and one character forward. Not to mention that it is likely
faster.
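To illustrate the idea (this is a rough sketch, not code from the
patch), here is how the Turkic lowercasing of ‘I’ could be hard-coded
so that it inspects only the current and the next code point. The
function name and the array-based interface are made up for the
example, and it simplifies Not_Before_Dot by checking only the
immediately following code point:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the patch.  Lowercase the N code
   points in SRC into DST using the Turkic rules for 'I', looking only
   at the current and the next code point.  DST must have room for N
   code points.  Returns the number of code points written.
   Simplification: only the immediately following code point is
   checked for U+0307 COMBINING DOT ABOVE.  */
size_t
turkic_downcase (const uint32_t *src, size_t n, uint32_t *dst)
{
  size_t out = 0;
  for (size_t i = 0; i < n; i++)
    {
      uint32_t ch = src[i];
      if (ch == 0x0049)                 /* LATIN CAPITAL LETTER I */
        {
          if (i + 1 < n && src[i + 1] == 0x0307)
            {
              /* I + combining dot above: emit 'i' and drop the dot.  */
              dst[out++] = 0x0069;
              i++;
            }
          else
            dst[out++] = 0x0131;        /* dotless i */
        }
      else if (ch == 0x0130)            /* capital I with dot above */
        dst[out++] = 0x0069;
      else
        dst[out++] = ch;                /* left unchanged in this sketch */
    }
  return out;
}
```

Because the combining dot is consumed together with the preceding
‘I’, no backward-looking After_I check is needed in this form.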
Furthermore, by not having the data in Lisp I can make certain
assumptions, for example that a single character will get changed into
a sequence of at most six bytes. Having to deal with arbitrary data
that a user may have put in the Lisp data would further complicate the
code, and I am not sure the flexibility is worth it.
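As a sketch of what that assumption buys (again, not the code from the
patch; the names are invented and plain UTF-8 stands in for Emacs'
internal encoding), the output for one character can live in a
fixed-size buffer, and a mapping that does not fit is simply rejected:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed bound on the encoded size of one cased character.  */
enum { MAX_CASED_BYTES = 6 };

/* Encode code point C as UTF-8 at P and return the number of bytes
   written (1 to 4).  Assumes C is a valid Unicode scalar value.  */
static size_t
encode_utf8 (uint32_t c, unsigned char *p)
{
  if (c < 0x80)
    {
      p[0] = (unsigned char) c;
      return 1;
    }
  if (c < 0x800)
    {
      p[0] = (unsigned char) (0xC0 | (c >> 6));
      p[1] = (unsigned char) (0x80 | (c & 0x3F));
      return 2;
    }
  if (c < 0x10000)
    {
      p[0] = (unsigned char) (0xE0 | (c >> 12));
      p[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
      p[2] = (unsigned char) (0x80 | (c & 0x3F));
      return 3;
    }
  p[0] = (unsigned char) (0xF0 | (c >> 18));
  p[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
  p[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
  p[3] = (unsigned char) (0x80 | (c & 0x3F));
  return 4;
}

/* Hypothetical helper: encode the N code points of a case mapping
   into BUF.  Returns the encoded length in bytes, or 0 if the mapping
   would not fit in MAX_CASED_BYTES.  */
size_t
encode_mapping (const uint32_t *cps, size_t n,
                unsigned char buf[MAX_CASED_BYTES])
{
  size_t len = 0;
  for (size_t i = 0; i < n; i++)
    {
      unsigned char tmp[4];
      size_t k = encode_utf8 (cps[i], tmp);
      if (len + k > MAX_CASED_BYTES)
        return 0;
      memcpy (buf + len, tmp, k);
      len += k;
    }
  return len;
}
```

For instance, the upper-case form of U+0390, which is the three code
points U+0399 U+0308 U+0301, takes exactly six bytes in UTF-8.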
There is also the aspect that not all of the language-dependent rules
implemented in this patchset are part of Unicode. Dutch IJ (when
spelled as separate ASCII characters) is not covered by
SpecialCasing.txt. Similarly, I might also get around to implementing
Irish rules¹. Mixing information from SpecialCasing.txt and other
sources feels a bit messy.
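For the Dutch case, a hard-coded rule could look roughly like the
following. This is only an illustration with a made-up function name
and an ASCII-only interface, not what the patchset does:

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: capitalize the ASCII string WORD in place.
   When DUTCH is nonzero, a leading "ij" is treated as a single
   letter and becomes "IJ".  */
void
capitalize_word (char *word, int dutch)
{
  if (word[0] == '\0')
    return;

  int leading_i = word[0] == 'i' || word[0] == 'I';
  word[0] = (char) toupper ((unsigned char) word[0]);

  size_t rest = 1;
  if (dutch && leading_i && (word[1] == 'j' || word[1] == 'J'))
    {
      word[1] = 'J';
      rest = 2;
    }

  /* Everything after the initial letter (or digraph) is lowercased.  */
  for (size_t i = rest; word[i] != '\0'; i++)
    word[i] = (char) tolower ((unsigned char) word[i]);
}
```

With the Dutch flag set, "ijsselmeer" becomes "IJsselmeer"; without it
the result is "Ijsselmeer".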
>> @@ -194,7 +276,9 @@ casify_object (enum case_action flag, Lisp_Object obj)
>> DEFUN ("upcase", Fupcase, Supcase, 1, 1, 0,
>> doc: /* Convert argument to upper case and return that.
>> The argument may be a character or string. The result has the same type.
>> -The argument object is not altered--the value is a copy.
>> +The argument object is not altered--the value is a copy. If argument
>> +is a character, characters which map to multiple code points when
>> +cased, e.g. ﬁ, are returned unchanged.
>> See also `capitalize', `downcase' and `upcase-initials'. */)
> I think this doc string should say what to do if the application wants
> to convert ﬁ into "FI".
Perhaps it would be better to describe it in the Info manual and link
to that from the docstrings? The reason I’m suggesting that is that
there are 11 functions defined in src/casefiddle.c and a lot of
documentation like that (some of which is coming in future patches)
should be included in all of them. That would mean either repeating
the same thing over and over or linking to one particular function,
but then which one should be the special one? If all of this was moved
to an Info page and linked from the docstrings, the problem would go
away.
ミハウ “𝓶𝓲𝓷𝓪86” ナザレヴイツ
«If at first you don’t succeed, give up skydiving»