bug#24603: [RFC 08/18] Support casing characters which map into multiple
From: Eli Zaretskii
Subject: bug#24603: [RFC 08/18] Support casing characters which map into multiple code points
Date: Fri, 07 Oct 2016 10:46:08 +0300
> From: Michal Nazarewicz <mina86@mina86.com>
> Cc: 24603@debbugs.gnu.org
> Date: Thu, 06 Oct 2016 23:40:11 +0200
>
> >> +#include "special-casing.h"
> >
> > Why not a shorter 'casing.h'?
>
> It includes data from SpecialCasing.txt only so I figured
> ‘special-casing.h’ would be a more descriptive name. I can change it to
> ‘casing.h’ if you prefer.
Shorter names are easier to deal with. Also, the "special" part might
beg the question: where's the "normal" part. But it's a minor nit,
admittedly. If you feel strongly about your name, I won't fight that.
> > Once again, this stores the casing rules in C, whereas I'd prefer to
> > have them in tables accessible from Lisp.
>
> There are a few reasons to hard-code the special casing rules in C.
>
> Some of them have conditions (those are implemented in later patches)
> which are non-trivial to encode in Lisp. Some look backwards
> (e.g. After_Soft_Dotted) and some look forward (e.g. Not_Before_Dot) and
> not necessarily only one character forward (e.g. More_Above).
>
> By hard-coding the implementation, each of the predicates can be handled
> in a custom way such that the code only ever looks at current and one
> character forward. Not to mention that it is likely faster.
>
> Furthermore, by not having the data in Lisp I can make certain
> assumptions. For example that a single character will get changed into
> a sequence of at most six bytes. Having to deal with arbitrary data
> that the user may have put in the Lisp data would further complicate the
> code, and the flexibility is not worth it.
It doesn't have to be arbitrary Lisp data. It could be just a set of
flags stored in a Lisp structure whose implementation is in C.
It's IMO okay to have this hard-coded in C, if a Lisp based
implementation would be unreasonably complex and inelegant. But I
don't see that it would be, at least not yet; maybe I'm missing something. May I
suggest that you try designing this, and if it turns out to be too
cumbersome, come back with the evidence?
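To make the "set of flags" idea concrete, here is a minimal sketch in
plain C. It is not Emacs code: the names (casing_cond, casing_rule,
special_case) and the table layout are hypothetical, and only three
rules from SpecialCasing.txt are included. The caller is assumed to
have already computed which context conditions hold around the
character and passes them as a bit mask.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical context flags, named after SpecialCasing.txt conditions.  */
enum casing_cond
{
  COND_FINAL_SIGMA       = 1 << 0,
  COND_AFTER_SOFT_DOTTED = 1 << 1,
  COND_MORE_ABOVE        = 1 << 2,
  COND_NOT_BEFORE_DOT    = 1 << 3
};

enum casing_dir { CASE_UP, CASE_DOWN };

enum { MAX_MAPPING = 3 };	/* longest mapping this sketch allows */

/* One rule: FROM maps to TO[0..TO_LEN-1] when every flag in COND holds.  */
struct casing_rule
{
  uint32_t from;
  enum casing_dir dir;
  unsigned cond;
  unsigned char to_len;
  uint32_t to[MAX_MAPPING];
};

static const struct casing_rule rules[] = {
  { 0x00DF, CASE_UP,   0,                2, { 0x0053, 0x0053 } }, /* ß -> SS */
  { 0xFB01, CASE_UP,   0,                2, { 0x0046, 0x0049 } }, /* ﬁ -> FI */
  { 0x03A3, CASE_DOWN, COND_FINAL_SIGMA, 1, { 0x03C2 } },         /* Σ -> ς */
};

/* Store the special mapping of CH in OUT (room for MAX_MAPPING code
   points) and return its length, or return 0 if no rule applies in
   context CTX, in which case the caller falls back to the ordinary
   one-to-one case tables.  */
size_t
special_case (uint32_t ch, enum casing_dir dir, unsigned ctx, uint32_t *out)
{
  assert (out != NULL);
  for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++)
    {
      const struct casing_rule *r = &rules[i];
      if (r->from == ch && r->dir == dir && (r->cond & ctx) == r->cond)
        {
          for (size_t j = 0; j < r->to_len; j++)
            out[j] = r->to[j];
          return r->to_len;
        }
    }
  return 0;
}
```

In this sketch the table itself is pure data, so it could in principle
be generated or exposed to Lisp, while the code that decides which
flags hold for a given position stays in C.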
> There is also the aspect that not all of the language-dependent rules
> implemented in this patchset are part of Unicode. Dutch IJ (when
> spelled as separate ASCII characters) is not covered by
> SpecialCasing.txt.
The way we deal with such augmentations is by having most of the data
auto-generated, and some of it maintained manually. One example is
the current characters.el and charscript.el it loads. Can we use a
similar approach in this case? Experience shows that maintaining
everything manually is error-prone and a huge maintenance headache in
the long run, what with a new version of the Unicode Standard
available at least once a year.
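As a rough illustration of that split, the sketch below keeps two
tables: one standing in for data generated from SpecialCasing.txt and
one for hand-maintained additions such as the Dutch IJ rule, with the
manual table consulted first. It is plain C, not Emacs code; the
struct, the table names and title_prefix are all hypothetical, strings
are assumed to be UTF-8, and only title-casing of a word's first
characters is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Title-case rule: in language LANG ("" means any language), a word
   starting with FROM gets that prefix replaced by TO.  */
struct title_rule
{
  const char *lang;
  const char *from;
  const char *to;
};

/* Stand-in for a table generated from SpecialCasing.txt.  */
static const struct title_rule generated_rules[] = {
  { "", "\xC3\x9F", "Ss" },	/* U+00DF ß */
  { "", "\xEF\xAC\x81", "Fi" },	/* U+FB01 ﬁ */
};

/* Hand-maintained additions that are not part of the Unicode data.  */
static const struct title_rule manual_rules[] = {
  { "nl", "ij", "IJ" },		/* Dutch IJ spelled as two ASCII letters */
};

static const struct title_rule *
find_rule (const struct title_rule *tab, size_t n,
           const char *lang, const char *word)
{
  for (size_t i = 0; i < n; i++)
    {
      bool lang_ok = tab[i].lang[0] == '\0' || strcmp (tab[i].lang, lang) == 0;
      if (lang_ok && strncmp (word, tab[i].from, strlen (tab[i].from)) == 0)
        return &tab[i];
    }
  return NULL;
}

/* Return the title-case replacement for the start of WORD and store in
   *CONSUMED how many bytes of WORD it replaces.  Return NULL (and set
   *CONSUMED to 0) when neither table has a rule, meaning the ordinary
   one-to-one mapping should be used.  */
const char *
title_prefix (const char *lang, const char *word, size_t *consumed)
{
  assert (lang != NULL && word != NULL && consumed != NULL);
  const struct title_rule *r
    = find_rule (manual_rules, sizeof manual_rules / sizeof manual_rules[0],
                 lang, word);
  if (r == NULL)
    r = find_rule (generated_rules,
                   sizeof generated_rules / sizeof generated_rules[0],
                   lang, word);
  *consumed = r ? strlen (r->from) : 0;
  return r ? r->to : NULL;
}
```

With this layout, a new Unicode release would only require regenerating
the first table; the manual table is left alone.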
> >> @@ -194,7 +276,9 @@ casify_object (enum case_action flag, Lisp_Object obj)
> >> DEFUN ("upcase", Fupcase, Supcase, 1, 1, 0,
> >> doc: /* Convert argument to upper case and return that.
> >> The argument may be a character or string. The result has the same type.
> >> -The argument object is not altered--the value is a copy.
> >> +The argument object is not altered--the value is a copy. If argument
> >> +is a character, characters which map to multiple code points when
> >> +cased, e.g. ﬁ, are returned unchanged.
> >> See also `capitalize', `downcase' and `upcase-initials'. */)
> >
> > I think this doc string should say what to do if the application wants
> > to convert ﬁ into "FI".
>
> Perhaps it would be better to describe it in Info page and link that
> from the docstrings?
Fine with me.
Thanks.