bug#24603: [RFC 08/18] Support casing characters which map into multiple

bug-gnu-emacs

From:	Eli Zaretskii
Subject:	bug#24603: [RFC 08/18] Support casing characters which map into multiple code points
Date:	Fri, 07 Oct 2016 10:46:08 +0300

> From: Michal Nazarewicz <mina86@mina86.com>
> Cc: 24603@debbugs.gnu.org
> Date: Thu, 06 Oct 2016 23:40:11 +0200
> 
> >> +#include "special-casing.h"
> >
> > Why not a shorter 'casing.h'?
> 
> It includes data from SpecialCasing.txt only so I figured
> ‘special-casing.h’ would be a more descriptive name.  I can change it to
> ‘casing.h’ if you prefer.

Shorter names are easier to deal with.  Also, the "special" part might
raise the question: where's the "normal" part?  But it's a minor nit,
admittedly.  If you feel strongly about your name, I won't fight that.

> > Once again, this stores the casing rules in C, whereas I'd prefer to
> > have them in tables accessible from Lisp.
> 
> There are a few reasons to hard-code the special casing rules in C.
> 
> Some of them have conditions (those are implemented in later patches)
> which are non-trivial to encode in Lisp.  Some look backwards
> (e.g. After_Soft_Dotted) and some look forward (e.g. Not_Before_Dot) and
> not necessarily only one character forward (e.g. More_Above).
> 
> By hard-coding the implementation, each of the predicates can be handled
> in a custom way such that the code only ever looks at current and one
> character forward.  Not to mention that it is likely faster.
> 
> Furthermore, by not having the data in Lisp I can make certain
> assumptions.  For example that a single character will get changed into
> a sequence of at most six bytes.  Having to deal with arbitrary data
> that the user may have put in the Lisp data would further complicate the
> code, and the flexibility is not worth it.

It doesn't have to be arbitrary Lisp data.  It could be just a set of
flags stored in a Lisp structure whose implementation is in C.

It's IMO okay to have this hard-coded in C, if a Lisp based
implementation would be unreasonably complex and inelegant.  But I
don't yet see why it should be; maybe I'm missing something.  May I
suggest that you try designing this, and if it turns out to be too
cumbersome, come back with the evidence?
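
To make the "set of flags" idea above a bit more concrete, here is a
minimal sketch in C.  It is only an illustration, not code from the
patch set: the names casing_cond, casing_action, casing_rule,
casing_rules and find_casing_rule are all hypothetical, and the sketch
assumes the caller has already worked out which context conditions hold
at the current position.  The three table rows follow the corresponding
SpecialCasing.txt entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits, one per context condition a rule may require.  */
enum casing_cond
{
  COND_FINAL_SIGMA       = 1 << 0,
  COND_AFTER_SOFT_DOTTED = 1 << 1,
  COND_MORE_ABOVE        = 1 << 2,
  COND_NOT_BEFORE_DOT    = 1 << 3,
  COND_AFTER_I           = 1 << 4
};

enum casing_action { CASE_LOWER, CASE_UPPER };

/* One rule: a source code point, the action it applies to, the
   conditions it requires, and up to three replacement code points.  */
struct casing_rule
{
  uint32_t from;
  enum casing_action action;
  unsigned cond;       /* bitwise OR of casing_cond values; 0 = always */
  unsigned char len;   /* number of valid entries in TO */
  uint32_t to[3];
};

/* Illustrative rows only; a real table would be generated.  */
static const struct casing_rule casing_rules[] = {
  { 0x03A3, CASE_LOWER, COND_FINAL_SIGMA, 1, { 0x03C2 } },    /* final sigma */
  { 0x00DF, CASE_UPPER, 0, 2, { 0x0053, 0x0053 } },           /* sharp s -> SS */
  { 0xFB01, CASE_UPPER, 0, 2, { 0x0046, 0x0049 } },           /* fi ligature -> FI */
};

/* A rule applies when every condition it requires is set in CONTEXT.  */
static int
rule_applies (const struct casing_rule *rule, unsigned context)
{
  assert (rule != NULL);
  return (rule->cond & context) == rule->cond;
}

/* Return the first rule for C and ACTION whose conditions hold in
   CONTEXT, or NULL if the ordinary one-to-one mapping should be used.  */
const struct casing_rule *
find_casing_rule (uint32_t c, enum casing_action action, unsigned context)
{
  size_t n = sizeof casing_rules / sizeof casing_rules[0];
  for (size_t i = 0; i < n; i++)
    if (casing_rules[i].from == c
        && casing_rules[i].action == action
        && rule_applies (&casing_rules[i], context))
      return &casing_rules[i];
  return NULL;
}
```

With data in this shape, the condition predicates themselves can stay in
C while the table contents come from somewhere else; whether that is
simpler than hard-coding is exactly the open question here.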

> There is also the aspect that not all of the language-dependent rules
> implemented in this patchset are part of Unicode.  Dutch IJ (when
> spelled as separate ASCII characters) is not covered by
> SpecialCasing.txt.

The way we deal with such augmentations is by having most of the data
auto-generated, and some of it maintained manually.  One example is
the current characters.el and charscript.el it loads.  Can we use a
similar approach in this case?  Experience shows that maintaining
everything manually is error-prone and a huge maintenance headache in
the long run, what with a new version of the Unicode Standard
available at least once a year.
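
As a rough sketch of how the "mostly generated, partly manual" split
described above could look on the C side, the lookup below consults a
small hand-maintained table before falling back to a generated one.
This is an assumption-laden illustration, not the layout of
characters.el or of the patch: title_rule, generated_rules,
manual_rules and lookup_title are hypothetical names, and the strings
are UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* A title-casing rule keyed by an optional language and a source string.  */
struct title_rule
{
  const char *lang;   /* NULL: applies to every language */
  const char *from;   /* UTF-8 source sequence */
  const char *to;     /* UTF-8 replacement */
};

/* Stand-in for data generated from SpecialCasing.txt.  */
static const struct title_rule generated_rules[] = {
  { NULL, "\xEF\xAC\x81", "Fi" },   /* U+FB01, fi ligature */
  { NULL, "\xC3\x9F", "Ss" },       /* U+00DF, sharp s */
};

/* Stand-in for hand-maintained additions that Unicode does not cover.  */
static const struct title_rule manual_rules[] = {
  { "nl", "ij", "IJ" },             /* Dutch IJ spelled with ASCII letters */
};

static const char *
search_rules (const struct title_rule *rules, size_t n,
              const char *lang, const char *from)
{
  for (size_t i = 0; i < n; i++)
    {
      int lang_ok = (rules[i].lang == NULL
                     || (lang != NULL && strcmp (rules[i].lang, lang) == 0));
      if (lang_ok && strcmp (rules[i].from, from) == 0)
        return rules[i].to;
    }
  return NULL;
}

/* Manual entries win over generated ones; NULL means "no special rule".  */
const char *
lookup_title (const char *lang, const char *from)
{
  const char *to
    = search_rules (manual_rules,
                    sizeof manual_rules / sizeof manual_rules[0], lang, from);
  if (to == NULL)
    to = search_rules (generated_rules,
                       sizeof generated_rules / sizeof generated_rules[0],
                       lang, from);
  return to;
}
```

Keeping the two tables separate means a new Unicode release only
requires regenerating the first one, while the manual additions stay
untouched.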

> >> @@ -194,7 +276,9 @@ casify_object (enum case_action flag, Lisp_Object obj)
> >>  DEFUN ("upcase", Fupcase, Supcase, 1, 1, 0,
> >>         doc: /* Convert argument to upper case and return that.
> >>  The argument may be a character or string.  The result has the same type.
> >> -The argument object is not altered--the value is a copy.
> >> +The argument object is not altered--the value is a copy.  If argument
> >> +is a character, characters which map to multiple code points when
> >> +cased, e.g. ﬁ, are returned unchanged.
> >>  See also `capitalize', `downcase' and `upcase-initials'.  */)
> >
> > I think this doc string should say what to do if the application wants
> > to convert ﬁ into "FI".
> 
> Perhaps it would be better to describe it in Info page and link that
> from the docstrings?

Fine with me.

Thanks.
