bug#37036: [PATCH] Inconsistent ASCII and Latin char categories

From: Eli Zaretskii
Subject: bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
Date: Fri, 16 Aug 2019 12:33:08 +0300

> From: Mattias Engdegård <address@hidden>
> Date: Fri, 16 Aug 2019 00:19:43 +0200
> Cc: address@hidden
>
> In any case, I wasn't aiming for perfection; that is indeed a fool's errand. 
> It was just a discovery of a rather obvious mistake, and evidence of code 
> that doesn't work properly because of it. I thought the patch would be rather 
> uncontroversial.

AFAIU, the patch excluded all the non-letter characters from the
Latin category, is that right?  If so, it's a pretty significant
change IMO; who knows what it could break, including outside of the
core Emacs.  The fact that the Latin category is not well defined
doesn't mean we are at liberty to change that (implied)
definition at will.  Categories are currently used for a small number
of core Emacs features, and AFAIR were created incrementally as the
ad-hoc need for each one of them arose, so we also risk breaking our
own code.  Do we really have a good reason to wake those sleeping
dogs?

> >> Consider the function fill-polish-nobreak-p. It is clearly written with 
> >> the assumption of a reasonable definition of the Latin category, and it 
> >> doesn't work as expected because of that.
> > 
> > Can you tell the details of where this function doesn't work?  I'd
> > like to understand why fixing it needs to change the categories.
>
> Right: it attempts to match a single-character word before point, with the 
> assumption that \cl would match any Latin(-script) letter. However, since 
> that expression matches most of ASCII as well, the function incorrectly says 
> that line-breaking would be disallowed after "In my dreams..." or "(She 
> smiles!)" or "He died in 1951." (well, the equivalents in Polish).
> Some details are in https://debbugs.gnu.org/cgi/bugreport.cgi?bug=20871 .

So you are saying that function fails to consider punctuation and
symbols that are part of the Latin blocks?  That just means it
shouldn't use \cl in the first place (and yes, my suggestion to use
that in the bug discussion was wrong, sorry), it should use the
general-category Unicode property to filter out punctuation
characters.  Or it could use explicit ranges of codepoints.  Or we
could extend [:punct:] to support non-ASCII punctuation in a more
meaningful way.  Either way, that's not a good enough reason to make
significant changes in how the categories are defined.  If any
extensions are needed, I'd rather we made them in more modern and less
ad-hoc features.
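
A minimal sketch of the general-category approach described above,
assuming the goal is to recognize a single Latin-script letter.  The
name my-latin-letter-p is hypothetical and not part of Emacs;
get-char-code-property and char-script-table are existing facilities.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun my-latin-letter-p (char)
  "Return t if CHAR is a letter belonging to the Latin script."
  (and (eq (aref char-script-table char) 'latin)
       ;; Lu, Ll, Lt, Lm and Lo are the Unicode letter categories;
       ;; punctuation (P*), symbols (S*) and digits (Nd) are rejected.
       (memq (get-char-code-property char 'general-category)
             '(Lu Ll Lt Lm Lo))
       t))
```

Under these assumptions, (my-latin-letter-p ?w) returns t, while
(my-latin-letter-p ?.) and (my-latin-letter-p ?1) return nil, because
their general categories are Po and Nd respectively.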

> The point is that if there is some code that doesn't work because of the 
> broken categories, there may very well be more.

This argument goes both ways: there could be code out there which
relies on the current "broken" definition of the Latin category.

> > I don't think we should fix those mistakes, because that's an
> > impossible goal.  We should instead gradually stop using categories
> > for anything serious, certainly for any new code.  We should use the
> > UCD properties and the various char-tables built upon that instead.
> Perhaps, but categories still have one thing going for them: they have fairly 
> good regexp support.

I think this is in many cases an illusory advantage: specifying \cFOO
in a regexp just makes the code access some char-table.  But the same
is true for get-char-code-property and for accessing char-script-table
from Lisp, to mention just two alternatives.  And we all know that
using regular expressions for solving a problem sometimes _adds_ a
problem instead of solving one.
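
As an illustration of that point, the following sketch collects what
the three lookups report for one character.  The function name
my-describe-char-tables is hypothetical; the calls it makes
(char-category-set, get-char-code-property and an aref on
char-script-table) are existing Emacs facilities.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun my-describe-char-tables (char)
  "Return a plist of category, general-category and script data for CHAR."
  (list
   ;; Non-nil if CHAR is in category `l' (Latin), the same data that
   ;; \cl consults in a regexp.
   :latin-category (aref (char-category-set char) ?l)
   ;; Unicode general category as a symbol, e.g. Ll or Po.
   :general-category (get-char-code-property char 'general-category)
   ;; Script symbol, e.g. latin or greek.
   :script (aref char-script-table char)))
```

Evaluating (my-describe-char-tables ?.) shows side by side how the
category and the Unicode properties classify a punctuation character;
the :latin-category value depends on how the running Emacs defines
that category.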

If we have some functionality in regular expressions that's supported
by categories, but is unavailable or inconvenient with Unicode
properties, I'd rather we extended our regex engine to support the
likes of \p{Po} and \p{script=greek}, see
http://unicode.org/reports/tr18/, instead of wasting our resources on
"fixing" the categories.
