Re: Regexp capturing unicode characters

help-gnu-emacs

From:	Eli Zaretskii
Subject:	Re: Regexp capturing unicode characters
Date:	Thu, 01 Aug 2024 18:34:15 +0300

> Date: Thu, 01 Aug 2024 13:43:20 +0000
> From: Heime <heimeborgia@protonmail.com>
> Cc: help-gnu-emacs@gnu.org
> 
> > Why do you need that? Don't you know which characters you'd like to
> > match?
> 
> No, because which languages get typed into Emacs depends on the user.  But I
> mostly want to match foreign-language characters.

If by "foreign language characters" you mean letters and digits, then
[:alnum:] is what you want, as I already suggested.  This covers all
the characters that are either letters or digits, in all the
languages.
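
As a minimal sketch of the above: the class name goes inside a bracket expression, so the regexp is written [[:alnum:]].  The helper name `my-alnum-string-p' is made up for illustration and is not part of Emacs.

```elisp
;; Sketch only: `my-alnum-string-p' is a hypothetical helper.
(defun my-alnum-string-p (string)
  "Return non-nil if STRING is non-empty and has only letters and digits."
  ;; \\` and \\' anchor the match to the start and end of STRING.
  (string-match-p "\\`[[:alnum:]]+\\'" string))

(my-alnum-string-p "naïve")    ; non-nil: Latin letters, one with a diaeresis
(my-alnum-string-p "Привет")   ; non-nil: Cyrillic letters
(my-alnum-string-p "日本語")    ; non-nil: CJK ideographs
(my-alnum-string-p "abc def")  ; nil: the space is neither letter nor digit
```

Note that a successful match returns the match position (0 here), which counts as non-nil in Lisp.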

> > > Is there a way to show the characters that are members of each class ?
> > 
> > No, but you can check whether each character matches a class.
> 
> What is the function name for doing that ?

string-match-p if you have a string or looking-at-p if you have it in
the buffer.
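
For illustration, both calls could be used roughly like this.  `my-char-in-class-p' is a hypothetical wrapper, and it assumes the class is passed as a bare name such as "alpha".

```elisp
;; Sketch only: `my-char-in-class-p' is a hypothetical helper.
(defun my-char-in-class-p (char class)
  "Return non-nil if CHAR matches the regexp character class named CLASS.
CLASS is a string without brackets or colons, e.g. \"alpha\"."
  (string-match-p (format "[[:%s:]]" class) (char-to-string char)))

(my-char-in-class-p ?é "alpha")   ; non-nil
(my-char-in-class-p ?5 "alpha")   ; nil
(my-char-in-class-p ?5 "digit")   ; non-nil

;; The same kind of test against buffer text at point:
(with-temp-buffer
  (insert "é1")
  (goto-char (point-min))
  (looking-at-p "[[:alpha:]]"))   ; non-nil, point is before "é"
```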

> Can one scan the buffer and list the matched character classes ?

Character classes overlap, so I'm not sure what kind of function you
want, and I don't think we have it anyway.  It's usually the other way
around: the author of a Lisp program knows in advance what kinds of
characters the program needs to match, and uses a regexp which will do
the job.
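
To illustrate the overlap, here is a rough sketch that tests one character against a handful of classes.  Both names below are invented for this example, and the list of classes is deliberately partial: it leaves out classes such as [:upper:] and [:lower:], whose results depend on `case-fold-search', and [:word:] and [:space:], which depend on the current syntax table.

```elisp
(require 'seq)

;; Sketch only: both names are hypothetical, and the list is not exhaustive.
(defvar my-char-classes '("alpha" "alnum" "digit" "ascii" "nonascii")
  "A few regexp character class names to test against.")

(defun my-classes-of-char (char)
  "Return the members of `my-char-classes' that CHAR matches."
  (let ((str (char-to-string char)))
    (seq-filter (lambda (class)
                  (string-match-p (format "[[:%s:]]" class) str))
                my-char-classes)))

(my-classes-of-char ?é)   ; includes "alpha", "alnum" and "nonascii"
(my-classes-of-char ?5)   ; includes "alnum", "digit" and "ascii"
```

A single character lands in several classes at once, which is why a one-class-per-character listing is not well defined.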

> > > Thought that [:multibyte:] captured the Unicode characters. But even when
> > > I applied (set-buffer-multibyte t) to the buffer, I did not get matches.
> > 
> > Don't use [:multibyte:], it is hardly ever the right thing nowadays.
> 
> Can we please update the manual with useful information, such as about
> [:multibyte:]?

The useful information is already there (including a cross-reference
to a detailed description of what "multibyte" means).  I just
translated it into simpler terms, based on what you told me about the job
you want to do, to save you from the need to read that if you don't
want to.

> > > Does [:word:] mean word in the English language only ?
> > 
> > 
> > No, it means characters that have the word syntax. IOW, which
> > characters match depends on the major mode's syntax table. If you are
> > classifying characters from human-readable text, [:word:] is not the
> > right thing to use.
> 
> Can one show the syntax table ?  For me, just saying "word syntax table" does
> not give enough information.  Perhaps give more explanation in the manual.

The manual already does that: there's a cross-reference in the
description of [:word:] which leads to the node "Syntax Class Table",
which explains syntax tables in detail.
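
For interactive inspection, `M-x describe-syntax' (bound to `C-h s') displays the current buffer's syntax table.  The following sketch shows how the same character can match or fail [:word:] depending on the table in effect; `my-word-char-p' and the local variable `table' are hypothetical names used only for this example.

```elisp
;; Sketch only: `my-word-char-p' is a hypothetical helper.
(defun my-word-char-p (char &optional table)
  "Return non-nil if CHAR has word syntax in TABLE.
TABLE defaults to the current buffer's syntax table."
  (with-syntax-table (or table (syntax-table))
    (eq (char-syntax char) ?w)))

;; A new table inherits from the standard one, where ?_ has symbol syntax.
(let ((table (make-syntax-table)))
  (modify-syntax-entry ?_ "w" table)             ; give ?_ word syntax here
  (list (my-word-char-p ?_ (standard-syntax-table))        ; nil
        (my-word-char-p ?_ table)                          ; t
        (with-syntax-table table
          (string-match-p "[[:word:]]" "_"))))             ; 0, i.e. a match
```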
