Re: Regexp capturing unicode characters

help-gnu-emacs

From:	Heime
Subject:	Re: Regexp capturing unicode characters
Date:	Thu, 01 Aug 2024 17:06:26 +0000

On Friday, August 2nd, 2024 at 3:34 AM, Eli Zaretskii <eliz@gnu.org> wrote:

> > Date: Thu, 01 Aug 2024 13:43:20 +0000
> > From: Heime heimeborgia@protonmail.com
> > Cc: help-gnu-emacs@gnu.org
> > 
> > > Why do you need that? Don't you know which characters you'd like to
> > > match?
> > 
> > No, because language insertion in emacs depends upon the user. But I want
> > to match foreign language characters mostly.
> 
> 
> If by "foreign language characters" you mean letters and digits, then
> [:alnum:] is what you want, as I already suggested. This covers all
> the characters that are either letters or digits, in all the
> languages.
> 
> > > > Is there a way to show the characters that are members of each class ?
> > > 
> > > No, but you can check each character whether it matches a class.
> > 
> > What is the function name for doing that ?
> 
> 
> string-match-p if you have a string or looking-at-p if you have it in
> the buffer.
> 
> > Can one scan the buffer and list the matched character classes ?
> 
> 
> Character classes overlap, so I'm not sure what kind of function you
> want, and I don't think we have it anyway. It's usually the other way
> around: the author of a Lisp program knows in advance what kinds of
> characters the program needs to match, and uses a regexp which will do
> the job.

I want the regexp to allow for the possibility that the user wrote a
comment in a language other than English.  Otherwise the regexp would
simply skip it.  And your suggestions have been [:alpha:] and [:alnum:].
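
For reference, a minimal sketch of the per-character check discussed above,
using string-match-p and looking-at-p with bracketed classes. The helper
`my-char-classes' and its list of class names are hypothetical, written
only for illustration; it is not an Emacs built-in. Syntax-dependent
classes such as [:word:] and [:space:] are left out on purpose, and
case-fold-search is bound to nil so that [:upper:] and [:lower:] stay
distinct.

```elisp
;; Hypothetical helper, not part of Emacs: report which of a few regexp
;; character classes a single character matches.
(defun my-char-classes (char)
  "Return the names of the character classes that CHAR matches."
  (let ((classes '("alpha" "alnum" "digit" "upper" "lower"
                   "ascii" "nonascii"))
        (case-fold-search nil)          ; keep upper and lower distinct
        (str (char-to-string char))
        (result nil))
    (dolist (class classes (nreverse result))
      ;; A class name is only valid inside a bracket expression,
      ;; hence the doubled brackets: [[:alpha:]].
      (when (string-match-p (format "[[:%s:]]" class) str)
        (push class result)))))

;; Checking a string directly: non-nil, since é is a letter.
(string-match-p "[[:alnum:]]" "é")

;; Checking text at point in a buffer.
(with-temp-buffer
  (insert "日本語")
  (goto-char (point-min))
  (looking-at-p "[[:alpha:]]+"))
```

Because the classes overlap, one character normally reports several names,
so a per-character list like this is mostly useful for exploring what a
class covers before settling on a regexp.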
 
> > > > Thought that [:multibyte:] captured the Unicode characters. But even when
> > > > I applied (set-buffer-multibyte t) to the buffer, I did not get matches.
> > > 
> > > Don't use [:multibyte:], it is hardly ever the right thing nowadays.
> > 
> > Can we please update the manual with useful information like this
> > about [:multibyte:]?
> 
> 
> The useful information is already there (including a cross-reference
> to a detailed description of what "multibyte" means). I just
> translated it into simpler terms, based on what you told about the job
> you want to do, to save you from the need to read that if you don't
> want to.

What I have in mind is a mention that [:multibyte:] is not used much nowadays.
 
> > > > Does [:word:] mean word in the english language only ?
> > > 
> > > No, it means characters that have the word syntax. IOW, which
> > > > characters match depends on the major mode's syntax table. If you are
> > > classifying characters from human-readable text, [:word:] is not the
> > > right thing to use.
> 
> > Can one show the syntax table ? For me, just saying "word syntax" does
> > not give enough information. Perhaps give more explanation in the manual.
> 
> The manual already does that: there's a cross-reference in the
> description of [:word:] which leads to the node "Syntax Class Table",
> which explains syntax tables in detail.
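
On showing the syntax table: interactively, M-x describe-syntax (C-h s)
displays the table for the current buffer, and char-syntax reports the
class of a single character. The sketch below shows how [:word:] follows
the table; the function names `my-word-class-matches-underscore-p' and
`my-underscore-as-word-p' are hypothetical, and the temporary table that
makes "_" a word constituent is only an assumption chosen for the example.

```elisp
;; `char-syntax' returns the syntax class designator of a character in
;; the current buffer's syntax table; ?w means "word constituent".
(with-temp-buffer
  (char-syntax ?a))

;; Hypothetical helper: does [:word:] match "_" under the syntax table
;; that is current when the function is called?
(defun my-word-class-matches-underscore-p ()
  (and (string-match-p "[[:word:]]" "_") t))

;; Hypothetical helper: run the same check under a temporary table in
;; which "_" has word syntax.
(defun my-underscore-as-word-p ()
  (with-temp-buffer
    (let ((table (make-syntax-table)))  ; inherits the standard table
      (modify-syntax-entry ?_ "w" table)
      (with-syntax-table table
        (my-word-class-matches-underscore-p)))))
```

The same regexp gives different answers under different tables, which is
why [:word:] is a poor fit for classifying human-readable text.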
