help-gnu-emacs

From: Hongyi Zhao
Subject: Re: [External] : Re: Regexp for matching control character, say, FORM FEED. (Was: Re: The `^L' appeared in built-in help.)
Date: Thu, 22 Jul 2021 17:45:36 +0800

On Thu, Jul 22, 2021 at 4:06 PM <tomas@tuxteam.de> wrote:
>
> On Thu, Jul 22, 2021 at 09:13:31AM +0800, Hongyi Zhao wrote:
>
> [...]
>
> > I want to know whether there are some similar regexp patterns in Emacs
> > as the ones used by grep, say, $'\014' or $'\f'.
>
> To offer some other perspective on the (correct) answers by Emanuel and
> Drew, remember that a regular expression is, basically, a string
> where each character is interpreted as "itself", unless it is a "regexp
> special" character [1]. So, for example searching for the regular expression
> "a" will find all "a"s in your text, because the character a isn't a
> "regexp special".
>
> Now ASCII control characters are all *not* "regexp special", so you only
> have to find a way to express them within a string. How to do that is stated
> in the Emacs Lisp manual where it talks about the "string type" [2] (especially
> the subnode "Non-ASCII Characters in Strings"), which leads you to the
> "character type" [3]. The special forms "\f", "\^L" or "\C-L" (all of them
> equivalent), which were all talked about here, are treated in a subnode of
> the above [4].
> This notation carries some historical baggage, so don't expect too much
> logic from it.
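
The same principle can be checked outside Emacs. The following is only a sketch in POSIX shell with grep, not Emacs Lisp: the variable name `ff` and the sample input are made up for illustration. The form feed byte is placed in the pattern as itself, with no regexp escaping.

```shell
# Hold a literal form feed byte (octal 014) in a variable; `ff` is a made-up name.
ff=$(printf '\f')

# Sample input: two lines, only the first contains a form feed.
# The pattern is the raw FF byte, since it is not a regexp-special character.
printf 'one\ftwo\nthree\n' | grep -c "$ff"    # prints 1
```
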
>
> For example, why ^L? Because form feed is at position 12 (in decimal) in the
> ASCII table, and L at position 76, the difference being 64.

$ man ascii |egrep  ' L$'
       014   12    0C    FF  '\f' (form feed)        114   76    4C    L

> What happens is that the "^" "subtracts 64 from the character code", or more
> precisely masks out bit 6 of its binary representation.

$ man ascii |egrep  ' \^$'
       036   30    1E    RS  (record separator)      136   94    5E    ^

If so, the RS should be represented by ^^ in a self-consistent way :-)

> So ^M would be "carriage return" and so on. Just have a look at the ASCII
> table.

$ man ascii |egrep  ' M$'
       015   13    0D    CR  '\r' (carriage ret)     115   77    4D    M
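
The arithmetic behind these three lookups can be sketched in POSIX shell. `ctrl_code` is a hypothetical helper name, not an Emacs or system command; it simply applies the "mask out bit 6" rule quoted above to the character that follows the caret.

```shell
# ctrl_code CHAR: print the decimal code of the control character written ^CHAR.
# Hypothetical helper for illustration only.
ctrl_code() {
    code=$(printf '%d' "'$1")    # a leading ' makes printf print the character's code
    echo $(( $code & ~64 ))      # clear bit 6 (value 64)
}

ctrl_code L      # prints 12 (FF, form feed)
ctrl_code '^'    # prints 30 (RS, record separator)
ctrl_code M      # prints 13 (CR, carriage return)
```

The results agree with the `man ascii` rows shown above, including ^^ for RS.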


> Then "\f" comes from the C string literal representation. It's meant to
> be mnemonic ("f" for "form feed" -- similarly "\n" for "line feed", aka
> "new line", "\b" for "backspace" and so on).
>
> The references below lead you to more alternative representations, like
> short hex "\x0C", short Unicode hex "\u000C", long Unicode hex "\U0000000C";
> there are also (mostly historical) octals, etc.
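
That the octal, hex and mnemonic spellings all name the same byte can be checked from the shell. This sketch uses the escapes of the shell's `printf`, which are similar to, but not the same thing as, Emacs Lisp string syntax. `octal_of` is a hypothetical helper name.

```shell
# octal_of FORMAT: print the octal byte values that printf produces for FORMAT.
# Hypothetical helper for illustration only.
octal_of() { printf "$1" | od -An -to1 | tr -d ' \n'; echo; }

octal_of '\f'       # prints 014
octal_of '\014'     # prints 014, the same byte written in octal

# The numeric spellings of the code agree as well:
printf '%d\n' 014   # octal constant, prints 12
printf '%d\n' 0x0C  # hexadecimal constant, prints 12
```
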
>
> You can even put the unicode /names/ in there, using the "\N{...}"
> notation, so your ^L can be named "\N{FORM FEED (FF)}" (yes the (FF)
> in parentheses is part of it: the Unicode Consortium put it in there.
> Life is like that).
>
> If you want to explore those unicode names, type in C-x 8 <RET>, you
> can autocomplete your way among them.
>
> Hope this gives some rough map for that landscape :-)

Thank you for your systematic and informative comments and explanations.

> Cheers
>
> [1] Emacs Lisp reference manual "Syntax of Regular Expressions"
>     https://www.gnu.org/software/emacs/manual/html_node/elisp/Syntax-of-Regexps.html
>
> [2] Emacs Lisp reference manual "String Type" and its subnodes
>     https://www.gnu.org/software/emacs/manual/html_node/elisp/String-Type.html
>
> [3] Emacs Lisp reference manual "Character Type"
>     https://www.gnu.org/software/emacs/manual/html_node/elisp/Character-Type.html
>
> [4] Emacs Lisp reference manual "Control-Character Syntax"
>     https://www.gnu.org/software/emacs/manual/html_node/elisp/Ctl_002dChar-Syntax.html
>
>
>  - tomás

Best,
HY


