Re: regexp filter to match non-english characters
From: Michal Nazarewicz
Subject: Re: regexp filter to match non-english characters
Date: Thu, 06 Nov 2008 10:34:25 +0100
User-agent: Gnus/5.110011 (No Gnus v0.11) Emacs/23.0 (Slckware Linux)

Ted Zlatanov <tzz@lifelogs.com> writes:
> On Wed, 05 Nov 2008 14:14:31 -0600 "Robert D. Crawford" <rdc1x@comcast.net>
> wrote:
>
> RDC> Ted Zlatanov <tzz@lifelogs.com> writes:
>>> (string-match "[^\\000-\\1ff]" "hello") ;; OK
>>>
>>> This will match character values over 0x1FF, which is the limit of
>>> extended ASCII. Does that work for you?
>
> RDC> Will this match the unicode double ">" and the like? Some people
> RDC> feel the need to use these in their breadcrumbs and such. If
> RDC> there is no way to just filter out the foreign characters, I will
> RDC> use it.
>
> You can just try it!
>
> (string-match "[^\\000-\\1ff]" "»") ;; returns 0, meaning it's a match
> (string-match "[^\\000-\\1ff]" ">>") ;; returns nil, meaning it's not a match
>
> RDC> The other possibility is to lower permanently on each character that is
> RDC> read to me, but this seems tedious and time consuming on my part and
> RDC> likely slow for gnus to score.
>
> Nah, the above should work. You will need a single backslash instead of
> two, though (the doubling is needed to tell Emacs Lisp that's a real
> backslash inside the string when it reads it in).
"<<" and ">>" have codes U+00AB and U+00BB so that's why they match but
there are plenty of other characters which may show up in English
text, such as (I'll use ASCII sequences which resemble the proper
Unicode characters) "`" (U+2018), "'" (U+2019), "``" (U+201C),
"''" (U+201D) or "..." (U+2026), and any of these will cause the entry
to be filtered out.
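Besides, I think what you really meant was:
(string-match "[^\000-\177]" "string")
since "1ff" is not a valid octal number. Note the single backslashes:
the Lisp reader turns "\000" and "\177" into the characters themselves,
and a backslash is not special inside a regexp bracket expression.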
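To see how an ASCII-only class behaves on the punctuation mentioned above, here is a small sketch. It assumes the pattern is written with single backslashes, so that the Lisp reader converts the octal escapes before the regexp engine sees them. The helper name my-first-non-ascii is made up for illustration.

```elisp
;; Hypothetical helper: index of the first non-ASCII character, or nil.
;; "\000" and "\177" are octal escapes handled by the Lisp reader, so the
;; regexp receives the actual characters NUL and DEL as range endpoints.
(defun my-first-non-ascii (string)
  (string-match "[^\000-\177]" string))

(my-first-non-ascii "plain ASCII >> text")   ; => nil, everything is ASCII
(my-first-non-ascii "News \u00bb Tech")      ; => 5, U+00BB is outside the range
(my-first-non-ascii "It\u2019s here")        ; => 2, so is U+2019
```

The last call shows why a plain "any non-ASCII character" test is too strict for English text: a single typographic apostrophe is enough to trigger it.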
I think that taking the title of the entry and checking whether at least
90% of its characters are ASCII would be sufficient to filter out Asian
texts. You can also try taking the first 100 (or so) characters of the
body. I think you could use replace-regexp-in-string for that purpose:
(defun mn-non-english-p (string)
  ;; Non-nil when fewer than 90% of the characters in STRING are ASCII.
  (<
   (* (length (replace-regexp-in-string "[^\000-\177]" "" string)) 10)
   (* (length string) 9)))
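
The same 90% idea, restricted to the first 100 or so characters as suggested above, might be sketched like this. The function name my-mostly-ascii-p, the default limit and the threshold are assumptions made for this example, not anything Gnus provides.

```elisp
;; Hypothetical predicate: non-nil if at least 90% of the first LIMIT
;; characters of STRING (default 100) are ASCII.  An empty string passes.
(defun my-mostly-ascii-p (string &optional limit)
  (let* ((head (substring string 0 (min (length string) (or limit 100))))
         (ascii 0))
    ;; Count characters whose code is below 128.
    (dotimes (i (length head))
      (when (< (aref head i) 128)
        (setq ascii (1+ ascii))))
    ;; Integer arithmetic: ascii/total >= 9/10  <=>  10*ascii >= 9*total.
    (>= (* ascii 10) (* (length head) 9))))

(my-mostly-ascii-p "It\u2019s a new Emacs release")       ; one curly quote is tolerated
(my-mostly-ascii-p "\u65e5\u672c\u8a9e\u306e\u8a18\u4e8b") ; no ASCII at all
```

Counting with aref avoids building an intermediate string. Whether that matters for scoring speed is something you would have to measure.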
--
Best regards, _ _
.o. | Liege of Serenly Enlightened Majesty of o' \,=./ `o
..o | Computer Science, Michal "mina86" Nazarewicz (o o)
ooo +--<mina86*tlen.pl>--<jid:mina86*jabber.org>--ooO--(_)--Ooo--