Re: regexp filter to match non-english characters

From: Ted Zlatanov
Subject: Re: regexp filter to match non-english characters
Date: Thu, 06 Nov 2008 13:01:15 -0600
User-agent: Gnus/5.110011 (No Gnus v0.11) Emacs/23.0.60 (gnu/linux)

On Thu, 06 Nov 2008 10:43:38 -0600 "Robert D. Crawford" wrote:

RDC> Score files are great.  Truth be told, I'm just looking for what works.
RDC> I like your solution but it will exclude posts with unicode characters,
RDC> which is something I would like to avoid if possible.

OK, so the question now is "how to tell if a character is in the Asian
Unicode character ranges."  Unfortunately I recall Emacs' own character
database will misrepresent some Latin characters, so I wouldn't depend
on character properties.

I looked at the list of Unicode blocks and picked the blocks that
looked useful.

(defun zme ()
  (let ((data "
0D00..0D7F; Malayalam
0D80..0DFF; Sinhala
0E00..0E7F; Thai
0E80..0EFF; Lao
0F00..0FFF; Tibetan
1000..109F; Myanmar
1780..17FF; Khmer
1800..18AF; Mongolian
1900..194F; Limbu
1950..197F; Tai Le
1980..19DF; New Tai Lue
19E0..19FF; Khmer Symbols
1A00..1A1F; Buginese
1B00..1B7F; Balinese
2E80..2EFF; CJK Radicals Supplement
2F00..2FDF; Kangxi Radicals
2FF0..2FFF; Ideographic Description Characters
3000..303F; CJK Symbols and Punctuation
3040..309F; Hiragana
30A0..30FF; Katakana
3100..312F; Bopomofo
3130..318F; Hangul Compatibility Jamo
3190..319F; Kanbun
31A0..31BF; Bopomofo Extended
31C0..31EF; CJK Strokes
31F0..31FF; Katakana Phonetic Extensions
3200..32FF; Enclosed CJK Letters and Months
3300..33FF; CJK Compatibility
3400..4DBF; CJK Unified Ideographs Extension A
4DC0..4DFF; Yijing Hexagram Symbols
4E00..9FFF; CJK Unified Ideographs
A000..A48F; Yi Syllables
A490..A4CF; Yi Radicals
AC00..D7AF; Hangul Syllables
F900..FAFF; CJK Compatibility Ideographs")
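        (out ""))
    (dolist (line (split-string data "\n"))
      (dolist (item (split-string line ";"))
        ;; Only the "START..END" item matches; the block name is skipped.
        (when (string-match "\\([0-9A-F]+\\)\\.\\.\\([0-9A-F]+\\)" item)
          (setq out
                ;; Convert the hex bounds to characters and append "START-END".
                (concat out (format "%c-%c"
                                    (string-to-number (match-string 1 item) 16)
                                    (string-to-number (match-string 2 item) 16)))))))
    ;; Negated class: matches any one character outside the listed blocks.
    (concat "[^" out "]")))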

Evaluating this (you have to load the 'cl library too) gives a negated
character class built from those ranges.

I don't know if this is good enough for you, but the ranges are correct
at least and you see how you can add more.  I tested with a few
characters like this:

(string-match (zme) "helloഀ")

and it seems to work OK.  In a score file you'll have only one backslash
but otherwise it should work.
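
One thing to watch when testing: string-match returns the index of the
first match, so a negated class matches at position 0 of any string that
starts with an ordinary character, whether or not an Asian character
appears later. To check whether a string contains a character from the
listed blocks, a positive class may be easier to test. The following is
only a minimal sketch using three of the ranges; the names
my-asian-char-regexp and my-contains-asian-char-p are made up for
illustration and are not part of Emacs or Gnus.

```elisp
;; Hypothetical helper names; only three of the ranges above are included.
(defvar my-asian-char-regexp
  (concat "["
          (mapconcat (lambda (range)
                       ;; Turn a (START . END) pair of code points into "S-E".
                       (format "%c-%c" (car range) (cdr range)))
                     '((#x0D00 . #x0D7F)    ; Malayalam
                       (#x3040 . #x309F)    ; Hiragana
                       (#x4E00 . #x9FFF))   ; CJK Unified Ideographs
                     "")
          "]")
  "Regexp matching one character from a few Asian Unicode blocks.")

(defun my-contains-asian-char-p (string)
  "Return t if STRING has a character matched by `my-asian-char-regexp'."
  (and (string-match my-asian-char-regexp string) t))

;; Plain ASCII text: no character falls in the ranges.
(my-contains-asian-char-p "hello")
;; U+3042 (HIRAGANA LETTER A) lies inside 3040..309F.
(my-contains-asian-char-p (concat "hello " (string #x3042)))
```

The same positive class, with the full list of ranges, could presumably
be used wherever a regexp is expected to flag such text.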
