Re: [O] [BUG] Mark-up handling chokes on Unicode white-space
From: Tobias Getzner
Subject: Re: [O] [BUG] Mark-up handling chokes on Unicode white-space
Date: Wed, 24 Sep 2014 09:34:25 +0200
Hi Aaron,
On Tue, 2014-09-23 at 14:15 -0400, Aaron Ecay wrote:
> org-emphasis-regexp-components is known to be a wart. You can search
> for posts on the mailing list. Some people are trying to figure out how
> to get rid of it. (You can search in particular for Nicolas Goaziou’s
> posts...) Here’s one thread where you can see the lay of the land:
> <http://mid.gmane.org/address@hidden>.
Thank you for the background info!
> All that to say, the longer-term solution is to figure out some radically
> different approach. In the meantime though, if you can provide a list of
> characters (by unicode name and/or code point) that you think should be
> added to that variable, someone might be able to add them.
I guess the straightforward way of defining white-space would be to use
the set of characters with the Unicode property WSpace=Y, which is what
«[:space:]», «\s», etc. should be expected to match in Unicode-based
locales. I’m supplying a list of code points below for convenience.
I agree though that defining what counts as «white space» within the
confines of org-mode is putting the cart before the horse. I’ll try to
ascertain whether the Emacs implementation of «[:space:]» really only
does 8-bit spaces, and if so I’ll see whether I can poke someone on the
Emacs bug tracker about this.
Best regards,
T.
──────────────────────────────────────────────────────────────────────
List of Unicode white-space
Below is the list of characters with the property White_Space set,
taken from the Unicode 7.0.0 character database. This includes
line-breaking white-space such as «line feed». If these are not
relevant, one can use the subset made up of the space separators (Zs;
these do not include control characters such as Tab) and whichever
control characters (Cc) are still wanted.
0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>
0020          ; White_Space # Zs       SPACE
0085          ; White_Space # Cc       <control-0085>
00A0          ; White_Space # Zs       NO-BREAK SPACE
1680          ; White_Space # Zs       OGHAM SPACE MARK
2000..200A    ; White_Space # Zs  [11] EN QUAD..HAIR SPACE
2028          ; White_Space # Zl       LINE SEPARATOR
2029          ; White_Space # Zp       PARAGRAPH SEPARATOR
202F          ; White_Space # Zs       NARROW NO-BREAK SPACE
205F          ; White_Space # Zs       MEDIUM MATHEMATICAL SPACE
3000          ; White_Space # Zs       IDEOGRAPHIC SPACE
──────────────────────────────────────────────────────────────────────
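As a rough illustration of how the list above could be turned into an
explicit matcher, here is a small sketch. It is written in Python rather
than Emacs Lisp purely for illustration, and it is not how org-mode or
Emacs handle white-space. The names WHITE_SPACE_DATA, parse_ranges,
char_class, WHITE_SPACE_RE and ZS_RE are made up for this example; the
data lines are the code points and general categories from the list,
with the character names dropped.

```python
import re

# Code points and general categories copied from the list above
# (Unicode 7.0.0, property White_Space); character names omitted.
WHITE_SPACE_DATA = """\
0009..000D ; White_Space # Cc
0020 ; White_Space # Zs
0085 ; White_Space # Cc
00A0 ; White_Space # Zs
1680 ; White_Space # Zs
2000..200A ; White_Space # Zs
2028 ; White_Space # Zl
2029 ; White_Space # Zp
202F ; White_Space # Zs
205F ; White_Space # Zs
3000 ; White_Space # Zs
"""


def parse_ranges(data):
    """Return a list of (first, last, category) tuples, one per data line."""
    ranges = []
    for line in data.splitlines():
        if not line.strip():
            continue
        fields, _, comment = line.partition("#")
        code_points = fields.split(";")[0].strip()
        # A single code point has no "..", so `last` comes back empty.
        first, _, last = code_points.partition("..")
        category = comment.split()[0]
        ranges.append((int(first, 16), int(last or first, 16), category))
    return ranges


def char_class(ranges):
    """Build a regex character class such as [\\u0009-\\u000D\\u0020...]."""
    parts = []
    for first, last, _category in ranges:
        if first == last:
            parts.append("\\u%04X" % first)
        else:
            parts.append("\\u%04X-\\u%04X" % (first, last))
    return "[" + "".join(parts) + "]"


RANGES = parse_ranges(WHITE_SPACE_DATA)

# Matches any single character from the full White_Space list.
WHITE_SPACE_RE = re.compile(char_class(RANGES))

# Matches only the space separators (Zs), i.e. no Tab or line breaks.
ZS_RE = re.compile(char_class([r for r in RANGES if r[2] == "Zs"]))
```

With these definitions, WHITE_SPACE_RE.split("a\u00a0b\u2009c") gives
['a', 'b', 'c'], while ZS_RE does not match "\t". The explicit ranges
are the point of the sketch: the result does not depend on what a given
regexp engine's own white-space class happens to cover.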