Re: Bug: ODT export of Chinese text inserts spaces for line breaks
From: Maxim Nikulin
Subject: Re: Bug: ODT export of Chinese text inserts spaces for line breaks
Date: Wed, 30 Jun 2021 00:01:00 +0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.11.0
On 29/06/2021 10:47, James Harkins wrote:
So, it would make sense to add a rule to the exporter: if one of the
characters before or after a source-text line break is a Chinese,
Japanese or Korean character, do not add a space.
On 29/06/2021 11:43, tumashu wrote:
You can try the below config :-)
(let ((regexp "[[:multibyte:]]")
      (string text))
  (setq string
        (replace-regexp-in-string
         (format "\\(%s\\) *\n *\\(%s\\)" regexp regexp)
         "\\1\\2" string))
Notice that [[:multibyte:]] matches almost any non-ASCII character, not
only CJK ones, e.g. Cyrillic:
(let ((sample "abc абв def"))
  (and (string-match "[[:multibyte:]]+" sample)
       (match-string 0 sample)))
=> "абв"
It seems that `org-fill-paragraph' (M-q) is smart enough to avoid a space
before or after a CJK character, so it is possible to determine the correct
way to splice lines, even though e.g. the "Script" Unicode property is not
exposed to elisp:
https://www.gnu.org/software/emacs/manual/html_node/elisp/Character-Properties.html
(Anyway, maintaining an explicit list of scripts is not a straightforward
approach.)
P.S.
JavaScript in browsers allows filtering characters that belong to a
particular script:
"abc абв def".match(/\p{Script=Cyrillic}+/u)
Array [ "абв" ]
I have not found such a feature in the regular expressions available in Emacs.
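
For illustration only, the rule James proposed (no space when a character
next to the line break is CJK) might be sketched in JavaScript with such
script properties. The names CJK, cjkBreak and joinLines are hypothetical,
and the choice of Han, Hiragana, Katakana and Hangul as "CJK" is an
assumption of the sketch. This is not what the Org exporter does.

```javascript
// Sketch only: CJK, cjkBreak and joinLines are hypothetical names.
// Assumption: "CJK" means the Han, Hiragana, Katakana and Hangul scripts.
const CJK =
  "[\\p{Script=Han}\\p{Script=Hiragana}\\p{Script=Katakana}\\p{Script=Hangul}]";

// A line break, with any spaces around it, that has a CJK character on
// at least one side. Lookarounds keep the neighbours out of the match.
const cjkBreak = new RegExp(`(?<=${CJK}) *\\n *| *\\n *(?=${CJK})`, "gu");

function joinLines(text) {
  return text
    .replace(cjkBreak, "")     // CJK neighbour: splice without a space
    .replace(/ *\n */g, " ");  // otherwise: the line break becomes one space
}

console.log(joinLines("中文\n字符")); // prints "中文字符"
console.log(joinLines("abc\nабв"));   // prints "abc абв"
```

Unlike [[:multibyte:]], this leaves the usual space between Cyrillic (or
other non-CJK) words, at the cost of the explicit list of scripts
mentioned above.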