emacs-orgmode

Re: Bug: ODT export of Chinese text inserts spaces for line breaks


From: Maxim Nikulin
Subject: Re: Bug: ODT export of Chinese text inserts spaces for line breaks
Date: Wed, 30 Jun 2021 00:01:00 +0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.11.0

On 29/06/2021 10:47, James Harkins wrote:
So, it would make sense to add a rule to the exporter: if one of the
characters before or after a source-text line break is a Chinese,
Japanese or Korean character, do not add a space.

On 29/06/2021 11:43, tumashu wrote:
You can try the below config :-)
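     ;; TEXT is assumed to be bound by the surrounding code,
     ;; e.g. the string argument of an export filter.
     (let ((regexp "[[:multibyte:]]")
           (string text))
       ;; Remove a line break (and adjacent spaces) between two
       ;; multibyte characters.
       (setq string
             (replace-regexp-in-string
              (format "\\(%s\\) *\n *\\(%s\\)" regexp regexp)
              "\\1\\2" string))
       string)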

Note that [[:multibyte:]] matches characters of almost any non-ASCII script, e.g. Cyrillic:

(let ((sample "abc абв def"))
  (and (string-match "[[:multibyte:]]+" sample)
       (match-string 0 sample)))
"абв"

It seems that `org-fill-paragraph' (M-q) is smart enough to avoid a space before or after a CJK character, so it is possible to determine the correct way to join lines, even though e.g. the "Script" Unicode property is not exposed to elisp: https://www.gnu.org/software/emacs/manual/html_node/elisp/Character-Properties.html (In any case, maintaining an explicit list of scripts is not a straightforward approach.)

P.S.
JavaScript in browsers can match characters that belong to a particular script:

"abc абв def".match(/\p{Script=Cyrillic}+/u)
Array [ "абв" ]

I have not found such a feature in the regular expressions available in Emacs.
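
Purely as an illustration (this is JavaScript, so it cannot be plugged into the exporter), the rule James proposed could be sketched with script properties roughly as follows. The function name `joinLines` and the choice of exactly these four scripts are my assumptions. CJK punctuation belongs to the "Common" script and is not treated specially here.

```javascript
// Hypothetical helper: join soft line breaks inside a paragraph.
// A break becomes a single space unless the character before or
// after it belongs to one of the scripts listed below.
function joinLines(text) {
  // Assumption: these four scripts stand in for "CJK".
  const cjk = /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}]/u;
  // Lookbehind and lookahead capture the neighbouring characters
  // without consuming them, so consecutive breaks are all handled.
  return text.replace(
    /(?<=(\S)) *\n *(?=(\S))/gu,
    (match, before, after) =>
      cjk.test(before) || cjk.test(after) ? "" : " "
  );
}
```

With this sketch, joinLines("中文\n文本") gives "中文文本", while joinLines("абв\nгде") gives "абв где". Blank lines between paragraphs are left untouched, because the pattern requires a non-space character on both sides of a single line break.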



