Re: decode-coding-string gone awry?

From: Kenichi Handa
Subject: Re: decode-coding-string gone awry?
Date: Mon, 14 Feb 2005 10:50:25 +0900 (JST)
User-agent: SEMI/1.14.3 (Ushinoya) FLIM/1.14.2 (Yagi-Nishiguchi) APEL/10.2 Emacs/21.3.50 (sparc-sun-solaris2.6) MULE/5.0 (SAKAKI)

In article <address@hidden>, David Kastrup <address@hidden> writes:
> I have the problem that within preview-latex there is a function that
> assembles UTF-8 strings from single characters.  This function, when
> used manually, mostly works.  It is called within a process sentinel
> and fails rather consistently there with a current CVS Emacs.  I
> include the code here since I don't know what might be involved here:
> regexp-quote, substring, char-to-string etc.  The starting string is
> taken from a buffer containing only ASCII (inserted by a process with
> coding-system 'raw-text).

It seems that you are caught in the trap of automatic
unibyte->multibyte conversion.

> (defun preview-error-quote (string)
>   "Turn STRING with potential ^^ sequences into a regexp.
> To preserve sanity, additional ^ prefixes are matched literally,
> so the character represented by ^^^ preceding extended characters
> will not get matched, usually."
>   (let (output case-fold-search)
>     (while (string-match 
> "\\^\\{2,\\}\\(\\(address@hidden)\\|[8-9a-f][0-9a-f]\\)"
>                        string)
>       (setq output
>           (concat output
>                   (regexp-quote (substring string
>                                            0
>                                            (- (match-beginning 1) 2)))

If STRING is taken from a multibyte buffer, it is a
multibyte string.  Thus, the above substring also returns a
multibyte string.

>                   (if (match-beginning 2)
>                       (concat
>                        "\\(?:" (regexp-quote
>                                 (substring string
>                                            (- (match-beginning 1) 2)
>                                            (match-end 0)))
>                        "\\|"
>                        (char-to-string
>                         (logxor (aref string (match-beginning 2)) 64))
>                        "\\)")
>                     (char-to-string
>                      (string-to-number (match-string 1 string) 16))))

But this char-to-string produces a unibyte string.  So, on
concatenating them, this unibyte string is automatically
converted to multibyte by the string-make-multibyte function,
which usually produces a multibyte string containing Latin-1
characters.

>           string (substring string (match-end 0))))
>     (setq output (concat output (regexp-quote string)))
>     (if (featurep 'mule)
>       (prog2
>           (message "%S %S " output buffer-file-coding-system)
>           (setq output (decode-coding-string output 
> buffer-file-coding-system))

And this decode-coding-string treats the internal byte
sequence of the multibyte string OUTPUT as UTF-8, so you get
garbage.

> Unfortunately, when I call this stuff by hand instead from the
> process-sentinel, it mostly works

That is because the string you give to preview-error-quote
is a unibyte string in that case.  The Lisp reader generates
a unibyte string when it sees an ASCII-only string.

Ex: (multibyte-string-p "abc") => nil
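
To make the difference visible, here is a small sketch that checks three
ASCII-only strings with the same contents.  The function name
demo-string-kinds is hypothetical and only for illustration; the check
assumes a session where new buffers are multibyte, which is the default.

```elisp
;; Hypothetical helper for illustration; not part of preview-latex.
(defun demo-string-kinds ()
  "Return the multibyte-ness of three ASCII-only strings as a list."
  (list
   ;; A literal read by the Lisp reader: unibyte.
   (multibyte-string-p "abc")
   ;; The same text explicitly converted: multibyte.
   (multibyte-string-p (string-to-multibyte "abc"))
   ;; The same text taken out of a (multibyte) buffer: multibyte.
   (with-temp-buffer
     (insert "abc")
     (multibyte-string-p (buffer-string)))))

(demo-string-kinds)   ; => (nil t t)
```

The third case corresponds to what the process sentinel sees, and the
first to what you get when typing the call by hand.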

Calling preview-error-quote by hand with a multibyte string also
returns an incorrect result:

  (preview-error-quote
   (string-to-multibyte "r Weise $f$ um~$1$ erh^^c3^^b6ht und $e$"))

So, the easiest fix will be to do:
  (setq string (string-as-unibyte string))
at the head of preview-error-quote.
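
The point of staying unibyte until the end can be sketched in isolation:
collect the bytes named by the ^^xx escapes into a unibyte string and
decode that once.  This is only an illustration, not the preview-latex
code; demo-hex-pairs-to-utf8 is a hypothetical name, and unibyte-string
is assumed to be available (it exists in newer Emacs versions than the
one discussed here).

```elisp
;; Hypothetical helper for illustration; not part of preview-latex.
(defun demo-hex-pairs-to-utf8 (hex-pairs)
  "Decode HEX-PAIRS, a list of two-digit hex strings, as UTF-8 text."
  (let ((raw (apply #'unibyte-string
                    (mapcar (lambda (h) (string-to-number h 16))
                            hex-pairs))))
    ;; RAW is unibyte, so decode-coding-string sees exactly these
    ;; bytes and nothing has been converted behind our back.
    (decode-coding-string raw 'utf-8)))

;; The two bytes from the ^^c3^^b6 in the example string above.
(demo-hex-pairs-to-utf8 '("c3" "b6"))   ; => a one-character string, U+00F6
```

Because no multibyte string is involved before the decode step, there
is nothing for the automatic conversion to change.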

Ken'ichi HANDA
