
From: Eduardo Ochs
Subject: bug#1215: 23.0.60; unibyte->multibyte conversion problem (in search-forward and friends)
Date: Tue, 21 Oct 2008 12:00:58 -0400


This may not be exactly a bug; I'm just struggling with an obscure
part of Emacs. Anyway, I did my best to make this look like a proper
bug report, and to make the tests clear enough to help other people
who also find unibyte<->multibyte conversions obscure.

The short story
Let me refer to strings like "<<tag>>" - where the "<<" and ">>" stand
for guillemets, i.e., the characters that we type with `C-x 8 <' and
`C-x 8 >' - as "anchors". So: if I produce an anchor string in a
unibyte buffer and then search for an occurrence of that string in a
multibyte buffer, the search fails.

The two small blocks below illustrate this. Instructions: save the
first one as "/tmp/1.txt" and the second one as "/tmp/2.txt", then
evaluate
  (load-file "/tmp/1.txt")

It will show "uni" in the "*Messages*" buffer, and then the search
will fail with an error message like this:

  progn: Search failed: "\302\253foo\302\273"

meaning the anchor string has been incorrectly converted.

;; -*- coding: raw-text-unix -*-
;; (save-this-block-as "/tmp/1.txt")
(progn
  (find-file "/tmp/2.txt")
  (goto-char (point-min))
  (setq anchorstr "«foo»")
  (message (if (multibyte-string-p anchorstr) "multi" "uni"))
  (search-forward anchorstr))

;; -*- coding: latin-1 -*-
;; (save-this-block-as "/tmp/2.txt")
(search-forward "«foo»")
;; «foo»
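
For reference, the difference between the two kinds of string can be
seen without any files. This is only a minimal sketch, not part of the
test case above: the names `my-uni' and `my-multi' are made up for
illustration, and it assumes that a string literal whose only
non-ASCII content is octal escapes below 256 is read as unibyte.

```elisp
;; Hypothetical names, for illustration only.
(defvar my-uni "\253foo\273"
  "Unibyte string: the octal escapes denote the raw bytes 171 and 187.")

(defvar my-multi (string 171 ?f ?o ?o 187)
  "Multibyte string built from the characters with codes 171 and 187.")

(multibyte-string-p my-uni)    ; nil
(multibyte-string-p my-multi)  ; t

;; Both strings have 5 characters, but the multibyte one needs two
;; bytes for each guillemet in Emacs's internal representation.
(string-bytes my-uni)          ; 5
(string-bytes my-multi)        ; 7
```

So the two strings print alike but are stored differently, which is
why the representation of the search string matters here.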

The long story
Save the block below as "/tmp/3.txt" and follow the instructions in
it. Note that it doesn't have any non-ascii characters - the anchors
are produced by running the "(insert ...)" sexps.

;; -*- coding: latin-1 -*-
;; (save-this-block-as "/tmp/3.txt")

;; Run the "progn" below with C-x C-e.
;; It will create a line like this:
;; <<anchor>>\253anchor\273\253anchor\273\253anchor\273
;; (but the "<<", ">>", "\253", "\273" are single characters).
;; Don't delete that line, it will be used later.
(progn
  (defun mmb (str) (string-make-multibyte str))
  (defun mub (str) (string-make-unibyte   str))
  (insert 171 "anchor" 187)
  (insert           "\253anchor\273")
  (insert      (mub "\253anchor\273"))
  (insert (mmb (mub "\253anchor\273"))))

;; Now try to save this file.
;; Emacs will complain about the "\253"s and "\273"s - it will
;; say that iso-latin-1-unix and utf-8-unix cannot encode them.
;; The "<<" and ">>" are ok, though...
;; So: leave the "<<anchor>>" above, delete the "\253anchor\273"s,
;; save this file, and reload it. DON'T SKIP THIS STEP - the
;; charset properties mentioned below behave differently before
;; and after reloads, and I don't know exactly the mechanics of
;; this... 8-\
;; If we inspect the "<<", ">>", "\253", "\273" with `C-x ='
;; we see this:
;; Char: << (171, #o253, #xab, file #xAB)
;; Char: >> (187, #o273, #xbb, file #xBB)
;; Char: \253 (4194219, #o17777653, #x3fffab, raw-byte)
;; Char: \273 (4194235, #o17777673, #x3fffbb, raw-byte)
;; Now mark the "<<anchor>>" above and copy it to the top of
;; the kill ring with `M-w'. Let's examine the results of
;; several obvious ways to (re)create the "<<anchor>>"
;; above as a string...
;; Here are some of the results:
;;               "\253anchor\273"   ==> "<<anchor>>"
;;          (mub "\253anchor\273")  ==> "<<anchor>>"
;;     (mmb (mub "\253anchor\273")) ==> "\253anchor\273"
;;               (car kill-ring)    ==>
;;               #("<<anchor>>" 0 8 (charset iso-8859-1))
;;          (mub (car kill-ring))   ==> "<<anchor>>"
;;     (mmb (mub (car kill-ring)))  ==> "\253anchor\273"

                       (mub "\253anchor\273")
                  (mmb (mub "\253anchor\273"))
             (mub (mmb (mub "\253anchor\273")))
(mapcar 'identity           "\253anchor\273")
(mapcar 'identity      (mub "\253anchor\273"))
(mapcar 'identity (mmb (mub "\253anchor\273")))
                            (car kill-ring)
                       (mub (car kill-ring))
                  (mmb (mub (car kill-ring)))
(mapcar 'identity           (car kill-ring))
(mapcar 'identity      (mub (car kill-ring)))
(mapcar 'identity (mmb (mub (car kill-ring))))

;; This is the weird part.
;; Let's insert another "<<anchor>>"/"\253anchor\273" pair, and
;; let's try to jump to its "anchors" with `search-backward'.

(insert 171 "anchor" 187 "\n\253anchor\273")

(search-backward            "\253anchor\273")
(search-backward       (mub "\253anchor\273"))
(search-backward  (mmb (mub "\253anchor\273")))
(search-backward            (car kill-ring))
(search-backward       (mub (car kill-ring)))
(search-backward  (mmb (mub (car kill-ring))))

;; Only "(search-backward (car kill-ring))" jumps to
;; "<<anchor>>" - all the others jump to "\253anchor\273".
;; The trick - aha! - is that "(car kill-ring)" holds this
;; string,
;;          (car kill-ring)    ==>
;;          #("<<anchor>>" 0 8 (charset iso-8859-1))
;; and the "(charset iso-8859-1)" property is essential...
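
The character codes reported by `C-x =' for the raw bytes can be
reproduced directly. This is a small sketch, assuming an Emacs in which
`unibyte-char-to-multibyte' maps a byte in the range 128..255 to the
corresponding raw-byte character; the function name `my-byte-roundtrip'
is hypothetical.

```elisp
(defun my-byte-roundtrip (byte)
  "Return (RAW BACK) for BYTE, an integer between 128 and 255.
RAW is the raw-byte character for BYTE, and BACK is the byte
obtained by converting RAW back again."
  (let ((raw (unibyte-char-to-multibyte byte)))
    (list raw (multibyte-char-to-unibyte raw))))

(my-byte-roundtrip #o253)   ; (4194219 171)
(my-byte-roundtrip #o273)   ; (4194235 187)

;; Converting the unibyte string itself gives the same raw-byte
;; character, not the guillemet with code 171.
(aref (string-to-multibyte "\253") 0)   ; 4194219
```

This matches the numbers above: 4194219 is #x3fffab and 4194235 is
#x3fffbb, while the real guillemets are the characters 171 and 187.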

What is the standard way to convert unibyte strings (for example
anchor strings, generated from code in raw-text-unix ".el" files) to
strings with the right charset property (if needed) and the right
encoding? I couldn't find the functions for that...
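
One candidate, offered only as a sketch and not confirmed anywhere in
this report, is to decode the unibyte string explicitly with
`decode-coding-string'. The helper name `my-anchor-string' is made up,
and the choice of `iso-latin-1' assumes that the raw bytes are meant as
Latin-1 text.

```elisp
(defun my-anchor-string (raw)
  "Decode the unibyte string RAW as Latin-1.
The result is a multibyte string holding real characters, not
raw bytes."
  (decode-coding-string raw 'iso-latin-1))

;; Usage: search a multibyte buffer for the decoded anchor.
(with-temp-buffer
  (insert 171 "anchor" 187)
  (goto-char (point-min))
  (search-forward (my-anchor-string "\253anchor\273") nil t))
```

Whether the result also carries a `charset' text property, and whether
that matters for the search, is left open here.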

  Cheers, thanks in advance,
    Eduardo Ochs
    eduardoochs at gmail.com

P.S.: (emacs-version) ==>
"GNU Emacs (i686-pc-linux-gnu, GTK+ Version 2.8.20)
 of 2008-10-11 on dekooning"
