Re: Problem with Boyer Moore and Greek characters


From: Kenichi Handa
Subject: Re: Problem with Boyer Moore and Greek characters
Date: Tue, 7 May 2002 22:35:29 +0900 (JST)
User-agent: SEMI/1.14.3 (Ushinoya) FLIM/1.14.2 (Yagi-Nishiguchi) APEL/10.2 Emacs/21.1.30 (sparc-sun-solaris2.6) MULE/5.0 (SAKAKI)

Sorry for the late reply on this matter.

Although I don't fully understand this part of the code, it
seems that your fix is correct.  Richard, what do you think?
Shall I install it (both in HEAD and RC)?

---
Ken'ichi HANDA
handa@etl.go.jp

Thomas Morgan <tlm@pocketmail.com> writes:
> I ran GNU Emacs 21.1.1 (i686-pc-linux-gnu, X toolkit) with the options
> `--q --no-site-file', then typed the following into `*scratch*':

>   (search-forward "ί")
>   ύ

> (The first Greek character is an accented iota represented in Emacs by
> the character number 342199, and the second is an accented upsilon
> represented by 342203.  I entered them with the input method
> `greek-ibycus4'.)

> Then I pressed `C-p' and `C-e' to move point to the end of the first
> line, and `C-x C-e' to evaluate the expression.

> Here is the exact input for all of that:

> ( s e a r c h - f o r w a r d SPC " C-x <return> C-\ 
> g r e e k - i b y c u s 4 <return> i ' C-\ " ) <return> 
> C-\ u ' C-\ C-p C-e C-x C-e

> This moved the cursor to the end of the second line, and displayed
> `214', the new position of point, in the echo area.  So searching for
> the iota found the upsilon.  This must be a bug.

> Boyer Moore searching compares only the last bytes of the characters,
> and this leads to the problem.  If you capitalize the accented iota,
> the last byte is the same as the last byte of the upsilon, although
> their second-to-last bytes are different.

> Capital accented iota   \234\364\362\273
> Small accented upsilon  \234\364\361\273
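
A minimal sketch of the collision described above, using only the two
byte sequences quoted in the report.  The helper names
`last_byte_equal' and `leading_bytes_equal' are hypothetical and do not
appear in search.c; they only illustrate why comparing the final byte
alone cannot tell these two characters apart.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte sequences as quoted in the report (octal escapes).  */
static const unsigned char capital_iota[4]  = { 0234, 0364, 0362, 0273 };
static const unsigned char small_upsilon[4] = { 0234, 0364, 0361, 0273 };

/* Hypothetical helper: compare only the final byte of two encoded
   characters, which is the comparison the report attributes to the
   Boyer Moore search.  */
static int
last_byte_equal (const unsigned char *a, const unsigned char *b, size_t len)
{
  assert (len > 0);
  return a[len - 1] == b[len - 1];
}

/* Hypothetical helper: compare every byte except the final one.  */
static int
leading_bytes_equal (const unsigned char *a, const unsigned char *b, size_t len)
{
  assert (len > 0);
  return memcmp (a, b, len - 1) == 0;
}
```

For the two sequences above, `last_byte_equal' returns 1 while
`leading_bytes_equal' returns 0, because the third bytes (\362 and
\361) differ.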

> So before doing a Boyer Moore search, `search_buffer' needs to check
> that the character and its inversion have the same first three bytes.
> Here is the patch I made to do that.  Please forgive my mistakes; I am
> not a programmer.

> cd ~/emacs-21.1/src/
> diff -c /home/tlm/emacs-21.1/src/search.c.\~1\~ /home/tlm/emacs-21.1/src/search.c
> *** /home/tlm/emacs-21.1/src/search.c.~1~     Mon Oct  1 02:08:20 2001
> --- /home/tlm/emacs-21.1/src/search.c Wed Apr  3 07:53:39 2002
> ***************
> *** 1237,1243 ****
>                 /* Keep track of which character set row
>                    contains the characters that need translation.  */
>                 int charset_base_code = c & ~CHAR_FIELD3_MASK;
> !               if (charset_base == -1)
>                   charset_base = charset_base_code;
>                 else if (charset_base != charset_base_code)
>                   /* If two different rows appear, needing translation,
> --- 1237,1246 ----
>                 /* Keep track of which character set row
>                    contains the characters that need translation.  */
>                 int charset_base_code = c & ~CHAR_FIELD3_MASK;
> !               int inverse_charset_base = inverse & ~CHAR_FIELD3_MASK;
> !               if (charset_base_code != inverse_charset_base)
> !                 boyer_moore_ok = 0;
> !               else if (charset_base == -1)
>                   charset_base = charset_base_code;
>                 else if (charset_base != charset_base_code)
>                   /* If two different rows appear, needing translation,

> Diff finished at Wed Apr  3 08:00:10
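
The decision logic of the patched hunk can be sketched as a standalone
function.  This is an illustration under stated assumptions, not the
code in search.c: `SKETCH_FIELD3_MASK' stands in for CHAR_FIELD3_MASK
and is assumed to select the low 7 bits of a character code; `struct
bm_state' and `note_translated_char' are hypothetical names; and the
action in the final branch is assumed, since the diff context is cut
off before it.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: stands in for CHAR_FIELD3_MASK, taken here to be the
   low 7 bits of a character code.  */
#define SKETCH_FIELD3_MASK 0x7F

/* Hypothetical container for the two variables the hunk updates.  */
struct bm_state
{
  int charset_base;    /* -1 until the first translated character */
  int boyer_moore_ok;  /* nonzero while Boyer Moore is still usable */
};

/* Mirror the branch order of the patched hunk for one character C
   whose translation inverse is INVERSE.  */
static void
note_translated_char (struct bm_state *st, int c, int inverse)
{
  int charset_base_code = c & ~SKETCH_FIELD3_MASK;
  int inverse_charset_base = inverse & ~SKETCH_FIELD3_MASK;

  assert (st != NULL);

  if (charset_base_code != inverse_charset_base)
    /* The character and its inverse lie in different rows.  */
    st->boyer_moore_ok = 0;
  else if (st->charset_base == -1)
    st->charset_base = charset_base_code;
  else if (st->charset_base != charset_base_code)
    /* Assumed action: the diff context ends before this statement.  */
    st->boyer_moore_ok = 0;
}
```

With this mask, the codes 342199 and 342203 from the report share a
row, so a pair like that leaves `boyer_moore_ok' untouched, while a
pair whose codes differ above the low 7 bits clears it.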

