From: Eli Zaretskii
Subject: Re: master def6fa4246 2/2: Speed up string-lessp for multibyte strings
Date: Sat, 08 Oct 2022 21:25:29 +0300

> From: Mattias Engdegård <mattiase@acm.org>
> Date: Sat, 8 Oct 2022 18:49:11 +0200
> Cc: emacs-devel <emacs-devel@gnu.org>
>
> On 7 Oct 2022, at 21:25, Eli Zaretskii <eliz@gnu.org> wrote:
> >
> >> + /* Two arbitrary multibyte strings: we cannot use memcmp because
> >> + the encoding for raw bytes would sort those between U+007F and U+0080
> >> + which isn't where we want them.
> >> + Instead, we skip the longest common prefix and look at
> >> + what follows. */
> >
> > I don't think I understand this; please elaborate. Didn't you say
> > that we never need to look beyond the first unequal byte? Then why
> > does the order of raw bytes matter here?
>
> The comment explains why memcmp cannot be used to compare arbitrary multibyte
> strings and it's exactly as it says: a bytewise comparison would not produce
> the same order as string-lessp has used in the past because of how we encode
> raw bytes, that's all.

As long as memcmp reports equality, we don't care, and once it reports
inequality, you can examine the first unequal bytes "by hand". Right?
So I still don't understand the comment and how it led you to the
conclusion.

I also asked about memmem -- did you consider using that?

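For readers following the thread, the idea of examining the first unequal bytes "by hand" can be sketched roughly as below. This is only an illustration, not the code from the commit: the function names are hypothetical, and it assumes that both buffers are well-formed in Emacs's internal multibyte encoding, that raw bytes are stored as two-byte sequences with lead byte 0xC0 or 0xC1, and that raw bytes should sort after every other character.

```c
#include <stddef.h>

/* Assumption: in the internal multibyte encoding, a raw byte is stored
   as a two-byte sequence whose lead byte is 0xC0 or 0xC1.  */
static int
raw_byte_lead_p (unsigned char c)
{
  return c == 0xC0 || c == 0xC1;
}

/* Hypothetical helper, not the code from the commit.
   Return a negative, zero or positive value if A sorts before,
   equal to, or after B.  Both buffers are assumed to be well-formed.  */
static int
sketch_multibyte_compare (const unsigned char *a, size_t alen,
                          const unsigned char *b, size_t blen)
{
  size_t n = alen < blen ? alen : blen;
  size_t i = 0;

  /* Skip the longest common prefix.  */
  while (i < n && a[i] == b[i])
    i++;

  /* One buffer is a prefix of the other: the shorter one sorts first.  */
  if (i == n)
    return (alen > blen) - (alen < blen);

  /* If exactly one of the first unequal bytes starts a raw byte, that
     side sorts last (assumed ordering).  A plain bytewise comparison
     would instead place it between U+007F and U+0080.  */
  int a_raw = raw_byte_lead_p (a[i]);
  int b_raw = raw_byte_lead_p (b[i]);
  if (a_raw != b_raw)
    return a_raw ? 1 : -1;

  /* Otherwise the first unequal bytes decide the order directly.  */
  return a[i] < b[i] ? -1 : 1;
}
```

With these assumptions, the buffer { 'a', 0xC0, 0x80 } (raw byte 0x80) compares greater than { 'a', 0xC2, 0x80 } (U+0080), whereas memcmp would order them the other way round.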
> > Are you sure about the alignment?
>
> Actually I had asked someone about that before and received the answer that
> string data alignment was guaranteed, and a semi-thorough reading of the code
> seemed to confirm this -- normal allocation ensures alignment via struct
> sdata (q.v.) and while AUTO_STRING does not, it only makes unibyte strings
> which do not concern us in the code path in question.
AFAIU, AUTO_STRING can also generate stack-allocated multibyte strings.
> > why no tests for this?
>
> `string-lessp` has much better test coverage than what is typical for Emacs
> primitives
For non-ASCII strings?