bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 3/3] dfa: optimize UTF-8 period


From: Jim Meyering
Subject: Re: [PATCH 3/3] dfa: optimize UTF-8 period
Date: Mon, 19 Apr 2010 21:44:31 +0200

Paolo Bonzini wrote:
> On 04/19/2010 10:07 AM, Jim Meyering wrote:
>> Paolo Bonzini wrote:
>>> On 04/17/2010 09:27 AM, Jim Meyering wrote:
>>>> Paolo Bonzini wrote:
>>>>> * NEWS: Document improvement.
>>>>> * src/dfa.c (struct dfa): Add utf8_anychar_classes.
>>>>> (add_utf8_anychar): New.
>>>>> (atom): Simplify if/else nesting.  Call add_utf8_anychar for ANYCHAR
>>>>> in UTF-8 locales.
>>>>> (dfaoptimize): Abort on ANYCHAR.
>>>>> ---
>>>>>    NEWS      |    6 ++++++
>>>>>    src/dfa.c |   46 +++++++++++++++++++++++++++++++++++++++++++---
>>>>>    2 files changed, 49 insertions(+), 3 deletions(-)
>>>>
>>>> Only quick superficial feedback for now:
>>>
>>> I pushed all patches but this.
>>
>> Thanks!
>>
>> Would you please add comments describing this one in more detail?
>> I ran out of time trying to understand how it works.
>
> Something like this?
>
>  /* For UTF-8 expand the period to a series of CSETs that define a valid
>     UTF-8 character.  This avoids using the slow multibyte path.  I'm
>     pretty sure it would be both profitable and correct to do it for
>     any encoding; however, the optimization must be done manually as
>     it is done above in add_utf8_anychar.  So, let's start with
>     UTF-8: it is the most used, and the structure of the encoding
>     makes the correctness more obvious.  */

Thanks.  That helps.
The key for me was to realize where this data came from:

+  static charclass utf8_classes[5] = {
+      {  0,  0,  0,  0, ~0, ~0, 0, 0 },
+      { ~0, ~0, ~0, ~0, 0, 0, 0, 0 },
+      {  0,  0,  0,  0,  0,  0, 0xfffffffcU, 0 },
+      {  0,  0,  0,  0,  0,  0,  0, 0xffff },
+      {  0,  0,  0,  0,  0,  0,  0, 0xff0000 }
+  };
+  const int n = sizeof (utf8_classes) / sizeof (utf8_classes[0]);
+  int i;
...

Then, this comment made a lot more sense:

+  /* (B|CA|DAA|EAAA) = (B|(C|(D|EA)A)A) =
+     B (C (D (E A CAT) OR A CAT) OR A CAT) OR */

A tiny comment saying that this describes the grammar of a UTF-8
character would go far in helping the uninitiated understand.

One final nit:
Please use "unsigned int" (not "int" and not "unsigned")
for both "i" and "n" above, since both are always non-negative.
That helps the reader, since then we don't have to wonder what
happens if they go negative, or if that can happen.

Thanks!




reply via email to

[Prev in Thread] Current Thread [Next in Thread]