Re: [PATCH 3/3] dfa: optimize UTF-8 period

bug-grep

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 3/3] dfa: optimize UTF-8 period

From:	Jim Meyering
Subject:	Re: [PATCH 3/3] dfa: optimize UTF-8 period
Date:	Mon, 19 Apr 2010 21:44:31 +0200

Paolo Bonzini wrote:
> On 04/19/2010 10:07 AM, Jim Meyering wrote:
>> Paolo Bonzini wrote:
>>> On 04/17/2010 09:27 AM, Jim Meyering wrote:
>>>> Paolo Bonzini wrote:
>>>>> * NEWS: Document improvement.
>>>>> * src/dfa.c (struct dfa): Add utf8_anychar_classes.
>>>>> (add_utf8_anychar): New.
>>>>> (atom): Simplify if/else nesting.  Call add_utf8_anychar for ANYCHAR
>>>>> in UTF-8 locales.
>>>>> (dfaoptimize): Abort on ANYCHAR.
>>>>> ---
>>>>>    NEWS      |    6 ++++++
>>>>>    src/dfa.c |   46 +++++++++++++++++++++++++++++++++++++++++++---
>>>>>    2 files changed, 49 insertions(+), 3 deletions(-)
>>>>
>>>> Only quick superficial feedback for now:
>>>
>>> I pushed all patches but this.
>>
>> Thanks!
>>
>> Would you please add comments describing this one in more detail?
>> I ran out of time trying to understand how it works.
>
> Something like this?
>
>  /* For UTF-8 expand the period to a series of CSETs that define a valid
>     UTF-8 character.  This avoids using the slow multibyte path.  I'm
>     pretty sure it would be both profitable and correct to do it for
>     any encoding; however, the optimization must be done manually as
>     it is done above in add_utf8_anychar.  So, let's start with
>     UTF-8: it is the most used, and the structure of the encoding
>     makes the correctness more obvious.  */

Thanks.  That helps.
The key for me was to realize where this data came from:

+  static charclass utf8_classes[5] = {
+      {  0,  0,  0,  0, ~0, ~0, 0, 0 },
+      { ~0, ~0, ~0, ~0, 0, 0, 0, 0 },
+      {  0,  0,  0,  0,  0,  0, 0xfffffffcU, 0 },
+      {  0,  0,  0,  0,  0,  0,  0, 0xffff },
+      {  0,  0,  0,  0,  0,  0,  0, 0xff0000 }
+  };
+  const int n = sizeof (utf8_classes) / sizeof (utf8_classes[0]);
+  int i;
...

Then, this comment made a lot more sense:

+  /* (B|CA|DAA|EAAA) = (B|(C|(D|EA)A)A) =
+     B (C (D (E A CAT) OR A CAT) OR A CAT) OR */

A tiny comment saying that this describes the grammar of a UTF-8
character would go far in helping the uninitiated understand.

One final nit:
Please use "unsigned int" (not "int" and not "unsigned")
for both "i" and "n" above, since both are always non-negative.
That helps the reader, since then we don't have to wonder what
happens if they go negative, or if that can happen.

Thanks!

[Prev in Thread]

Current Thread

[Next in Thread]

[PATCH 1/3] dfa: simplify dfainit, Paolo Bonzini, 2010/04/16
- [PATCH 2/3] dfa: fix {0,0}, Paolo Bonzini, 2010/04/16
  - Re: [PATCH 2/3] dfa: fix {0,0}, Jim Meyering, 2010/04/17
    - Re: [PATCH 2/3] dfa: fix {0,0}, Paolo Bonzini, 2010/04/18
    - Re: [PATCH 2/3] dfa: fix {0,0}, Jim Meyering, 2010/04/19
- [PATCH 3/3] dfa: optimize UTF-8 period, Paolo Bonzini, 2010/04/16
  - Re: [PATCH 3/3] dfa: optimize UTF-8 period, Jim Meyering, 2010/04/17
    - Re: [PATCH 3/3] dfa: optimize UTF-8 period, Paolo Bonzini, 2010/04/19
    - Re: [PATCH 3/3] dfa: optimize UTF-8 period, Jim Meyering, 2010/04/19
    - Re: [PATCH 3/3] dfa: optimize UTF-8 period, Paolo Bonzini, 2010/04/19
    - Re: [PATCH 3/3] dfa: optimize UTF-8 period, Jim Meyering <=
- Re: [PATCH 1/3] dfa: simplify dfainit, Jim Meyering, 2010/04/17

Prev by Date: Re: [PATCH v2] dfa: optimize UTF-8 period
Next by Date: [PATCH 1/2] maint: restrict scope of two globals to dfasearch.c
Previous by thread: Re: [PATCH 3/3] dfa: optimize UTF-8 period
Next by thread: Re: [PATCH 1/3] dfa: simplify dfainit
Index(es):
- Date
- Thread