[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [RFC] Add prototype pure-DFA matcher

From: Paolo Bonzini
Subject: Re: [RFC] Add prototype pure-DFA matcher
Date: Sun, 06 Dec 2009 14:47:55 +0100
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv: Gecko/20090922 Fedora/3.0-3.9.b4.fc12 Lightning/1.0pre Thunderbird/3.0b4

[going back on the list]

Thanks for this trimmed-down code. It at least gives an impression of
what is going on in GNU regex. But I would still like to start from scratch
for libunistring,
   1. because I want code that works the same way in UTF-8 as in UTF-32,

You can do that, I think, incrementally from the current code. It should be possible to tweak it so that OP_UTF8_PERIOD just "folds" into OP_PERIOD when you're processing UTF-8, and so on.

   2. in the hope that I can find a set of node/operators that is more
      efficient (peephole optimization, like you said it) and possibly
      integrate the "kwset" trick in some way.

I think the kwset search should be done separately (i.e. as preprocessing). The tree-based representation of regcomp.c should be quite amenable to extracting the required keywords.

One inefficiency I have in the posted code is that I try every single starting point, which makes for O(n^2) performance instead of O(n). It should be easy to prepend a .* in regcomp.c to fix this. But for everything else it is a pretty natural DFA implementation.

There are other optimizations possible, for example I think you only need MB_CUR_MAX elements in the state log so you could make it 8-entries long (or 4 if you only want the million-something Unicode characters). The current implementation still has a state log as big as the original string, which is a remnant of the support for backreferences.

It would be nice to have a POSIX (or almost POSIX) mode in libunistring. The multibyte DFA code is so bitrot in GNU grep that relying on libunistring for UTF-8 locales would be a good cleanup...


reply via email to

[Prev in Thread] Current Thread [Next in Thread]