Re: [Gperf-bugs] Patch to gperf 2.7.2 for case-independent keyword matching
Fri, 15 Feb 2002 11:44:17 -0500
Bruno Haible wrote:
> Bruce Lilly writes:
> > In several important situations, keywords must be matched
> > independent of case (email header field names, domain names,
> > etc.).
> The recommended way to do this is to convert the keywords to upper
> case (or lower case, as you like) before passing them to gperf.
The patch has been developed after careful consideration of the
alternatives and their implications.
Case conversion of keywords precludes storing them in a canonical
form, e.g. "MIME-Version" (or, at minimum, requires double the
space and additional programmer work to store canonical forms in
a structure along with the case-converted form).
It also requires conversion of keys for each lookup, and the case
conversion at run-time must be the same used for the keyword list
at compile time (including locale issues for 8-bit characters).
It requires extra space (for the converted copy of the string)
and additional function calls, not to mention additional work for
the programmer. One purpose of gperf is to simplify programming.
For a single hash function per compilation unit, the application
interface (the function that performs the case conversion and then
calls the gperf-generated function) has to provide a buffer for
the conversion (since the first argument points to a constant
string which cannot be modified). Depending on gperf flags, it
may be possible to determine the maximum size required. However,
since the key and length passed by the application may be
arbitrary, it will be necessary to check the length on each
character conversion to avoid buffer overflow and a segmentation
fault, or replicate the length check vs. MAX_WORD_LENGTH in the
application interface function. Alternatively, one could allocate
a buffer of the size given by the length argument using malloc,
but that is hardly likely to result in spectacular performance.
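The wrapper described above can be sketched roughly as follows. This
is an illustration only: in_word_set() is gperf's default lookup
function name, but the stub keyword table, the MAX_WORD_LENGTH value,
and the lookup_nocase() name are hypothetical stand-ins, not real
gperf output.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the gperf-generated function (gperf's default name
   is in_word_set); a real one would hash, not scan linearly.  The
   keywords have already been case-converted, per the "recommended
   way", so the canonical forms are lost. */
#define MAX_WORD_LENGTH 12
static const char *const wordlist[] = { "mime-version", "received" };

static const char *
in_word_set(const char *str, size_t len)
{
    size_t i;
    for (i = 0; i < sizeof wordlist / sizeof wordlist[0]; i++)
        if (strlen(wordlist[i]) == len
            && memcmp(wordlist[i], str, len) == 0)
            return wordlist[i];
    return 0;
}

/* The "recommended way" wrapper: copy and case-convert the key
   before calling the generated function.  Note the extra buffer,
   the per-character conversion pass over the whole key, and the
   length check that must be replicated here to avoid overflowing
   the fixed-size buffer. */
const char *
lookup_nocase(const char *key, size_t len)
{
    char buf[MAX_WORD_LENGTH + 1];
    size_t i;

    if (len > MAX_WORD_LENGTH)   /* duplicated length check */
        return 0;
    for (i = 0; i < len; i++)
        buf[i] = (char) tolower((unsigned char) key[i]);
    buf[len] = '\0';
    return in_word_set(buf, len);
}
```

Every lookup pays for the copy and the conversion of every character,
whether or not the key is in the table.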
Moreover, both the gperf-generated case-dependent lookup
function and the application interface are externally visible
functions, which may result in namespace collisions or the use of
the wrong function, and is poor programming practice (it would
help if there were a gperf option to make the gperf-generated
lookup function static).
For multiple hash functions per compilation unit, the table sizes
must be declared in an enum (--enum) to avoid compilation errors
or extraneous warnings, and the application interface must use
an arbitrary buffer size or use malloc. The issues of visibility
of multiple lookup functions and of namespace collisions remain.
> > The attached patch to gperf 2.7.2 provides the option to
> > generate output which does case-independent matching and
> > documents that option.
> gperf is a tool for maximum speed lookup. Your comparison function
> uses strcasecmp or strncasecmp at run time. This will not only
> case-convert the string being looked up more than once (namely, once
> in each comparison),
You are making unwarranted assumptions about how strcasecmp works.
Moreover, there are only multiple comparisons when there are hash
collisions. There is no case conversion involved in computing the
hash value, and if the hash value computed for a key is outside of
the range for the compiled keywords, there is no strcasecmp call.
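The shape of the patched lookup can be sketched as follows. The
two-keyword table, the trivial length-based hash, and the function
names here are illustrative stand-ins for what gperf generates, not
actual gperf output; the point is only the control flow: the hash is
computed from the raw key with no case conversion, and strcasecmp is
reached only when the hash falls inside the compiled range.

```c
#include <stddef.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Hypothetical generated table: empty slots hold null pointers. */
enum { MIN_HASH = 4, MAX_HASH = 8 };
static const char *const wordlist2[MAX_HASH + 1] = {
    0, 0, 0, 0, "Date", 0, 0, 0, "Received"
};

static unsigned
hash_key(const char *str, size_t len)
{
    (void) str;
    return (unsigned) len;  /* stand-in; real gperf hashes selected
                               characters, case-insensitively under
                               the patch */
}

const char *
in_word_set_nocase(const char *str, size_t len)
{
    unsigned key = hash_key(str, len);

    if (key >= MIN_HASH && key <= MAX_HASH) {
        const char *s = wordlist2[key];
        if (s && strcasecmp(str, s) == 0)  /* one case-blind compare */
            return s;                      /* canonical form returned */
    }
    return 0;   /* hash out of range: no compare, no conversion */
}
```

Note that the table can keep the canonical spellings ("Date",
"Received") and the caller gets them back unchanged.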
> but also case-convert the fixed keywords (which
> you already should have case-converted before running gperf).
Again, unwarranted assumptions about the strcasecmp implementation.
And the issues mentioned above (canonical forms, compile-time vs.
run-time behavior, namespace collisions, etc.) apply here as well.
> In summary, this patch runs counter the general objective of gperf of
> producing fast code.
I strongly suspect that you are making off-the-cuff remarks rather
than comments resulting from actual performance tests.
Let's look at the relevant cases:
1. unsuccessful lookup, hash value out of range.
Your "recommended way" is slower than the patch, since it requires
copying and conversion of the key plus additional function calls.
The patch results in no copying, no conversions, and no strcasecmp
call.
2. unsuccessful lookup, hash value in range but resulting in an
empty table slot.
Your "recommended way" has the same overhead as above. My patch
results in one call to strcasecmp which returns immediately as
one of the strings is empty. The patch is still faster than your
"recommended way".
3. unsuccessful lookup, key doesn't match keyword.
Your "recommended way" has the same overhead as above; N.B. every
character in the key must be examined, copied, and converted.
strcasecmp returns early at the first mismatch and therefore need
not examine every character (and in proper implementations, there
is no copying and table lookup is used rather than conversion).
4. successful lookup, no hash collisions.
The one successful strcasecmp call is still cheaper than your
"recommended way" of copying and converting the string, since
strcasecmp does no copying and requires no additional function
calls.
5. successful lookup, hash collisions.
In this least likely scenario (if there are many hash collisions,
an alternative to hash lookup, such as binary search, may be more
efficient), there may be multiple strcasecmp calls. Each call will
return at the first mismatch. Unlike your "recommended way", there
is no string copying and no additional function call overhead. The
patch is likely to provide better performance than your "recommended
way" except in case of quite long hash chains on long keywords with
identical prefixes (in which case, hash lookup is not a good method).
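The "proper implementation" point above is that typical strcasecmp
implementations fold case through a 256-entry lookup table and return
at the first mismatch, with no copying and no per-character function
call. A minimal sketch of that technique (an illustration of the
claim, not the code of any particular libc):

```c
/* 256-entry case-folding table, built once on first use. */
static unsigned char fold[256];
static int fold_ready;

static void
init_fold(void)
{
    int c;
    for (c = 0; c < 256; c++)
        fold[c] = (unsigned char) c;
    for (c = 'A'; c <= 'Z'; c++)
        fold[c] = (unsigned char) (c - 'A' + 'a');
}

/* Compare case-insensitively: one pass, table lookups only,
   early return at the first mismatching character. */
int
my_strcasecmp(const char *a, const char *b)
{
    const unsigned char *p = (const unsigned char *) a;
    const unsigned char *q = (const unsigned char *) b;

    if (!fold_ready) {
        init_fold();
        fold_ready = 1;
    }
    while (fold[*p] == fold[*q]) {
        if (*p == '\0')              /* equal through the terminator */
            return 0;
        p++, q++;
    }
    return fold[*p] - fold[*q];      /* early return at mismatch */
}
```

Against an empty slot (case 2 above), the very first table lookup
mismatches and the function returns immediately.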
> Other than that, the use of strcasecmp and strncasecmp may not work if
> the keyword set contains 8-bit characters and the locale at run time
> is set differently than at compile time.
For the specific applications mentioned, characters outside of the
US-ASCII range are forbidden and therefore irrelevant. As mentioned
above, there are compile-time vs. run time issues for your "recommended
way" also, and they are no different in one method vs. the other. In
fact, it may well be impossible to use hash techniques for case-
independent table lookup where the table contains non-ASCII characters,
unless the hash table is built at run-time in the same locale as the
lookup function (i.e. gperf is not applicable in such a case since the
hash table is pre-built).