bug-gperf

Re: Ann: Working on mph algos


From: Bruno Haible
Subject: Re: Ann: Working on mph algos
Date: Wed, 23 Feb 2022 11:02:01 +0100

Hello Reini,

> The original code: https://github.com/rurban/nbperf (BSD3 Licensed)
> I'll bug assign@gnu for the assignment.
> Do we need permission from the original author also, or is BSD enough?

Per GNU policy, if you want code to be included upstream, a copyright
assignment to the FSF is required. In special cases, GNU packages also
incorporate code without copyright assignment, but each such case
weakens the legal defense against companies and institutions that want
to abuse the code and treat the copyleft with contempt. And the number
of such parasitic companies and institutions has grown a lot in the
last three years.

Another problem is that we want source code in the packages, not binary
blobs. The 5 KB of data used by mi_vector_hash do not look like source code.
Recall: "The source code for a work means the preferred form of the work
for making modifications to it."

> Of course it's much slower than the default gperf algo, but for large
> lists gperf is not well suited. Also the ordering property of CHM is nice.

I agree that gperf is not well suited for large keyword sets; this is
documented in
https://www.gnu.org/software/gperf/manual/html_node/Bugs.html

A solution to this problem should IMO come
  - as an alternative algorithm,
  - with maintainable source code.

What I don't see at this point:

  - Why would it be better to have 3 algorithms than one?
    I mean, the user of gperf does not want to experiment a lot before
    getting a satisfying result. The fewer command line options the
    user has to try, the better.

  - What does the "ordering property of CHM" mean, and why is it "nice"?
    I mean, a hash table is meant to implement a map
       string -> some DATA or NULL
    What is the abstraction that a hash function with ordering would
    provide?
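
    For concreteness, here is the map abstraction I mean, as a minimal C
    sketch. The keyword table and the linear scan are illustrative only,
    not gperf output; a real gperf-generated in_word_set() computes a
    perfect hash in O(strlen(s)) and does a single string comparison.

    ```c
    #include <stddef.h>
    #include <string.h>
    #include <stdio.h>

    /* Illustrative keyword -> DATA table (not generated by gperf). */
    struct keyword { const char *name; int token; };

    static const struct keyword wordlist[] = {
        {"if",    1},
        {"else",  2},
        {"while", 3},
    };

    /* The abstraction: string -> some DATA or NULL.
       Sketched here with a linear scan for clarity. */
    static const struct keyword *in_word_set(const char *str, size_t len)
    {
        for (size_t i = 0; i < sizeof wordlist / sizeof wordlist[0]; i++)
            if (strlen(wordlist[i].name) == len
                && memcmp(wordlist[i].name, str, len) == 0)
                return &wordlist[i];
        return NULL;
    }

    int main(void)
    {
        const struct keyword *k = in_word_set("while", 5);
        printf("%d\n", k ? k->token : -1);                        /* 3 */
        printf("%s\n", in_word_set("for", 3) ? "hit" : "NULL");   /* NULL */
        return 0;
    }
    ```

    The question stands: what extra abstraction would a hash function
    "with ordering" add on top of this lookup interface?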

  - Maintainability: The code should also be understandable by someone
    who has not read a 13-page paper by Botelho.

So, for the goal of solving gperf's problem with large key sets:

  What are, in the current research, the algorithms for generating
  a decent (O(strlen(s))) hash function with an O(1) lookup - not
  necessarily a perfect hash function? The code in nbperf is from 2009;
  presumably other algorithms have been published in the last 13 years.
  How do they compare?

Bruno





