On Thu, Feb 14, 2019 at 1:55 PM João Távora <address@hidden> wrote:
> On Thu, Feb 14, 2019 at 1:36 PM Stefan Monnier <address@hidden> wrote:
> > > I don't think it's odd. I call the former "scattered match"
> > > and the latter a "tighter match".
> > Don't you find it odd that "foo" gives a better score to
> > "fotttttttttttttttto" than to "foto" ?
> Perhaps. But as it stands it doesn't make sense to
> compare scores across different length strings.
> It does make sense to compare the score between
> foo's matches of foot and foto.
> > [ Given them the same score sounds acceptable, tho. ]
> Feel free to add some kind of normalization to the function
> if you can come up with it. Or come up with another function
> entirely. As I said, a function that measures the distance
> in "boring editing operations" between the pattern and the target
> would be a nicer measure.
The Levenshtein distance and close variants generally do a reasonable job of approximating human expectations, yes. Close variants (eg: only delete/insert, no mutate) also work well.
The cost of building the Levenshtein edit distance matrix for picking or ordering candidates is quite high, though; https://en.wikipedia.org/wiki/Approximate_string_matching
gives a good summary, and you really don't want to use the most naive option if you can avoid it. Computation costs go through the roof very fast; the cost is roughly proportional to the length of the candidate cubed, using the brute force approach.
The other shortfall of using only edit distance is that most of the time people expect to front-weight the search: you may have a closer edit distance match later in the string, but people usually want a less exact match, closer to the start, intuitively.
I've been experimenting with a simpler function that just counts the
number of "holes" and the length of those holes separately
in the denominator. The numerator is the same and a perfect
match is still a 1. It seems to fare better for your cases. For foo
I'd very, very strongly encourage y'all to take a look at the existing work on the problem, not least because of the algorithmic complexity costs when you apply this to a large corpus of candidates. Consider, for example, completion over the set of all defined symbols in Emacs, which is ~ 55,000 candidates, and IIRC I calculated the average length to 9 characters, so brute force matching would be be on the order of 40 **million** comparisons for the first pattern character, though thankfully dropping fast as you eliminate non-matching candidates entirely. (Raw comparisons, accounting for all possible matches in all possible candidate, of that pattern, so each candidate gets $length / $pattern_length comparisons performed.)
Anyway, the point is: this is an area where serious study has been invested, and I strongly urge you to take advantage of that, rather than trying to reinvent this surprisingly complex wheel. This isn't simple to do, and it definitely isn't simple to do in a high performance way.
PS: whatever else you do, make it trivial to configure what function is used for ordering the candidate results. That matters far more that the details of the default match / order choice, because it allows easy innovation over time.
PPS: the best modern tools seem to be derived from the problem of genome sequence matching, but heuristics are definitely helpful too. Again, fzf is the best version of that I have found.