emacs-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] add 'string-distance' to calculate Levenshtein distance


From: Eli Zaretskii
Subject: Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
Date: Sat, 14 Apr 2018 10:05:10 +0300

> From: Chen Bin <address@hidden>
> Date: Sat, 14 Apr 2018 12:35:08 +1000
> 
> Attached patch implemented 'string-distance' is much faster then
> existing Lisp implementation like 'org-babel-edit-distance'.

Thanks, please see a few comments below.

> +DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2, 2, 0,
> +       doc: /* Return Levenshtein distance of two strings.

Please mention the arguments in the first line of the doc string.

> +Case is significant, but text properties are ignored. */)
> +  (register Lisp_Object string1, Lisp_Object string2)

We don't like to use 'register' in new code, we prefer leaving this to
the compiler.

> +  CHECK_STRING (string1);
> +  CHECK_STRING (string2);
> +
> +  char *s1 = SSDATA (string1);
> +  char *s2 = SSDATA (string2);
> +  unsigned int s1len, s2len, x, y, lastdiag, olddiag;
> +  s1len = strlen(s1);
> +  s2len = strlen(s2);

I presume this function is supposed to work only on strings in their
internal representation (a.k.a. "multibyte strings"), so the function
should check that and signal an error if that's not so.
Alternatively, check that either both strings are unibyte or both are
multibyte, and signal an error if not.

> +  unsigned int column[s1len+1];

I'm not sure this is portable enough, but even if it is, it's not a
good idea to unconditionally make automatic variables in this case,
because s1len and s2len could be very large, in which case you will
get stack overflow.  Please use SAFE_ALLOCA instead.

> +  for (y = 1; y <= s1len; y++)
> +    column[y] = y;
> +  for (x = 1; x <= s2len; x++) {
> +    column[0] = x;
> +    for (y = 1, lastdiag = x-1; y <= s1len; y++) {
> +      olddiag = column[y];
> +      column[y] = min3(column[y] + 1, column[y-1] + 1, lastdiag + (s1[y-1] 
> == s2[x-1] ? 0 : 1));
> +      lastdiag = olddiag;

Is this algorithm really better than allocating just the diag matrix
(or part thereof), and accessing the string data directly, without
copying them inti local arrays?

> +  return make_number(column[s1len]);

Style: we leave a single blank between the function's name and the
following opening parenthesis (you have a few of these scattered
through the code).

Finally, this new primitives needs documentation (NEWS and the ELisp
manual), and also a couple of tests in the test suite.

Thanks again for working on this.



reply via email to

[Prev in Thread] Current Thread [Next in Thread]