emacs-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] add 'string-distance' to calculate Levenshtein distance


From: Eli Zaretskii
Subject: Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
Date: Sat, 14 Apr 2018 16:24:41 +0300

[Please CC the mailing list when you respond, so others could see your
messages.]

> From: Chen Bin <address@hidden>
> Date: Sat, 14 Apr 2018 21:01:46 +1000
> 
> Hi, Eli,
> Thanks for the review.
> 
> I fixed most issues except two things.
> 
> 1. In Asia, it's possible to cacluate distance between one unibyte and
> one multibyte string. As a Chinese, I might create a document containing
> Hanzi characters whose file name is obviously multibyte string. I may 
> need get the distance of this document to a file named "README.txt". 

If you mean unibyte pure-ASCII strings, then I agree.  But that
doesn't mean we should avoid the text altogether, because we might
compute non-zero distance between a string and its encoded unibyte
variant, which will confuse users.  At the very least the doc string
should say something about this.

> 2. Algorithm is based on 
> https://en.wikibooks.org/w/index.php?title=Algorithm_Implementation/Strings/Levenshtein_distance&stable=0#C
> It's optimized to use O(min(m,n)) space instead of O(mn).
> Say you compare two string whose string length is 512 bytes.
> You only need allocate 512 bytes instead of 262K (512*512)
> in memory.
> 
> Please check attached patch for latest code.
> 
> --- a/etc/NEWS
> +++ b/etc/NEWS
> @@ -463,6 +463,8 @@ x-lost-selection-hooks, x-sent-selection-hooks
>  +++
>  ** New function assoc-delete-all.
>  
> +** New function string-distance.

This should mention Levenshtein distance.

> +DEFUN ("string-distance", Fstring_distance, Sstring_distance, 2, 2, 0,
> +       doc: /* Return Levenshtein distance of STRING1 and STRING2.
                                              ^^^^^^^^^^^^^^^^^^^^^^
"between STRING1 and STRING2"

> +  unsigned int s1len, s2len, x, y, lastdiag, olddiag;

These variables should be declared EMACS_INT, not unsigned int,
because the size of Emacs strings can be larger than UINT_MAX,
especially on 64-bit systems.

> +  unsigned int *column = SAFE_ALLOCA ((s1len + 1) * sizeof (unsigned int));

Likewise here.

> +  char *s1 = SSDATA (string1);
> +  char *s2 = SSDATA (string2);
> +
> +  unsigned int s1len, s2len, x, y, lastdiag, olddiag;
> +  s1len = strlen(s1);
> +  s2len = strlen(s2);

You could optimize the code by using SCHARS and SBYTES, instead of
calling strlen.

> +(ert-deftest subr-tests--string-distance ()
> +  "Test `string-distance' behavior."
> +  (should (equal 1 (string-distance "heelo" "hello")))
> +  (should (equal 2 (string-distance "aeelo" "hello")))
> +  (should (equal 0 (string-distance "ab" "ab")))
> +  (should (equal 1 (string-distance "ab" "abc"))))

Could you please add a test or two with non-ASCII characters?

Thanks.



reply via email to

[Prev in Thread] Current Thread [Next in Thread]