octave-patch-tracker
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Octave-patch-tracker] [patch #8426] Speeding up the edit distance


From: Max G.
Subject: [Octave-patch-tracker] [patch #8426] Speeding up the edit distance
Date: Fri, 28 Mar 2014 17:06:16 +0000
User-agent: Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/33.0.1750.152 Chrome/33.0.1750.152 Safari/537.36

URL:
  <http://savannah.gnu.org/patch/?8426>

                 Summary: Speeding up the edit distance
                 Project: GNU Octave
            Submitted by: maxg
            Submitted on: Fr 28 Mär 2014 17:06:15 GMT
                Category: None
                Priority: 5 - Normal
                  Status: None
                 Privacy: Public
             Assigned to: None
        Originator Email: 
             Open/Closed: Open
         Discussion Lock: Any

    _______________________________________________________

Details:

Calculating the Levenshtein distance provided by the package "strings" is a
pain.

It is unbelievable slow, not only because it uses the algorithm of Wagner and
Fisher, but also because it does this using loops. That leads to calculation
times of 10 seconds for simple strings of length 200 and wastes plenty of
memory. See for example http://savannah.gnu.org/bugs/?41920 .

Therefore I implemented the algorithm of Berghel and Roach
(http://www.berghel.net/publications/asm/asm.php), who improve on Ukkonen
(http://www.sciencedirect.com/science/article/pii/S0019995885800462).

Great things first: Now it is possible to calculate

editdistance(repmat('AB',1,20000),repmat('BA',1,20000)

in less than 2 seconds.

The new version is fully compatible with the former version, meaning that
(i) it calculates a specific matrix when asked for
(ii) can apply some specific weights

That comes at the cost of
(i) a new optional input parameter
(ii) not being able to use the new algorithm all the time.

Depending on its preferences, the user now can decide to
(i) use the former algorithm
(ii) use the algorithm of Berghel and Roach, which is almost always better but
may have space issues as well
(iii) use a version with m*n runtime but linear memory usage.

I tested the correctness of the new algorithms by running the provided script
uncountable times. Please note, that this produces test which are almost worst
case for the algorithm by Berghel and Roach.

Please note further, that there exists a version of the algorithm used here,
that uses only O(dist) memory
(https://code.google.com/p/google-web-toolkit/source/browse/trunk/dev/core/src/com/google/gwt/dev/util/editdistance/ModifiedBerghelRoachEditDistance.java?r=8928).
However, I do not have the time to implement that one.

I really hope this helps and will be applied.



    _______________________________________________________

File Attachments:


-------------------------------------------------------
Date: Fr 28 Mär 2014 17:06:15 GMT  Name: editdistance.m  Size: 6kB   By: maxg

<http://savannah.gnu.org/patch/download.php?file_id=31070>
-------------------------------------------------------
Date: Fr 28 Mär 2014 17:06:15 GMT  Name: testEditDistance.m  Size: 614B   By:
maxg

<http://savannah.gnu.org/patch/download.php?file_id=31071>

    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/patch/?8426>

_______________________________________________
  Nachricht gesendet von/durch Savannah
  http://savannah.gnu.org/




reply via email to

[Prev in Thread] Current Thread [Next in Thread]