[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
pdist.m and squareform.m
From: |
Bill Denney |
Subject: |
pdist.m and squareform.m |
Date: |
Sun, 22 Oct 2006 17:20:12 -0400 |
User-agent: |
Thunderbird 1.5.0.7 (Windows/20060909) |
Here are the pdist and squareform functions. pdist.m is used for
clustering, and squareform is just used to better view the results of pdist.
Bill
scripts/ChangeLog:
2006-10-22 Bill Denney <address@hidden>
* statistics/base/pdist.m, statistics/base/squareform.m: new
functions for clustering analysis
## Copyright (C) 2006 Bill Denney <address@hidden>
##
## This file is part of Octave.
##
## Octave is free software; you can redistribute it and/or modify it
## under the terms of the GNU General Public License as published by
## the Free Software Foundation; either version 2, or (at your option)
## any later version.
##
## Octave is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Octave; see the file COPYING. If not, write to the Free
## Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
## 02110-1301, USA.
## -*- texinfo -*-
## @deftypefn {Function File} address@hidden =} pdist (@var{x})
## @deftypefnx {Function File} address@hidden =} pdist (@var{x}, @var{distfun})
## @deftypefnx {Function File} address@hidden =} pdist (@var{x}, @var{distfun},
@var{distfunarg}, @dots{})
## Return the distance between any two rows in @var{x}.
##
## @var{x} is the matrix (n x m) to determine the distance between. If
## no @var{distfun} is given, then the 'euclidean' distance is assumed.
## @var{distfun} may be any of these or a function handle to a user
## defined function that takes two arguments distfun (@var{u}, @var{V})
## where @var{u} is a the row (1 x m) that is having its distance taken
## relative to @var{V} (a p x m matrix).
##
## The output vector, @var{y}, is (n - 1) * (n / 2) long where the
## distances are in the order [(1, 2); (1, 3); @dots{}; (2, 3); @dots{};
## (n-1, n)].
##
## Any additional arguments after the @var{distfun} are passed as
## distfun (@var{u}, @var{V}, @var{distfunarg1}, @var{distfunarg2} @dots{}).
##
## Pre-defined distance functions are:
##
## @table @samp
## @item "euclidean"
## Euclidean distance (default)
##
## @item "seuclidean"
## Standardized Euclidean distance. Each coordinate in the sum of
## squares is inverse weighted by the sample variance of that
## coordinate.
##
## @item "mahalanobis"
## Mahalanobis distance
##
## @item "cityblock"
## City Block metric (aka manhattan distance)
##
## @item "minkowski"
## Minkowski metric (with a default parameter 2)
##
## @item "cosine"
## One minus the cosine of the included angle between points (treated as
## vectors)
##
## @item "correlation"
## One minus the sample correlation between points (treated as
## sequences of values).
##
## @item "spearman"
## One minus the sample Spearman's rank correlation between
## observations, treated as sequences of values
##
## @item "hamming"
## Hamming distance, the percentage of coordinates that differ
##
## @item "jaccard"
## One minus the Jaccard coefficient, the percentage of nonzero
## coordinates that differ
##
## @item "chebychev"
## Chebychev distance (maximum coordinate difference)
## @end table
## @seealso{cluster,squareform}
## @end deftypefn
## Author: Bill Denney <address@hidden>
function y = pdist (x, distfun, varargin)
if (nargin < 1)
print_usage ();
elseif (nargin > 1) && ...
! (ischar (distfun) || ...
strcmp (class(distfun), "function_handle"))
error ("pdist: the distance function must be either a string or a function
handle.");
endif
if (nargin < 2)
distfun = "euclidean";
endif
if (isempty (x))
error ("pdist: x cannot be empty");
elseif (length (size (x)) > 2)
error ("pdist: x must be 1 or 2 dimensional");
endif
sx1 = size (x, 1);
y = [];
## compute the distance
for i = 1:sx1
tmpd = feval (distfun, x(i,:), x(i+1:sx1,:), varargin{:});
y = [y;tmpd(:)];
endfor
endfunction
## the different standardized distance functions
function d = euclidean(u, v)
d = sqrt (sum ((repmat (u, size (v,1), 1) - v).^2, 2));
endfunction
function d = seuclidean(u, v)
## FIXME
error("Not implemented")
endfunction
function d = mahalanobis(u, v, p)
repu = repmat (u, size (v,1), 1);
d = (repu - v)' * inv (cov (repu, v)) * (repu - v);
d = d.^(0.5);
endfunction
function d = cityblock(u, v)
d = sum (abs (repmat (u, size(v,1), 1) - v), 2);
endfunction
function d = minkowski
if (nargin < 3)
p = 2;
endif
d = (sum (abs (repmat (u, size(v,1), 1) - v).^p, 2)).^(1/p);
endfunction
function d = cosine(u, v)
repu = repmat (u, size (v,1), 1);
d = dot (repu, v, 2) ./ (dot(repu, repu).*dot(v, v));
endfunction
function d = correlation(u, v)
repu = repmat (u, size (v,1), 1);
d = cor(repu, v);
endfunction
function d = spearman(u, v)
repu = repmat (u, size (v,1), 1);
d = spearman (repu, v);
endfunction
function d = hamming(u, v)
## Hamming distance, the percentage of coordinates that differ
sv2 = size(v, 2);
for i = 1:sv2
v(:,i) = (v(:,i) == u(i));
endfor
d = sum (v,2)./sv2;
endfunction
function d = jaccard(u, v)
## Jaccard distance, one minus the percentage of non-zero coordinates
## that differ
sv2 = size(v, 2);
for i = 1:sv2
v(:,i) = (v(:,i) == u(i)) && (u(i) || v(:,i));
endfor
d = 1 - sum (v,2)./sv2;
endfunction
function d = chebychev(u, v)
repu = repmat (u, size (v,1), 1);
d = max (abs (repu - v), [], 2);
endfunction
## Copyright (C) 2006 Bill Denney <address@hidden>
##
## This file is part of Octave.
##
## Octave is free software; you can redistribute it and/or modify it
## under the terms of the GNU General Public License as published by
## the Free Software Foundation; either version 2, or (at your option)
## any later version.
##
## Octave is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Octave; see the file COPYING. If not, write to the Free
## Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
## 02110-1301, USA.
## -*- texinfo -*-
## @deftypefn {Function File} address@hidden =} squareform (@var{x})
## @deftypefnx {Function File} address@hidden =} squareform (@var{x},
"tovector")
## @deftypefnx {Function File} address@hidden =} squareform (@var{x},
"tomatrix")
## Convert a vector from the pdist function into a square matrix or from
## a square matrix back to the vector form.
##
## The second argument is used to specify the output type in case there
## is a single element.
## @seealso{pdist}
## @end deftypefn
## Author: Bill Denney <address@hidden>
function y = squareform (x, method)
if nargin < 1
print_usage ();
elseif nargin < 2
if isscalar (x) || isvector (x)
method = "tomatrix";
elseif issquare (x)
method = "tovector";
else
error ("squareform: cannot deal with a nonsquare, nonvector input");
endif
endif
method = lower (method);
if ! strcmp ({"tovector" "tomatrix"}, method)
error ("squareform: method must be either \"tovector\" or \"tomatrix\"");
endif
if strcmp ("tovector", method)
if ! issquare (x)
error ("squareform: x is not a square matrix");
endif
sx = size (x, 1);
y = zeros ((sx-1)*sx/2, 1);
idx = 1;
for i = 2:sx
newidx = idx + sx - i;
y(idx:newidx) = x(i:sx,i-1);
idx = newidx + 1;
endfor
else
## we're converting to a matrix
## the dimensions of y are the solution to the quadratic formula for:
## length(x) = (sy-1)*(sy/2)
sy = (1 + sqrt (1+ 8*length (x)))/2;
y = zeros (sy);
for i = 1:sy-1
step = sy - i;
y((sy-step+1):sy,i) = x(1:step)';
x(1:step) = [];
endfor
y = y + y';
endif
endfunction
## make sure that it can go both directions automatically
%!assert(squareform(1:6), [0 1 2 3;1 0 4 5;2 4 0 6;3 5 6 0])
%!assert(squareform([0 1 2 3;1 0 4 5;2 4 0 6;3 5 6 0]), [1:6]')
## make sure that the command arguments force the correct behavior
%!assert(squareform(1), [0 1;1 0])
%!assert(squareform(1, "tomatrix"), [0 1;1 0])
%!assert(squareform(1, "tovector"), [])
- pdist.m and squareform.m,
Bill Denney <=