Re: moving toward a 3.0 release

octave-maintainers

From:	David Bateman
Subject:	Re: moving toward a 3.0 release
Date:	Thu, 28 Sep 2006 00:00:40 +0200
User-agent:	Thunderbird 1.5.0.7 (X11/20060921)
John W. Eaton wrote:
> On 27-Sep-2006, David Bateman wrote:
> 
> | I had hoped to get it ready for 2.9.9 but the segfault I'm having is
> | proving rather persistent. If you want to try to help diagnose the
> | problem, take the version of eigs attached (some small bug fixes
> | relative to the last version) and try
> | 
> | segtest(10000)
> | 
> | function aerr = segtest(iter)
> |   %% This will seg-fault octave consistently, but not matlab.
> |   n=20;
> |   k=4;
> |   A = sparse([3:n,1:n,1:(n-2)],[1:(n-2),1:n,3:n],[ones(1,n-2),4*ones(1,n),-ones(1,n-2)]);
> |   opts.disp = 0;
> |   aerr = 0;
> |   for i=1:iter
> |     [v1,d1] = eigs(A, k, 'sr', opts);
> |     d1 = diag(d1);
> |     merr = 0;
> |     for i=1:k
> |       newerr = max(abs((A - d1(i)*speye(n))*v1(:,i)));
> |       if (newerr > merr)
> |         merr = newerr;
> |       end
> |     end
> |     fprintf('Max Err: %g\n', merr);
> |     if (merr > aerr)
> |       aerr = merr;
> |     end
> |   end
> | end
> | 
> | I can get it to seg-fault about once every 20000 by enlarging some of
> | the dneupd and daupd work arrays above the recommended sizes, but can't
> | eliminate it. valgrind seems to indicate that it's the variables "v",
> | "dr" and "di" allocated with OCTAVE_LOCAL_BUFFER that are causing the
> | problems. Looking at arpack++, they add one to the recommended values
> | and that seems to make the dominant error the one due to the variable v.
> | BTW, FreeMat also seems to have the same issue, and I can crash it in
> | much the same way.
> | 
> | One difference I see with arpack++ relative to octave is that arpack++
> | uses the new/delete c++ operators on the double, etc types, rather than
> | the std::vector class as the OCTAVE_LOCAL_BUFFER code currently does.
> | Though, why that should make a difference, I don't know. I'll try and
> | see if it helps..
> 
> I ran the example and it also crashed for me, but I don't think that I
> can effectively debug this since I know nothing about arpack, and your
> function that uses it is fairly large, so it is difficult for me to
> know whether the calls to the arpack routines are correct (have
> correctly dimensioned arrays, etc.).
> 
> I see the crash that looks like this:
> 
> *** glibc detected *** malloc(): memory corruption: 0x00000000017bd7c0 ***
> 
> Program received signal SIGABRT, Aborted.
> [Switching to Thread 46994007574704 (LWP 7885)]
> 0x00002abda4db907b in raise () from /lib/libc.so.6
> (gdb) where
> #0  0x00002abda4db907b in raise () from /lib/libc.so.6
> #1  0x00002abda4dba84e in abort () from /lib/libc.so.6
> #2  0x00002abda4def639 in __fsetlocking () from /lib/libc.so.6
> #3  0x00002abda4df6892 in free () from /lib/libc.so.6
> #4  0x00002abda4df81ad in malloc () from /lib/libc.so.6
> #5  0x00002abda4b38e1d in operator new () from /usr/lib/libstdc++.so.6
> #6  0x00002abda4b38f29 in operator new[] () from /usr/lib/libstdc++.so.6
> #7  0x00002abda5e406cc in ArrayRep (this=0x168b400, n=6)
>     at /usr/include/octave-2.9.8/octave/Array.h:70
> #8  0x00002abda5e40f38 in Array (this=0x7fffffd04500, n=6)
>     at /usr/include/octave-2.9.8/octave/Array.h:187
> #9  0x00002abda5e40fad in MArray (this=0x7fffffd04500, n=6)
>     at /usr/include/octave-2.9.8/octave/MArray.h:50
> #10 0x00002abda5e40fdd in ComplexColumnVector (this=0x7fffffd04500, n=6)
>     at /usr/include/octave-2.9.8/octave/CColVector.h:41
> #11 0x00002abda5e38f15 in Feigs (address@hidden, nargout=2) at eigs.cc:1285
> 
> That line of eigs.cc is the constructor for eig_val, after the call to
> dneupd.
> 
>                     F77_FUNC (dneupd, DNEUPD) 
>                       (rvec, F77_CONST_CHAR_ARG2 ("A", 1),
>                        sel, dr, di, z, n, sigmar, sigmai, workev, 
>                        F77_CONST_CHAR_ARG2 (&bmat, 1), n,
>                        F77_CONST_CHAR_ARG2 ((typ.c_str ()), 2),
>                        k, tol, presid, p, v, n, iparam,
>                        ipntr, workd, workl, lwork, info2
>                        F77_CHAR_ARG_LEN(1) F77_CHAR_ARG_LEN(1) 
>                        F77_CHAR_ARG_LEN(2));
> 
>                     if (f77_exception_encountered)
>                       {
>                         error ("eigs: unrecoverable exception encountered in dneupd");
>                         goto eigs_err;
>                       }
> 
>                     ComplexColumnVector eig_val (k+1);
> 
> Are all the arrays (not just the work arrays) that are passed to
> dneupd the correct size? 

Not only are they the correct size, they are larger than recommended.
Valgrind still points to dneupd writing beyond the end of, in
particular, the variables v, dr, and di, no matter how big I make them.
Note that eig_val isn't even passed to arpack!

> Are you sure they are not corrupted in some
> way even before the call to dneupd?  It is possible that there is a
> buffer overwriting problem that happens even before that call.

The first relevant error (there is also a conditional branch on an
uninitialized value in the libc vnprintf code when called from Ffprintf) is

==11854==
==11854== Invalid write of size 8
==11854==    at 0x1D52F244: (within /usr/lib/atlas/P4SSE2/libblas.so.3.0)
==11854==  Address 0x1F748060 is 0 bytes after a block of size 56 alloc'd
==11854==    at 0x1B900070: operator new[](unsigned) (vg_replace_malloc.c:197)
==11854==    by 0x1F97C5F5: Feigs(octave_value_list const&, int) (eigs.cc:1276)

Note that line 1276 for me is a different line than in the eigs I sent, as
I modified the code to use new/delete. In any case, this line is

                      double *dr = new double [k + 3];

for me. However I saw exactly the same issue with

                      OCTAVE_LOCAL_BUFFER (double, dr, k + 3);

Note that the dneupd.f file suggests that dr should be "k+1" in size.
However if I make it that small the crash happens at about the 4th
iteration rather than the 10000-th.
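
For reference, with k = 4 the "k + 3" allocation is 7 doubles, which is
the 56-byte block valgrind reports above. One way to measure how far past
the end a callee writes, without valgrind, is to pad the buffer with
sentinel values and inspect them after the call. The following is only a
minimal sketch: GuardedBuffer, fake_callee and demo_overrun are made-up
names for illustration, not part of Octave or ARPACK, and fake_callee
merely stands in for the Fortran routine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: a double buffer with sentinel-filled guard zones on
// both sides, so stray writes by a callee can be detected after the call.
class GuardedBuffer
{
public:
  GuardedBuffer (std::size_t n, std::size_t guard = 16)
    : m_n (n), m_guard (guard), m_store (n + 2 * guard, sentinel ())
  {
    // Zero the usable region; the guard zones keep the sentinel value.
    for (std::size_t i = 0; i < n; i++)
      m_store[guard + i] = 0.0;
  }

  // Pointer to the usable region, suitable for passing to the callee.
  double *data () { return m_store.data () + m_guard; }

  // Distance (in elements) of the furthest modified element past the end;
  // 0 means the tail guard zone is intact.
  std::size_t overrun () const
  {
    for (std::size_t i = m_guard; i > 0; i--)
      if (m_store[m_guard + m_n + i - 1] != sentinel ())
        return i;
    return 0;
  }

  // Same check for writes before the start of the usable region.
  std::size_t underrun () const
  {
    for (std::size_t i = m_guard; i > 0; i--)
      if (m_store[m_guard - i] != sentinel ())
        return i;
    return 0;
  }

private:
  // Arbitrary value a numerical routine is unlikely to produce by chance.
  static double sentinel () { return -7.25e+300; }

  std::size_t m_n, m_guard;
  std::vector<double> m_store;
};

// Hypothetical stand-in for the Fortran routine: writes `count` values
// starting at dr[0], regardless of how large the caller's buffer is.
static void fake_callee (double *dr, std::size_t count)
{
  for (std::size_t i = 0; i < count; i++)
    dr[i] = static_cast<double> (i + 1);
}

// Returns how far past the end of a (k + 3)-element buffer the callee wrote.
std::size_t demo_overrun (std::size_t k, std::size_t written)
{
  GuardedBuffer dr (k + 3);
  fake_callee (dr.data (), written);
  return dr.overrun ();
}
```

The check only sees writes that land inside the guard zones and that do
not happen to store the sentinel value itself, so it complements valgrind
rather than replacing it.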

> If I had to debug this, I think my strategy would be to eliminate
> Octave from the equation and find out whether I could duplicate the
> crash using a stripped down Fortran-only program.  If the crash could
> be duplicated there, then the bug is either in arpack or my
> understanding of how the code is supposed to be used.  If the crash
> does not happen with the simpler case, then I'm not sure how I would
> isolate the error given the current structure of the code.

Ok, I can try that..

> 
> Since it seems that calls to these functions are relatively complex,
> it would be nice to have another layer around the Fortran at the
> liboctave level so that if someone wanted to use this functionality in
> C++ they could do it more easily.  That is secondary to finding
> out the cause of the crash, but it might help to be able to call this
> code directly from a C++ program without all of Octave in the way.

Yeah, I got a bit carried away, and this oct-file is huge. I'm not sure
where to make the cut though. Into real symmetric, real non-symmetric and
complex cases? Or should I cut it even finer, with the standard and
general eigenvalue problems treated separately? There are also lots of
sub-cases for whether the matrix B in the general problem is a Cholesky
factorization or whether A is a matrix or a function.

One thing missing in the current code that will probably push me to
extract it into a class is the treatment of full matrices. At the moment
they are treated as sparse matrices, though I'm sure that introduces a
speed penalty.
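
As a very rough idea of what such a class could look like for the real
non-symmetric case, here is a sketch that only owns the arrays and fixes
their sizes in one place. The names EigsKind and NonsymWorkspace are
hypothetical, the sizes are what I read in the header comments of
dnaupd.f and dneupd.f and should be treated as assumptions to check, and
the "extra" padding argument is only there to make experiments like the
"k + 3" one easy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical top-level split of the oct-file into three drivers.
enum class EigsKind { RealSymmetric, RealNonsymmetric, Complex };

// Hypothetical workspace for the real non-symmetric driver.
//   n: order of the matrix
//   k: number of eigenvalues requested (NEV)
//   p: number of Arnoldi vectors (NCV)
// Sizes are assumptions taken from the ARPACK source comments.
struct NonsymWorkspace
{
  std::size_t n, k, p;
  std::vector<double> resid;   // n
  std::vector<double> v;       // n by p
  std::vector<double> workd;   // 3*n
  std::vector<double> workl;   // 3*p^2 + 6*p
  std::vector<double> workev;  // 3*p
  std::vector<double> dr, di;  // k + 1 (plus optional padding)
  std::vector<double> z;       // n by (k + 1)
  std::vector<int> sel;        // p; Fortran LOGICAL assumed int-sized here

  NonsymWorkspace (std::size_t n_, std::size_t k_, std::size_t p_,
                   std::size_t extra = 0)
    : n (n_), k (k_), p (p_),
      resid (n_), v (n_ * p_), workd (3 * n_),
      workl (3 * p_ * p_ + 6 * p_), workev (3 * p_),
      dr (k_ + 1 + extra), di (k_ + 1 + extra),
      z (n_ * (k_ + 1)), sel (p_)
  { }

  // Length to pass as lworkl.
  std::size_t lworkl () const { return workl.size (); }
};
```

A C++ caller would then build one NonsymWorkspace per problem and pass
w.dr.data (), w.v.data (), and so on to the Fortran routines, which also
makes a standalone test program outside of Octave easier to write.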

D.