Re: [ESPResSo-users] mpi and compressed block files


From: Martin Lindén
Subject: Re: [ESPResSo-users] mpi and compressed block files
Date: Fri, 07 Sep 2012 13:43:40 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:15.0) Gecko/20120827 Thunderbird/15.0

Dear Axel,

thanks for looking into this. I realize that I was not quite clear about
what the problem is. There are indeed warnings if the number of cores
differs between writing and reading, and getting rid of those as you
suggest seems like a good idea. I believe they are separate from what is
causing the error, though.

The script I attached actually reads two files with identical contents:
a text file, and a gzipped version of it. The script only crashes on the
gzipped one, and only when run in MPI mode with several (>2) cores. It
also does not matter if the text file is read before the gz file or not.

If the error were related to the content of the files, such as name
clashes in stored variables, I would expect the same behavior from both
files, since their content is identical. This is not the case, though.
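
The claim that both files carry the same content can be checked
independently of ESPResSo. The following is a minimal sketch; the helper
name same_content and the file names in the usage comment are
placeholders, not names taken from the attached script.

```shell
# Hypothetical helper: succeeds (exit status 0) if the gzipped file
# decompresses to exactly the bytes of the plain-text file.
same_content() {
    # $1: plain-text block file, $2: gzipped block file
    # cmp reads the decompressed stream from stdin ("-") and compares
    # it byte by byte with the plain file; -s suppresses output.
    gzip -cd "$2" | cmp -s - "$1"
}

# Usage (file names are placeholders):
#   same_content small.block small.block.gz && echo "identical content"
```

If this succeeds, any difference in behavior between the two reads has
to come from how the compressed file is opened and closed, not from
what it contains.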

Instead, I think the problem occurs when trying to close the channel
associated with the gzipped file:
>> error waiting for process to exit: child process lost (is SIGCHLD
>> ignored or trapped?)
>>      while executing
>> "close $innnn"
>>      (file "blockread3.tcl" line 14)
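
Since the error is raised while waiting for a child process when the
channel is closed, one way to sidestep it, under the assumption that the
compressed file is read through an external gzip process, is to
decompress the file before starting the MPI run and let the script read
only plain text. A minimal sketch; the helper name gunzip_to_tmp and the
file names are placeholders, and the Tcl script would have to be pointed
at the decompressed file.

```shell
# Hypothetical helper: decompress a gzipped block file into a fresh
# temporary file and print the path of that file on stdout.
gunzip_to_tmp() {
    # $1: gzipped block file
    out=$(mktemp) || return 1
    if gzip -cd "$1" > "$out"; then
        printf '%s\n' "$out"
    else
        # Decompression failed: do not leave a partial file behind.
        rm -f "$out"
        return 1
    fi
}

# Usage sketch (names are placeholders; the script must read "$plain"
# instead of the .gz file):
#   plain=$(gunzip_to_tmp small.block.gz) || exit 1
#   mpirun -n 4 Espresso blockread3.tcl
#   rm -f "$plain"
```

This only avoids the symptom; it does not explain why the child process
is lost when several MPI ranks are running.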

Sincerely,

Martin

PS: the complete output from running this script on my machine is as
follows.

address@hidden:~/tmp/espresso_bug$ mpirun -n 4 Espresso blockread3.tcl
*******************************************************
*                                                     *
*                    - ESPResSo -                     *
*                    ============                     *
*        A Parallel Molecular Dynamics Program        *
*                                                     *
* (c) 2010,2011,2012                                  *
* The ESPResSo project                                *
*                                                     *
* (c) 2002,2003,2004,2005,2006,2007,2008,2009,2010    *
* Max-Planck-Institute for Polymer Research           *
* Mainz, Germany                                      *
*                                                     *
*******************************************************

This is ESPResSo-3.1.0.

ESPResSo is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

ESPResSo is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.

ESPResSo-3.1.0
{ Compilation status { FFTW } { BOND_ANGLE_HARMONIC } { LENNARD_JONES }
{ LJCOS } { LJCOS2 } { MPI_CORE } { EXCLUSIONS } }
reading a small uncompressed block file...
WARNING: node_grid incompatible with current n_nodes, ignoring
... done.
reading a small compressed block file...
WARNING: node_grid incompatible with current n_nodes, ignoring
error waiting for process to exit: child process lost (is SIGCHLD
ignored or trapped?)
    while executing
"close $innnn"
    (file "blockread3.tcl" line 14)
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
address@hidden:~/tmp/espresso_bug$

On 09/06/2012 10:47 PM, Axel Arnold wrote:
> Dear Martin,
> 
> this is basically the same problem as you already ran into with reading
> all Tcl variables. You read back all values, and some are incompatible
> with changing the number of nodes.  Here, this is the processor node
> grid, which of course has to fit the number of nodes when the checkpoint
> was written, i.e., you can only read this variable back if you don't
> change the number of processors.
> 
> Just like for tcl variables, there are also blacklists for not reading
> back certain variables from the setmd variables, see the user's guide,
> and you can specify which variables you really need to reset. However,
> in the case of the checkpoints, there is one more concern that you
> should be aware of: if you use a thermostat that relies on random
> numbers, such as the standard Langevin, then the random numbers will
> only be reproducible if you use the same node_grid (and hence, number of
> nodes), and restore the random seeds. Therefore, for true checkpointing,
> you need to save node_grid and restore it, on the same number of nodes.
> In addition, you need to unconditionally recreate the Verlet lists,
> which requires the command "invalidate_system" right after writing the
> checkpoint.
> 
> In your case, it seems that you are just creating the setup serially,
> and then want to go parallel. In this case, saving random seeds etc. is
> not necessary, and you should only save those setmd variables that you
> actually changed during your setup script.
> 
> Cheers,
> Axel
> 
> On 09/06/2012 02:50 PM, Martin Lindén wrote:
>> Hi!
>>
>> I am fairly new to Espresso, and have some trouble with reading
>> checkpoints, as described at the end of Sec. 10.1.7 in the user's guide
>> for 3.1.0.
>>
>> To reproduce the problem:
>>
>> 1. Run blockread3.tcl in serial mode. This reads an uncompressed and a
>> compressed version of a blockfile (identical content), and works as
>> expected.
>>> Espresso blockread3.tcl
>> 2. Run in mpi mode with one processor. Somewhat artificial, but works:
>>> mpirun -n 1 Espresso blockread3.tcl
>> 3. The problem is mpi on multiple processors:
>>> mpirun -n 4 Espresso blockread3.tcl
>> (...)
>>
>> WARNING: node_grid incompatible with current n_nodes, ignoring
>> error waiting for process to exit: child process lost (is SIGCHLD
>> ignored or trapped?)
>>      while executing
>> "close $innnn"
>>      (file "blockread3.tcl" line 14)
>> --------------------------------------------------------------------------
>>
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>>
>>
>> Two processors (mpirun -n 2 ...) sometimes goes through and sometimes
>> crashes, but more than two always crashes on my system.
>>
>> A temporary fix is of course to stay away from compressing the block
>> files. But it would be nice to be able to work with compressed files
>> when I go to larger systems.
>>
>>
>> System info:
>>
>> ESPResSo-3.1.0
>> { Compilation status { FFTW } { BOND_ANGLE_HARMONIC } { LENNARD_JONES }
>> { LJCOS } { LJCOS2 } { MPI_CORE } { EXCLUSIONS } }
>>
>> mpirun (Open MPI) 1.5.4
>>
>> gzip 1.4
>>
>> ubuntu 12.04 64 bit.
>>
>> Sincerely,
>>
>> Martin
> 
> 

