
From: Axel Arnold
Subject: Re: [ESPResSo-users] mpi and compressed block files
Date: Thu, 06 Sep 2012 22:47:56 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:15.0) Gecko/20120827 Thunderbird/15.0

Dear Martin,

this is basically the same problem you already ran into when reading back all Tcl variables: you read back all values, and some of them are incompatible with changing the number of nodes. Here it is the processor node grid, which of course has to match the number of nodes at the time the checkpoint was written, i.e., you can only read this variable back if you do not change the number of processors.

Just like for Tcl variables, there are blacklists that prevent certain setmd variables from being read back (see the user's guide), so you can specify which variables you really need to restore. However, in the case of checkpoints there is one more concern you should be aware of: if you use a thermostat that relies on random numbers, such as the standard Langevin thermostat, then the random numbers are only reproducible if you use the same node_grid (and hence the same number of nodes) and restore the random seeds. Therefore, for true checkpointing, you need to save node_grid and restore it on the same number of nodes. In addition, you need to unconditionally recreate the Verlet lists, which requires the command "invalidate_system" right after writing the checkpoint.
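To make this concrete, here is a minimal Tcl sketch of writing such a checkpoint. The filename and the exact set of particle fields are illustrative assumptions; the blockfile and invalidate_system commands are the ones described in the user's guide, and it needs an ESPResSo build to actually run:

```tcl
# Sketch: write a checkpoint intended for exact restoration
# (i.e., restored later on the SAME number of nodes).
# "checkpoint.block.gz" and the particle field list are assumptions.
set f [open "|gzip -c - > checkpoint.block.gz" w]
blockfile $f write variable all          ;# includes node_grid
blockfile $f write tclvariable all
blockfile $f write particles {id pos v f}
blockfile $f write bonds
close $f
invalidate_system                        ;# force rebuilding the Verlet lists
```

For a true checkpoint with a random-number thermostat, the random seeds would also have to be saved and restored alongside this, as noted above.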

In your case, it seems that you are just creating the setup serially and then want to go parallel. In that case, saving random seeds etc. is not necessary, and you should only save those setmd variables that you actually changed during your setup script.
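A sketch of the corresponding read-back, assuming the variable blacklist mechanism from the user's guide (the blacklist variable name and file name here are assumptions to be checked against your ESPResSo version): skipping node_grid lets the current run keep a grid that matches its own number of MPI nodes.

```tcl
# Sketch: read a (possibly compressed) checkpoint on a different
# number of nodes. node_grid is blacklisted so the value written
# at checkpoint time is not forced onto the current run.
set blockfile_variable_blacklist { node_grid }
set f [open "|gzip -cd checkpoint.block.gz" r]
while { [blockfile $f read auto] != "eof" } {}
close $f
```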


On 09/06/2012 02:50 PM, Martin Lindén wrote:

I am fairly new to Espresso, and have some trouble with reading
checkpoints, as described at the end of Sec. 10.1.7 in the users guide
for 3.1.0.

To reproduce the problem:

1. Run blockread3.tcl in serial mode. This reads an uncompressed and a
compressed version of a blockfile (identical content), and works as expected.
Espresso blockread3.tcl
2. Run in mpi mode with one processor. Somewhat artificial, but works:
mpirun -n 1 Espresso blockread3.tcl
3. The problem is mpi on multiple processors:
mpirun -n 4 Espresso blockread3.tcl

WARNING: node_grid incompatible with current n_nodes, ignoring
error waiting for process to exit: child process lost (is SIGCHLD
ignored or trapped?)
     while executing
"close $innnn"
     (file "blockread3.tcl" line 14)
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.

Two processors (mpirun -n 2 ...) sometimes go through and sometimes
crash, but more than two always crashes on my system.

A temporary fix is of course to stay away from compressing the block
files. But it would be nice to be able to work with compressed files
when I go to larger systems.

System info:

{ Compilation status { FFTW } { BOND_ANGLE_HARMONIC } { LENNARD_JONES }

mpirun (Open MPI) 1.5.4

gzip 1.4

ubuntu 12.04 64 bit.



JP Dr. Axel Arnold           Tel: +49 711 685 67609
ICP, Universität Stuttgart   Email: address@hidden
Pfaffenwaldring 27
70569 Stuttgart, Germany
