espressomd-users

[ESPResSo-users] Problems running parallel Espresso


From: Yanping Fan, Liza (Dr)
Subject: [ESPResSo-users] Problems running parallel Espresso
Date: Tue, 14 Dec 2010 09:34:50 +0800

Dear all:

I'm writing to seek advice about some problems I ran into when trying to run my
script on multiple CPUs.

The script runs fine on a single CPU process; however, when I switch to
multiple CPUs, for instance 8 CPUs within one node, there are problems.

Here's the error message:

0: Script directory: /home/korolev/espresso-2.1.2j_p/scripts

background_errors 0 {079 bond broken between particles 294, 295 and 296 
(particles not stored on the same node)} 6 {079 bond broken between particles 
225, 226 and 227 (particles not stored on the same node)} 7 {079 bond broken 
between particles 331, 332 and 333 (particles not stored on the same node)}

    while executing

"integrate $warm_steps"

    invoked from within

"if { $i < $warm_loops } {

        #  Set LJ cup

        set cap $warm_cap

        inter ljforcecap $warm_cap



        xtc_init "warm-up.xtc"

 ..."

The bonds break because the bonded particles are stored on different nodes. My
simulation box is 400A*400A*400A, and all my equilibrium bond lengths are
between 20 and 40 A. I was advised to increase the Verlet list parameter
"skin" (originally I had set it to 0.5). With a skin of 0.5, there is
apparently a good chance that two bonded particles end up on two different
processors.
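The geometry behind that advice can be sketched as a quick back-of-the-envelope
check: in a domain decomposition, a bonded pair can only be resolved if the
bond length fits within the interaction range (cutoff plus skin) that each node
can see in its ghost shell. This is a hypothetical sanity check, not ESPResSo
code; the cutoff value 10.0 A is an assumed placeholder, not taken from the
script above.

```python
# Hypothetical check (not part of ESPResSo): a bonded pair is only
# guaranteed to be found across node boundaries if the bond length
# stays within max_cut + skin, the range covered by the ghost shell.

def bond_reachable(bond_len, max_cut, skin):
    """Return True if a bond of this length fits in the ghost-shell range."""
    return bond_len <= max_cut + skin

# Longest equilibrium bond in the script is ~40 A; max_cut = 10.0 is assumed.
print(bond_reachable(40.0, 10.0, 0.5))   # skin = 0.5  -> "bond broken" expected
print(bond_reachable(40.0, 10.0, 30.0))  # skin = 30   -> bond stays reachable
```

This matches the behaviour reported below: with skin 0.5 the "bond broken"
error appears, and raising the skin to 30 makes it go away.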

I increased the "skin" to 30, and the above error message disappeared, but
another error occurred:
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit code.  
This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return 0" or 
"exit(0)" in your C code before exiting the application.

PID 13898 failed on node n0 (192.168.2.160) due to signal 11.

With the skin set to 30, runs on 2, 4, and 8 CPUs all end with a "Segmentation
fault" error. Please see the attached error message and log file. I suspect
the problem may be related to how Espresso distributes particles over the
different nodes/processes.
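For reference, the failing runs can be reproduced with a small loop over
process counts; the binary and script names here are assumed from the log
above, and the leading "echo" makes this a dry run (drop it to actually
launch the jobs).

```shell
# Dry run of the parallel jobs that segfault (names assumed, not verified):
for np in 2 4 8; do
  echo mpirun -np "$np" Espresso NCP-12-6.tcl
done
```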

Moreover, the same script produces different warnings on different Espresso
versions. The attached email is from NTU High Performance Computing (HPC)
Centre staff, who tested the script on the different Espresso versions
installed at the HPC.

I hope someone can give me advice on how to fix this. Thanks in advance.

Best regards,

Liza

CONFIDENTIALITY: This email is intended solely for the person(s) named and may 
be confidential and/or privileged. If you are not the intended recipient, 
please delete it, notify us and do not copy, use, or disclose its content. 
Thank you.

Towards A Sustainable Earth: Print Only When Necessary
--- Begin Message --- Subject: Results for Compilation of Espresso Date: Fri, 3 Dec 2010 09:00:54 +0800

Hi Yanping,

 

Please see below for the results from compiling different versions of Espresso and running them with the xtc_write script:

Ver. 2.0.2n

Make check failed.

Error output from NCP-12-6.tcl:

[hpcheadnode1:18749] *** Process received signal ***

[hpcheadnode1:18749] Signal: Segmentation fault (11)

[hpcheadnode1:18749] Signal code: Address not mapped (1)

[hpcheadnode1:18749] Failing at address: 0x2aaaafa55f58

[hpcheadnode1:18749] [ 0] /lib64/libpthread.so.0 [0x3f3fc0e4c0]

[hpcheadnode1:18749] [ 1] ~/espresso-2.0.2n/lib/espresso/obj-unknown_CPU-pc-linux/Espresso_bin [0x45ac9c]

[hpcheadnode1:18749] [ 2] ~/espresso-2.0.2n/lib/espresso/obj-unknown_CPU-pc-linux/Espresso_bin(P3M_charge_assign+0xe9) [0x45a869]

[hpcheadnode1:18749] [ 3] ~/espresso-2.0.2n/lib/espresso/obj-unknown_CPU-pc-linux/Espresso_bin(calc_long_range_forces+0xc2) [0x432162]

[hpcheadnode1:18749] [ 4] ~/espresso-2.0.2n/lib/espresso/obj-unknown_CPU-pc-linux/Espresso_bin(force_calc+0x5d) [0x43208d]

[hpcheadnode1:18749] [ 5] ~/espresso-2.0.2n/lib/espresso/obj-unknown_CPU-pc-linux/Espresso_bin(integrate_vv+0xe4) [0x42e794]

[hpcheadnode1:18751] *** Process received signal ***

[hpcheadnode1:18751] Signal: Segmentation fault (11)

[hpcheadnode1:18751] Signal code: Address not mapped (1)

[hpcheadnode1:18751] Failing at address: 0x2aaaafa54b58

[hpcheadnode1:18749] [ 6] ~/espresso-2.0.2n/lib/espresso/obj-unknown_CPU-pc-linux/Espresso_bin(mpi_integrate_slave+0x8) [0x419df8]

[hpcheadnode1:18749] [ 7] ~/espresso-2.0.2n/lib/espresso/obj-unknown_CPU-pc-linux/Espresso_bin(mpi_loop+0x5e) [0x41cd5e]

[hpcheadnode1:18749] [ 8] ~/espresso-2.0.2n/lib/espresso/obj-unknown_CPU-pc-linux/Espresso_bin(main+0x67) [0x417257]

[hpcheadnode1:18749] [ 9] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3f3f01d974]

[hpcheadnode1:18749] [10] ~/espresso-2.0.2n/lib/espresso/obj-unknown_CPU-pc-linux/Espresso_bin [0x417139]

[hpcheadnode1:18749] *** End of error message ***

[hpcheadnode1:18751] [ 0] /lib64/libpthread.so.0 [0x3f3fc0e4c0]

[hpcheadnode1:18751] [ 1] ~/espresso-2.0.2n/lib/espresso/obj-unknown_CPU-pc-linux/Espresso_bin [0x45ac9c]

[hpcheadnode1:18751] [ 2] ~/espresso-2.0.2n/lib/espresso/obj-unknown_CPU-pc-linux/Espresso_bin(P3M_charge_assign+0xe9) [0x45a869]

[hpcheadnode1:18751] [ 3] ~/espresso-2.0.2n/lib/espresso/obj-unknown_CPU-pc-linux/Espresso_bin(calc_long_range_forces+0xc2) [0x432162]

[hpcheadnode1:18751] [ 4] ~/espresso-2.0.2n/lib/espresso/obj-unknown_CPU-pc-linux/Espresso_bin(force_calc+0x5d) [0x43208d]

[hpcheadnode1:18751] [ 5] ~/espresso-2.0.2n/lib/espresso/obj-unknown_CPU-pc-linux/Espresso_bin(integrate_vv+0xe4) [0x42e794]

[hpcheadnode1:18751] [ 6] ~/espresso-2.0.2n/lib/espresso/obj-unknown_CPU-pc-linux/Espresso_bin(mpi_integrate_slave+0x8) [0x419df8]

[hpcheadnode1:18751] [ 7] ~/espresso-2.0.2n/lib/espresso/obj-unknown_CPU-pc-linux/Espresso_bin(mpi_loop+0x5e) [0x41cd5e]

[hpcheadnode1:18751] [ 8] ~/espresso-2.0.2n/lib/espresso/obj-unknown_CPU-pc-linux/Espresso_bin(main+0x67) [0x417257]

[hpcheadnode1:18751] [ 9] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3f3f01d974]

[hpcheadnode1:18751] [10] ~/espresso-2.0.2n/lib/espresso/obj-unknown_CPU-pc-linux/Espresso_bin [0x417139]

[hpcheadnode1:18751] *** End of error message ***

--------------------------------------------------------------------------

mpirun noticed that process rank 3 with PID 18751 on node hpcheadnode1 exited on signal 11 (Segmentation fault).

--------------------------------------------------------------------------

 

Ver. 2.1.2j

Make check passed.

The output from NCP-12-6.tcl shows mpirun running ok, but it stops half-way.

 

Ver. 2.2.0b

Make check passed.

Error output from NCP-12-6.tcl:

 

warm-up.xtc initialized...

background_errors 2 {079 bond broken between particles 96, 97 and 98 (particles not stored on the same node)}

    while executing

"integrate $warm_steps"

    invoked from within

"if { $i < $warm_loops } {

        #  Set LJ cup

        set cap $warm_cap

        inter ljforcecap $warm_cap

 

        xtc_init "warm-up.xtc"

..."

    (file "NCP-12-6.tcl" line 783)

 

It appears that ver. 2.0.2n is not compatible with the version of MPI that I used. Version 2.1.2j can run, but the process stops half-way without showing any error. Version 2.2.0b seems to require optimization changes in the input file in order for it to run properly.

 

If you have successfully run your NCP-12-6.tcl job with any of the versions above somewhere else, please do let me know, so that I can compare which parts went wrong. It may take quite a bit of time to find out what actually happened.

 

Thanks with regards,

Hon Wai, LEONG

On behalf of

High Performance Computing Centre

Nanyang Technological University

Contact No.: 65922415

Website: http://hpc.ntu.edu.sg

 




--- End Message ---

Attachment: esp_0001.err.log
Description: esp_0001.err.log

Attachment: esp_0001.log
Description: esp_0001.log
