Re: Parallelisation

From: Ahmad Reza Motezakker
Subject: Re: Parallelisation
Date: Fri, 14 Jan 2022 14:01:32 +0000

Dear Rudolf,

Thank you for your help. I tried it; please find the information below:

mpirun -np 4 ./pypresso  .../maintainer/benchmarks/ --particles_per_core=20000 --lb_sites_per_particle 6

and I get an average timing of 1.321e-01 +/- 7.118e-04 (95% C.I.) on an Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz with 20 cores.

A general question: in my system (polymer + LJ + LB), do you think the box size affects the results? I am asking because for cases with high concentration the simulation is really slow, so I was thinking of making the box smaller, keeping the same high concentration but with fewer particles.
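
For reference, the bookkeeping for a smaller box at the same concentration can be sketched in Python as follows. This is only an illustration: the numbers are those of the system quoted further down in this thread, new_box_l = 200 is an arbitrary example value, and polymers_for_box is a hypothetical helper, not part of ESPResSo. It says nothing about whether the smaller box changes the physics.

```python
# Sketch: keep the bead number density fixed while shrinking a cubic box.
# Current system, as quoted in this thread (lengths in units of sigma).
box_l = 300.0
n_polymers = 300
beads_per_polymer = 26

n_beads = n_polymers * beads_per_polymer   # total number of beads
density = n_beads / box_l**3               # beads per sigma^3


def polymers_for_box(new_box_l, density, beads_per_polymer):
    """Whole number of chains that best matches `density` in a cubic box."""
    return round(density * new_box_l**3 / beads_per_polymer)


new_box_l = 200.0  # hypothetical smaller box edge, chosen only as an example
new_polymers = polymers_for_box(new_box_l, density, beads_per_polymer)
new_beads = new_polymers * beads_per_polymer

print(f"current: {n_beads} beads, density {density:.3e} per sigma^3")
print(f"smaller box: {new_polymers} polymers, {new_beads} beads")
```

Because the number of chains must be an integer, the density in the smaller box matches the original only approximately.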

Thank you very much,

Ahmad Reza

From: Rudolf Weeber
Sent: Thursday, January 13, 2022 11:18:25 AM
To: Ahmad Reza Motezakker
Subject: Re: Parallelisation

Hi Ahmad,

On Wed, Jan 12, 2022 at 01:35:35PM +0000, Ahmad Reza Motezakker wrote:

> I have a suspension of polymers coupled with fluid. (LJ+LB)
> Here are the parameters:
> box_l = 300*sigma  (box is a cube)
> number of polymers = 300
> beads per polymer = 26
> All the particles = 300*26 =7800
> LJ cut = sigma*(2**(1/6))
> l_skin = 8.3 * sigma (set to this value to have 31 cells in each direction)
> LB cells = 50
> number of cells in each direction = 31
> Timing for 100 productive runs after setting up the system and warming up:
> 1 core     17.824 s
> 2 cores    16.22 s
> 4 cores    15.93 s
> 8 cores    17.83 s
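
The timings above barely change with the number of cores. A minimal Python sketch, using only the four numbers reported in the quoted message, makes the speedup and parallel efficiency explicit (efficiency here means speedup divided by core count):

```python
# Wall-clock timings from the quoted message: number of cores -> seconds.
timings = {1: 17.824, 2: 16.22, 4: 15.93, 8: 17.83}

t_serial = timings[1]
speedups = {}
efficiencies = {}
for cores, seconds in timings.items():
    speedups[cores] = t_serial / seconds            # 1.0 means no gain
    efficiencies[cores] = speedups[cores] / cores   # 1.0 means ideal scaling
    print(f"{cores} cores: speedup {speedups[cores]:.2f}, "
          f"efficiency {efficiencies[cores]:.0%}")
```
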
Can you please report the timings obtained via

mpirun -np 4 ./pypresso  ../maintainer/benchmarks/ --particles_per_core=20000 --lb_sites_per_particle 6
These are 80k particles with a 78^3 LB lattice, so slightly bigger than your system. I get about 80 ms per time step on an AMD Ryzen Threadripper 1920X with 12 cores.
You can also check on 8 cores by using -np 8  and --particles_per_core=10000. On my system, this is not worth it.
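
As a rough check of the sizes mentioned above, the following Python sketch reproduces the 80k particles and the 78^3 lattice from the command-line values. The way the lattice size follows from the flags is inferred from the numbers in this message, not taken from the benchmark script, and benchmark_size is a hypothetical helper.

```python
def benchmark_size(n_cores, particles_per_core, lb_sites_per_particle):
    """Return (total particles, LB sites per box edge) for a cubic lattice.

    Assumption: total LB sites = particles * lb_sites_per_particle,
    arranged as a cube; the edge is rounded to the nearest integer.
    """
    n_particles = n_cores * particles_per_core
    lb_sites = n_particles * lb_sites_per_particle
    lb_edge = round(lb_sites ** (1 / 3))
    return n_particles, lb_edge


# 4 cores with 20000 particles each, and 8 cores with 10000 each,
# describe the same total system size.
size_4 = benchmark_size(4, 20000, 6)
size_8 = benchmark_size(8, 10000, 6)
print(size_4, size_8)
```

Keeping the total size fixed while changing -np is what makes the 4-core and 8-core timings directly comparable.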
You can get a significantly faster simulation by using the GPU LB. The speedup relies partially on the fact that GPUs are very well suited for LB (e.g. because of their high memory bandwidth) but also on the fact that Espresso's GPU LB uses single precision, whereas the CPU one uses double precision.
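
To give a feeling for the precision argument, here is a back-of-the-envelope Python estimate of the memory taken by the LB populations alone for the 78^3 benchmark lattice. It assumes 19 populations per site (a D3Q19 lattice), which is not stated in this thread, and it ignores all other fields, so treat the numbers as an order-of-magnitude sketch only.

```python
import struct

LB_EDGE = 78               # sites per direction, from the benchmark above
POPULATIONS_PER_SITE = 19  # assumption: D3Q19 velocity set

n_values = LB_EDGE**3 * POPULATIONS_PER_SITE

bytes_single = n_values * struct.calcsize("f")  # 4-byte float
bytes_double = n_values * struct.calcsize("d")  # 8-byte float

print(f"single precision: {bytes_single / 1e6:.1f} MB")
print(f"double precision: {bytes_double / 1e6:.1f} MB")
```

Half the bytes per value means half the data to move per LB update, which matters for a method that is typically limited by memory bandwidth.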

> If I want get one node with 128 cores on cluster and only use 4 of them, the cluster support will not be happy.
On some clusters, it is possible to request just a part of a node (shared node usage). Otherwise, it may be possible to run several instances of Espresso at the same time on the cluster node.

Regards, Rudolf

