[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
## Re: Question about checkpoint

**From**: |
Jean-Noël Grad |

**Subject**: |
Re: Question about checkpoint |

**Date**: |
Thu, 09 Dec 2021 17:13:32 +0100 |

**User-agent**: |
Roundcube Webmail/1.3.17 |

Dear Lili Zeng,

`Thank you for sharing your observations. The checkpointing mechanism
``doesn't restore all global variables. Some of them are regenerated from
``the state of other global variables. For example, the cell structure
``global variable stores raw pointers to the particles, but these pointers
``are meaningless when reloading from a checkpoint, and so the cell
``structure global has to be regenerated by placing "new" particles in an
``empty simulation box based on the particles positions stored in the
``checkpoint file. This causes unintended side-effects when reloading a
``simulation from a checkpoint, because a cell resort is triggered when
``particles are placed in the simulation box. In particular, cached values
``of the integrator are invalidated. This can lead to an error in the
``calculated forces in the order of 1e-4. If the time step is 1e-2, the
``particle positions will be off by 1e-8 at the next integration step and
``the trajectory will diverge.
`

`Please find attached a minimal working example to replicate this
``behavior. It simulates a single particle for 4 time steps using the
``Langevin thermostat, with a "perturbation" at the third time step. This
``perturbation can be a reload from a checkpoint (--perturbation load), a
``cell resort (--perturbation resort) or an invalidation of the cached
``integrator values (--perturbation recalc). The reference trajectory is
``obtained without perturbation (--perturbation continue), which also
``saves a checkpoint before the third time step. Here are the results for
``ESPResSo 4.2-dev:
`
$ ./pypresso checkpoint_mwe.py --perturbation continue

`rng_counter = 1, p.pos=array([ 0.00084648, -0.00009886, 0.00006749])
``p.f=array([ 15.00509416, 4.49980631, -23.50091465])
``rng_counter = 2, p.pos=array([ 0.00319346, 0.00025227, -0.00221511])
``p.f=array([ 24.24592812, 4.85133902, -17.03844439])
``rng_counter = 3, p.pos=array([ 0.00796504, 0.00108853, -0.00620156])
``p.f=array([ 14.41245656, 9.68804012, 22.38072167])
``rng_counter = 4, p.pos=array([ 0.01417787, 0.00289359, -0.00794993])
``p.f=array([-14.30846688, -4.63255625, 24.10474287])
`$ ./pypresso checkpoint_mwe.py --perturbation load
rng_counter = 1, unknown
rng_counter = 2, unknown

`rng_counter = 3, p.pos=array([ 0.00795898, 0.00108732, -0.00619730])
``p.f=array([ 14.41306271, 9.68816141, 22.38029571])
``rng_counter = 4, p.pos=array([ 0.01416581, 0.00289118, -0.00794146])
``p.f=array([-14.30786679, -4.63243618, 24.10432117])
`$ ./pypresso checkpoint_mwe.py --perturbation resort

`rng_counter = 1, p.pos=array([ 0.00084648, -0.00009886, 0.00006749])
``p.f=array([ 15.00509416, 4.49980631, -23.50091465])
``rng_counter = 2, p.pos=array([ 0.00319346, 0.00025227, -0.00221511])
``p.f=array([ 24.24592812, 4.85133902, -17.03844439])
``rng_counter = 3, p.pos=array([ 0.00795898, 0.00108732, -0.00619730])
``p.f=array([ 14.41306271, 9.68816141, 22.38029571])
``rng_counter = 4, p.pos=array([ 0.01416581, 0.00289118, -0.00794146])
``p.f=array([-14.30786679, -4.63243618, 24.10432117])
`$ ./pypresso checkpoint_mwe.py --perturbation recalc

`rng_counter = 1, p.pos=array([ 0.00084648, -0.00009886, 0.00006749])
``p.f=array([ 15.00509416, 4.49980631, -23.50091465])
``rng_counter = 2, p.pos=array([ 0.00319346, 0.00025227, -0.00221511])
``p.f=array([ 24.24592812, 4.85133902, -17.03844439])
``rng_counter = 3, p.pos=array([ 0.00795898, 0.00108732, -0.00619730])
``p.f=array([ 14.41306271, 9.68816141, 22.38029571])
``rng_counter = 4, p.pos=array([ 0.01416220, 0.00288876, -0.00794705])
``p.f=array([-14.30750646, -4.63219397, 24.10488068])
`

`According to these results, the trajectory after a reload is similar to
``the trajectory one would obtain by forcing a particle resort, and
``diverges by 1e-5 in unit of length at the third time step. This might
``not be the only contributing factor for the drift you observed, but
``characterizing which other components of ESPResSo are affected by the
``reload is not a trivial task since many features of ESPResSo behave
``differently depending on which other features are currently active.
`

`Unfortunately, the checkpointing feature of ESPResSo is still at an
``experimental stage and is not well maintained. We have checkpointing
``tests to guarantee that the state of particles, thermostats, integrators
``and numerical solvers is correctly reloaded. However the cell system
``isn't properly reloaded at the moment, and therefore isn't tested. I
``would not recommended enabling checkpointing if you need reproducible
``trajectories.
`
Best,
JN
On 2021-12-07 20:21, Lili Zeng wrote:

Hi,
I'm a PhD student at McGill University in Montreal, Canada, studying
polymer physics, and I'm a user of ESPResSo (version 4.1). I have a
question about checkpoints. I noticed that when I run a simulation
(say, with 2000 timesteps) in one shot, the final positions of my
particles are different from when I run the first half of an identical
simulation (so 1000 timesteps), then save under checkpoint, restore
the simulation using checkpoint, and run to completion (another 1000
timesteps). This is true even for the simplest systems, eg one single
particle suspended in space with no interactions. In this case, the
difference in the final particle position is very small (on order of
10^(-6) for a particle of size 1 in a system box of size 50), but it
still exists. I wanted to ask whether this is normal, and whether
there is any way to get identical results using checkpoint compared to
when not using checkpoint.
Thank you!
Lili

**
**`checkpoint_mwe.py`

*Description:* Text Data