
Re: [ESPResSo-users] Cuda Memory Error


From: Georg Rempfer
Subject: Re: [ESPResSo-users] Cuda Memory Error
Date: Tue, 22 Mar 2016 12:54:13 +0100

The relevant function is in lbgpu_cuda.cu:3558.

void lb_save_checkpoint_GPU(float *host_checkpoint_vd, unsigned int *host_checkpoint_seed,
                            unsigned int *host_checkpoint_boundary, lbForceFloat *host_checkpoint_force) {
  cuda_safe_mem(cudaMemcpy(host_checkpoint_vd, current_nodes->vd,
                           lbpar_gpu.number_of_nodes * 19 * sizeof(float), cudaMemcpyDeviceToHost));
  cuda_safe_mem(cudaMemcpy(host_checkpoint_seed, current_nodes->seed,
                           lbpar_gpu.number_of_nodes * sizeof(unsigned int), cudaMemcpyDeviceToHost));
  cuda_safe_mem(cudaMemcpy(host_checkpoint_boundary, current_nodes->boundary,
                           lbpar_gpu.number_of_nodes * sizeof(unsigned int), cudaMemcpyDeviceToHost));
  cuda_safe_mem(cudaMemcpy(host_checkpoint_force, node_f.force,
                           lbpar_gpu.number_of_nodes * 3 * sizeof(lbForceFloat), cudaMemcpyDeviceToHost));
}

As far as I can see, this should not require any additional GPU memory. Can you try commenting out these cudaMemcpy lines, recompiling, and rerunning? If that works, comment them back in one by one, recompiling and rerunning each time. That way we will find out what exactly breaks.
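For scale: these copies only fill host-side buffers; the device arrays they read from already exist, which is why no extra GPU memory should be needed. Here is a quick standalone estimate of the host buffer sizes for the 1200x300x130 box reported below (a hypothetical snippet, not ESPResSo code; lbForceFloat is assumed to be a 4-byte float):

#include <stdio.h>

/* Back-of-the-envelope sizes of the four host checkpoint buffers, using
 * the byte counts from the cudaMemcpy calls above. */
int main(void) {
  unsigned long long nodes = 1200ULL * 300ULL * 130ULL;  /* 46.8e6 LB nodes */
  double gib = 1024.0 * 1024.0 * 1024.0;
  double vd       = nodes * 19.0 * sizeof(float);        /* populations */
  double seed     = nodes * 1.0 * sizeof(unsigned int);  /* RNG seeds */
  double boundary = nodes * 1.0 * sizeof(unsigned int);  /* boundary flags */
  double force    = nodes * 3.0 * 4.0;                   /* forces, assuming 4-byte lbForceFloat */
  printf("vd %.2f GiB, seed %.2f GiB, boundary %.2f GiB, force %.2f GiB\n",
         vd / gib, seed / gib, boundary / gib, force / gib);
  printf("total %.2f GiB of host memory\n", (vd + seed + boundary + force) / gib);
  return 0;
}

That works out to roughly 3.3 GiB for the populations and about 4.2 GiB in total, all of it in host RAM rather than on the GPU.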

Can you show me your lbgpu_cuda.cu:3572? In my version, this is a comment line.

We suspect that this is not a memory limitation, but that something else is broken.
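One caveat when interpreting the file and line in such messages: the CUDA runtime can report a failure from an earlier asynchronous kernel launch at the next runtime call that checks a return code, so the reported location is where the error surfaced, not necessarily where it originated. A minimal sketch of an error-checking wrapper in the same spirit as cuda_safe_mem (simplified and hypothetical, not the actual ESPResSo macro):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Simplified stand-in for a cuda_safe_mem-style wrapper: check the return
// code of a CUDA runtime call and abort with the file and line on failure.
// The location printed here is where the error was *detected*; a failed
// asynchronous launch earlier in the program can be reported by a later,
// unrelated runtime call.
#define CUDA_CHECK(call)                                             \
  do {                                                               \
    cudaError_t err_ = (call);                                       \
    if (err_ != cudaSuccess) {                                       \
      fprintf(stderr, "CUDA error: %s at %s:%d\n",                   \
              cudaGetErrorString(err_), __FILE__, __LINE__);         \
      exit(EXIT_FAILURE);                                            \
    }                                                                \
  } while (0)

int main() {
  float *dev = nullptr;
  CUDA_CHECK(cudaMalloc(&dev, 1024 * sizeof(float)));
  CUDA_CHECK(cudaMemset(dev, 0, 1024 * sizeof(float)));
  CUDA_CHECK(cudaFree(dev));
  return 0;
}

Compiled with nvcc, this aborts with a file/line message much like the one quoted below whenever a wrapped call fails.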


On Tue, Mar 22, 2016 at 12:34 PM, Wink, Markus <address@hidden> wrote:

The problem occurs the first time the line is executed. Thanks for looking it up!

From: address@hidden [mailto:address@hidden] On behalf of Georg Rempfer
Sent: Tuesday, 22 March 2016 12:04
To: Wink, Markus
Cc: address@hidden
Subject: Re: [ESPResSo-users] Cuda Memory Error

Is this line being executed for the first time when the problem happens? In that case your memory is actually too small (I'll look at the malloc in a second to see how much is needed). Or has this line already worked once or several times? In that case there is a memory leak.

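A quick way to tell the two cases apart is to log the free device memory around the suspect line (or once per integration step) with cudaMemGetInfo: a value that shrinks monotonically across steps indicates a leak, while a one-time shortfall on the first call indicates the card is simply too small. A minimal sketch (hypothetical helper, not ESPResSo code):

#include <cstdio>
#include <cuda_runtime.h>

// Print free vs. total device memory with a caller-supplied tag. Call it
// before and after the suspect line, or once per integration step, and
// compare successive readings.
static void log_device_memory(const char *tag) {
  size_t free_bytes = 0, total_bytes = 0;
  if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess) {
    fprintf(stderr, "[%s] cudaMemGetInfo failed\n", tag);
    return;
  }
  printf("[%s] free %.1f MiB of %.1f MiB\n", tag,
         free_bytes / 1048576.0, total_bytes / 1048576.0);
}

int main() {
  log_device_memory("startup");  // e.g. call again after each step
  return 0;
}

nvidia-smi shows the same numbers from outside the process.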
On Tue, Mar 22, 2016 at 11:54 AM, Wink, Markus <address@hidden> wrote:

True... sorry for that.

I guess I found the line in my script that is causing the error. I was aiming to save the state of the fluid (lbfluid save_ascii_checkpoint). When calling that, the maximum memory is exceeded.

Do you have a rule of thumb for how much memory the lbfluid save_ascii_checkpoint command needs on the GPU (maybe as a function of the simulation box size)?

Greetings

Markus

From: address@hidden [mailto:address@hidden] On behalf of Georg Rempfer
Sent: Tuesday, 22 March 2016 11:48
To: Wink, Markus
Cc: address@hidden
Subject: Re: [ESPResSo-users] Cuda Memory Error

I assume by RAM you mean the memory of the GPU?

On Tue, Mar 22, 2016 at 11:22 AM, Wink, Markus <address@hidden> wrote:

Hello everybody,

I want to simulate a fairly big system (1200x300x130 LB nodes) on a GPU. The RAM is sufficient (12 GB) and I can start the simulation. Nevertheless, after a few integration steps the simulation stops with the error message shown at the bottom of this mail.

I monitored the GPU's memory usage during the simulation and realized that the memory needed for the simulation increases with time (the simulation crashes when there is no memory left on the GPU).

What is the reason that the memory needed increases with time? Is there an asymptotic maximum value for the memory needed? Can I somehow avoid the increase?

Greetings

Markus

Cuda Memory error at /home/wink/Dokumente/espresso-master/20150804_fixed/espresso-master/src/core/lbgpu_cuda.cu:3572.
CUDA error: invalid argument
You may have tried to allocate zero memory at /home/wink/Dokumente/espresso-master/20150804_fixed/espresso-master/src/core/lbgpu_cuda.cu:3572.
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMMUNICATOR 3
with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------