reproduce-devel

[bug #62879] Maneage handling of /dev/shm does not know about slurm


From: Boud Roukema
Subject: [bug #62879] Maneage handling of /dev/shm does not know about slurm
Date: Mon, 8 Aug 2022 16:24:50 -0400 (EDT)

URL:
  <https://savannah.nongnu.org/bugs/?62879>

                 Summary: Maneage handling of /dev/shm does not know about slurm
                 Project: Maneage
               Submitter: boud
               Submitted: Mon 08 Aug 2022 08:24:49 PM UTC
                Category: Software
                Severity: 3 - Normal
              Item Group: Crash
                  Status: None
                 Privacy: Public
             Assigned to: None
             Open/Closed: Open
         Discussion Lock: Any


    _______________________________________________________

Follow-up Comments:


-------------------------------------------------------
Date: Mon 08 Aug 2022 08:24:49 PM UTC By: Boud Roukema <boud>
DESCRIPTION: In both (i) reproduce/software/shell/configure.sh and (ii)
reproduce/software/make/basic.mk we have tests for the amount of RAM available
in /dev/shm. Firstly, these should be associated with each other in some way,
since someone may not realise that there are two separate tests. Secondly,
neither of these takes into account RAM limits imposed by a task allocation
manager such as 'slurm' on a shared HPC (high-performance computing) system.
This can lead to a crash, because the 'slurm' RAM limit is exceeded and
Maneage (via (i) and (ii)) cannot protect against that.
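
For context, a minimal sketch of the kind of check involved (hypothetical
wording and values; the actual tests in configure.sh and basic.mk differ in
detail):

    # Hypothetical sketch, not the exact Maneage code: compare the free
    # space on /dev/shm (in KiB, as reported by 'df') against a minimum
    # needed before building large packages such as GCC there.
    needed_kib=2500000
    avail_kib=$(df /dev/shm | awk 'NR==2 {print $4}')
    if [ "$avail_kib" -lt "$needed_kib" ]; then
        echo "Not enough space on /dev/shm; building on ordinary disk."
    fi

Both checks only look at what 'df' reports, so neither can see a lower
ceiling imposed by a job scheduler.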

EXPERIENCE: This *has* led to several crashes, with error messages such as the
following while compiling gcc:

./gt-dwarf2out.h:2673:2: fatal error: error writing to /tmp/ccch3SRR.s: Cannot allocate memory

where _/tmp_ is also a RAM disk of type _tmpfs_. Although this error occurs
when trying to write to /tmp, not to /dev/shm, the _slurm_ tools presumably
count the total RAM used by the user's task, wherever it is allocated.
Increasing the _--mem=..._ parameter given to _sbatch_ by a few GiB prevents
the crashes, without otherwise interfering with memory allocation in /tmp.
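
For example, the workaround currently looks something like this (the job
script name and the memory value are only illustrative):

    # Request a few GiB more than the build nominally needs, leaving
    # headroom for tmpfs usage in /tmp and /dev/shm.
    sbatch --mem=12G ./run_maneage.sh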

SUGGESTION:
(1) unify and modularise the two configure.sh and basic.mk tests, in the sense
that the variables setting the required amount of memory should be set in a
single location, such as _reproduce/software/config/ramdisk.conf_ (a sketch is
given below);
(2) print out this RAM disk info so that it is visible in the log file;
(3) provide an optional mechanism for Slurm (or other task managers; Slurm is
licensed as free software) to interact with Maneage.

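As an illustration of (1) and (2), the shared configuration file could look
roughly like this (the contents and the variable name are hypothetical):

    # reproduce/software/config/ramdisk.conf (hypothetical contents)
    #
    # Minimum free space (in KiB) required on the RAM disk before Maneage
    # builds large packages such as GCC there.  Read by both configure.sh
    # and basic.mk, so the threshold is defined in one place only, and
    # echoed into the log so the value actually used is visible.
    ramdisk_min_free_kib=2500000
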
Possibility for (3):

For example, allow an additional, optional environment variable
MANEAGE_RAMDISK_MAX, which is used in association with the
_reproduce/software/config/ramdisk.conf_ variables to decide whether to quit
or to create _$(BDIR)/software/build-tmp-gcc-due-to-lack-of-space_ and keep
going. In this case, the script submitted to a slurm tool (e.g. _sbatch_)
should set _--mem=${MANEAGE_RAMDISK_MAX}_ and feed MANEAGE_RAMDISK_MAX through
to Maneage (and Maneage would have to allow this through into _configure.sh_
and _basic.mk_).
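
A rough sketch of what that submission could look like (MANEAGE_RAMDISK_MAX
is only a proposed variable, and the script name and values are
illustrative):

    # One limit drives both: MANEAGE_RAMDISK_MAX is in KiB (to match the
    # units of 'df'), and the 'K' suffix tells sbatch to read the same
    # number in KiB for the job's memory ceiling.
    export MANEAGE_RAMDISK_MAX=10485760                # 10 GiB in KiB
    sbatch --mem=${MANEAGE_RAMDISK_MAX}K --export=ALL ./run_maneage.sh

where _./run_maneage.sh_ would run _./project configure_ and _./project make_
as usual.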

In this way, Maneage will not need to know which external task manager is
being used; it will only know that, for some reason, there is an extra
constraint on the RAM disk beyond what the command line _df /dev/shm_ reports.
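
In shell terms, folding in that extra constraint could look roughly like
this (a sketch, assuming MANEAGE_RAMDISK_MAX is given in KiB like the
output of 'df'; none of this is existing Maneage code):

    # Hypothetical: take the smaller of the physical free space reported
    # by 'df /dev/shm' and the externally imposed ceiling, if one is set.
    avail_kib=$(df /dev/shm | awk 'NR==2 {print $4}')
    if [ -n "$MANEAGE_RAMDISK_MAX" ] \
           && [ "$MANEAGE_RAMDISK_MAX" -lt "$avail_kib" ]; then
        avail_kib=$MANEAGE_RAMDISK_MAX
    fi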

This additional memory-constraint parameter should be documented in the .conf
file.

SLURM: 
* https://en.wikipedia.org/wiki/Slurm_Workload_Manager
* https://slurm.schedmd.com

    _______________________________________________________

Reply to this item at:

  <https://savannah.nongnu.org/bugs/?62879>

_______________________________________________
Message sent via Savannah
https://savannah.nongnu.org/



