[bug #62879] Maneage handling of /dev/shm does not know about slurm
From: Boud Roukema
Subject: [bug #62879] Maneage handling of /dev/shm does not know about slurm
Date: Mon, 8 Aug 2022 16:24:50 -0400 (EDT)
URL:
<https://savannah.nongnu.org/bugs/?62879>
Summary: Maneage handling of /dev/shm does not know about slurm
Project: Maneage
Submitter: boud
Submitted: Mon 08 Aug 2022 08:24:49 PM UTC
Category: Software
Severity: 3 - Normal
Item Group: Crash
Status: None
Privacy: Public
Assigned to: None
Open/Closed: Open
Discussion Lock: Any
_______________________________________________________
Follow-up Comments:
-------------------------------------------------------
Date: Mon 08 Aug 2022 08:24:49 PM UTC By: Boud Roukema <boud>
DESCRIPTION: In both (i) reproduce/software/shell/configure.sh and (ii)
reproduce/software/make/basic.mk we test the amount of RAM available in
/dev/shm. Firstly, these two tests should be associated with each other in
some way, since someone may not realise that there are two separate tests.
Secondly, neither test takes into account RAM limits imposed by a task
allocation manager such as 'slurm' on a shared HPC (high-performance
computing) system. This can lead to a crash: the 'slurm' RAM limit is
violated, and Maneage (via (i) and (ii)) cannot protect against that.
EXPERIENCE: This *has* led to several crashes, with error messages while
compiling gcc such as
./gt-dwarf2out.h:2673:2: fatal error: error writing to /tmp/ccch3SRR.s: Cannot
allocate memory
where _/tmp_ is also a RAM disk of type _tmpfs_. Although this error occurs
when trying to write to /tmp, not to /dev/shm, the _slurm_ tools presumably
check the total RAM used by the user's task. Increasing the _--mem=..._
parameter given to _sbatch_ by a few GiB prevents the crashes, without
interfering with memory allocation in /tmp.
SUGGESTION:
(1) unify and modularise the two tests in configure.sh and basic.mk, in the
sense that the variables setting the required amount of memory should be set
in a single location, such as _reproduce/software/config/ramdisk.conf_;
(2) print this RAM disk info so that it is visible in the log file;
(3) design an optional mechanism by which Slurm (or other task managers;
Slurm is licensed as free software) can interact with Maneage.
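For (1), the shared configuration file might look something like the sketch
below (the variable name and value are hypothetical, not an existing Maneage
setting):

```
# reproduce/software/config/ramdisk.conf  (sketch; name and value illustrative)
# Minimum free space, in KiB, required in /dev/shm before building there.
ramdisk-min-space = 2097152
```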
Possibility for (3):
For example, allow a third optional environment variable, MANEAGE_RAMDISK_MAX,
used together with the _reproduce/software/config/ramdisk.conf_ variables to
decide whether it is necessary to quit, or to create
_$(BDIR)/software/build-tmp-gcc-due-to-lack-of-space_ and keep going. In this
case, the script submitted to a slurm tool (e.g. _sbatch_) should set
_--mem=${MANEAGE_RAMDISK_MAX}_ and feed MANEAGE_RAMDISK_MAX through to Maneage
(and Maneage would have to pass it through into _configure.sh_ and
_basic.mk_).
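The decision logic above could be sketched roughly as follows (a hypothetical
illustration only: the function name, the KiB units, and the example numbers
are all assumptions, not existing Maneage code):

```shell
#!/bin/sh
# Sketch: decide whether the RAM disk may be used, honouring an optional
# external cap MANEAGE_RAMDISK_MAX (in KiB) set e.g. by an sbatch wrapper.

# ramdisk_usable AVAILABLE_KIB NEEDED_KIB
# Prints "yes" if the effective limit allows NEEDED_KIB, else "no".
ramdisk_usable() {
    avail=$1
    needed=$2
    # If the task manager imposes a tighter cap than 'df' reports, use it.
    if [ -n "${MANEAGE_RAMDISK_MAX:-}" ] \
       && [ "$MANEAGE_RAMDISK_MAX" -lt "$avail" ]; then
        avail=$MANEAGE_RAMDISK_MAX
    fi
    if [ "$avail" -ge "$needed" ]; then echo yes; else echo no; fi
}

# Example: 'df' reports 8 GiB free in /dev/shm, but slurm only grants
# 2 GiB; a 4 GiB build should then fall back to building on disk.
MANEAGE_RAMDISK_MAX=2097152
ramdisk_usable 8388608 4194304    # prints "no"
unset MANEAGE_RAMDISK_MAX
ramdisk_usable 8388608 4194304    # prints "yes"
```

In real use, AVAILABLE_KIB would come from something like
_df --output=avail /dev/shm_, and a "no" answer would trigger the existing
fall-back of building in _$(BDIR)_ instead of the RAM disk.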
In this way, Maneage does not need to know which external task manager is
being used; it only knows that, for some reason, there is a constraint on the
RAM disk tighter than what the command line _df /dev/shm_ reports.
This third memory constraint parameter should be documented in the .conf
file.
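On the submission side, the user's slurm job script might then look something
like this (a hypothetical sketch: the job name, memory value, and project
invocation are illustrative, and --mem must agree with MANEAGE_RAMDISK_MAX):

```
#!/bin/bash
#SBATCH --job-name=maneage-configure
#SBATCH --mem=4G     # must agree with MANEAGE_RAMDISK_MAX below (4 GiB)

# Hypothetical: cap in KiB, fed through to Maneage's configure step.
export MANEAGE_RAMDISK_MAX=4194304
./project configure --existing-conf
```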
SLURM:
* https://en.wikipedia.org/wiki/Slurm_Workload_Manager
* https://slurm.schedmd.com
_______________________________________________________
Reply to this item at:
<https://savannah.nongnu.org/bugs/?62879>
_______________________________________________
Message sent via Savannah
https://savannah.nongnu.org/