Farming


On this page, we present working examples of how to "farm" simulations on computing clusters.

On Adastra (CINES)

One node on Adastra has 192 cores. When your favorite PCM configuration does not scale easily to a full node, you can use shared nodes, but there are not many of them and you end up queueing a lot...

If you opt for the

#SBATCH --exclusive

option, you get the whole node to yourself and your jobs start sooner, but all 192 cores are billed in core-hours for the whole run...

To make your runs more efficient, you can group several simulations on the same node, so that they run at the same time, under the same launching script, on your exclusive node. You share the node only with yourself! This way you avoid being billed for unused core-hours.

To do this, here is an example:

#!/bin/bash
#SBATCH --account=cin0391
#SBATCH --job-name=vcd2dot4
#SBATCH --output=%x_%A.out
#SBATCH --error=%x_%A.err
#SBATCH --constraint=GENOA
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --time=24:00:00

### source your environment here

ulimit -s unlimited
set -eux

workdir1=<path-to-your-first-job>
workdir2=<path-to-your-second-job>

cd $workdir1
### Activities to prepare your first job
cd $workdir2
### Activities to prepare your second job

#### launch first job:
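#### (the trailing & runs the srun step in the background and $! records its PID,
####  so both jobs execute at the same time on their share of the node)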
cd $workdir1
srun --ntasks-per-node=96 --cpus-per-task=1 --threads-per-core=1 -- your.exe > output1 2>&1 &
PID1=$!
#### launch second job:
cd $workdir2
srun --ntasks-per-node=96 --cpus-per-task=1 --threads-per-core=1 -- your.exe > output2 2>&1 &
PID2=$!

#### wait for each job separately and report its exit status
#### (a bare "wait" always returns 0, so it cannot tell you whether the jobs succeeded)
status1=0; wait $PID1 || status1=$?
status2=0; wait $PID2 || status2=$?
echo "job 1 (PID $PID1) finished with status $status1"
echo "job 2 (PID $PID2) finished with status $status2"
date

### activities after the end of both jobs

Each job runs in its own directory, and the last part of the script is executed once both jobs are done. Here, 96 cores are used for each job. You can split the cores as you wish, but keep the jobs balanced so that they finish at roughly the same time. You can also run 4 jobs on 48 cores each, 8 jobs on 24 cores each, etc., as sketched below.
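If you want to farm more than two jobs, the same pattern can be written as a loop. Below is a minimal sketch, not taken from a tested script: the number of jobs, the placeholder work directories and your.exe are assumptions to adapt to your case.

NJOBS=4
CORES_PER_JOB=$((192 / NJOBS))
workdirs=(<path-to-job-1> <path-to-job-2> <path-to-job-3> <path-to-job-4>)

pids=()
for dir in "${workdirs[@]}"; do
    cd "$dir"
    ### Activities to prepare this job, then launch it in the background
    srun --ntasks-per-node=$CORES_PER_JOB --cpus-per-task=1 --threads-per-core=1 -- your.exe > output 2>&1 &
    pids+=($!)
done

### wait for every job and report its exit status
for pid in "${pids[@]}"; do
    st=0; wait "$pid" || st=$?
    echo "job with PID $pid finished with status $st"
done
date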

Be aware of memory usage! If a job needs more memory than the amount that comes with its cores, you will have to leave some cores idle to get all the memory that job needs.

Example: each node has 768 GB of memory, i.e. 4 GB per core. If your job needs 8 GB per core, you can use only 96 of the 192 cores (96 × 8 GB = 768 GB).
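In that case, with two memory-hungry jobs on the node, each srun line would request 48 tasks instead of 96, deliberately leaving half of the cores idle. This is only a sketch under that 8 GB-per-task assumption; depending on the site's Slurm configuration you may also want to make the memory request explicit with srun's --mem-per-cpu option.

#### memory-bound variant: each task needs ~8 GB, so at most 96 tasks fit
#### in the node's 768 GB (96 x 8 GB = 768 GB); the other 96 cores stay idle
cd $workdir1
srun --ntasks-per-node=48 --cpus-per-task=1 --threads-per-core=1 -- your.exe > output1 2>&1 &
PID1=$!
cd $workdir2
srun --ntasks-per-node=48 --cpus-per-task=1 --threads-per-core=1 -- your.exe > output2 2>&1 &
PID2=$!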