Parallelism

This page is mainly based on the LMD Generic GCM user manual (https://trac.lmd.jussieu.fr/Planeto/browser/trunk/LMDZ.GENERIC/ManualGCM_GENERIC.pdf). It is still in development and needs further improvements.

What is parallelism?

Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time.

In short: parallelism can help you save time. Running the code on more cores gives the same results as the serial version, but sooner.

Indeed, as the problem is cut into smaller parts that are solved simultaneously, the waiting (wall clock) time for the user is reduced. However, this usually comes at a price: the overheads and extra computations due to the parallelization, along with some inherent inefficiencies from computations that must be done sequentially, can increase the total computational cost.
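
As a rule of thumb, the maximum achievable speedup is bounded by the fraction of the work that must remain sequential (Amdahl's law: speedup on N cores = 1 / (s + (1 - s)/N)). The shell snippet below is purely illustrative and assumes a hypothetical sequential fraction s = 5%; it is not a measurement of the PCM itself.

# Purely illustrative: Amdahl's law upper bound on speedup,
# assuming a hypothetical sequential fraction s = 0.05 (5%)
s=0.05
for n in 1 2 4 8 16 24; do
  echo "$n cores: max speedup = $(echo "scale=2; 1/($s + (1 - $s)/$n)" | bc -l)"
done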

How parallelism is implemented in the model

The main factor that constrains how the code is parallelized is that, in the physics, atmospheric columns are "independent" of one another, whereas in the dynamics the flow is 3D, with strong coupling between neighboring cells.

Parallelism with the lon-lat (LMDZ) dynamical core

  • MPI tiling: In the lon-lat dynamics the globe is tiled in regions covering all longitudes and a few latitudes. In practice these latitude bands must contain at least 2 points. There is therefore a limit on the number of MPI processes one may run with: for a given number of latitude intervals jjm, one may use at most jjm/2 processes (for example, if the horizontal grid is 64x48 in lonxlat, one may use at most 48/2=24 MPI processes).
  • OpenMP (OMP): In the dynamics this parallelism is implemented on the loops along the vertical. One could thus use as many OpenMP threads as there are model levels. In practice, however, the speedup breaks down well before that, and it is recommended to keep OpenMP chunks of at least ten vertical levels each. Therefore, for a simulation with llm altitude levels, one should target at most llm/10 OpenMP threads (e.g. for a 64x48x54 grid, target at most 5 OpenMP threads).
  • In practice: One will want to use both MPI and OpenMP for simulations, with as many MPI processes as possible combined with a reasonable number of OpenMP threads per MPI process. Depending on the cluster used, the speedup as a function of the number of MPI processes and OpenMP threads can vary a lot; it is therefore recommended to test different combinations to find the "optimal" setup for a given grid (see the sketch after this list).
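
As a quick illustration of the two limits above, the following shell snippet (purely illustrative; the grid values are simply the 64x48x54 example used above) computes the maximum number of MPI processes, the maximum useful number of OpenMP threads, and the resulting maximum core count:

# Illustrative sketch of the decomposition limits for a 64x48x54 grid
jjm=48   # number of latitude intervals
llm=54   # number of vertical levels
echo "max MPI processes:         $(( jjm / 2 ))"          # at least 2 latitude bands per process
echo "max useful OpenMP threads: $(( llm / 10 ))"         # chunks of ~10 vertical levels per thread
echo "max cores for this grid:   $(( (jjm / 2) * (llm / 10) ))"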

How to compile the parallel version of the PCM

To compile the model in parallel, use the same command as for a serial build (see e.g. the "Compiling" section of Quick Install and Run or the description of the makelmdz_fcm script and its options) and add the following option:

 -parallel

Then there are three choices of parallelism: MPI only, OpenMP only, or mixed (i.e. combined) MPI_OMP:

 -parallel mpi
 -parallel omp
 -parallel mpi_omp

So the command line to generate the Generic PCM to run in mixed MPI and OpenMP mode will be, for example:

./makelmdz_fcm -s XX -d LONxLATxALT -b IRxVI -p std -arch archFile -parallel mpi_omp gcm
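
For instance, with placeholder values (the -s, -d and -b values and the arch file name "myarch" below are hypothetical and must be replaced by those of your own setup), this could look like:

# hypothetical example: 64x48x54 grid, arch file "myarch"
./makelmdz_fcm -s 2 -d 64x48x54 -b 38x36 -p std -arch myarch -parallel mpi_omp gcm

As in the job script examples further down, the name of the resulting executable includes the grid dimensions and ends with a _para suffix (e.g. gcm_64x48x29_phymars_para.e for a Mars PCM build).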

How to run in parallel

Run interactively

  • MPI only:
mpirun -np N gcm.e > gcm.out 2>&1

The -np N option specifies the number of MPI processes to run on.

IMPORTANT: one MUST use the mpirun command corresponding to the mpif90 compiler specified in the arch file.
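
A quick way to check this (illustrative; the exact commands and output depend on the MPI distribution installed on your machine) is to verify that mpirun and mpif90 come from the same MPI installation:

# check that mpirun and mpif90 belong to the same MPI installation
which mpif90 mpirun
# display the compiler wrapped by mpif90 ('-show' for MPICH/Intel MPI, '-showme' for OpenMPI)
mpif90 -show
mpirun --version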

IMPORTANT: One can use at most one MPI process for every 2 points along the latitude (e.g. a maximum of 24 processes for a horizontal grid of 64x48). If you try to use too many MPI processes you will get the following error message (in French!!):

 Arret : le nombre de bande de lattitude par process est trop faible (<2).
  ---> diminuez le nombre de CPU ou augmentez la taille en lattitude

which translates as: "Stop: the number of latitude bands per process is too small (<2) ---> decrease the number of CPUs or increase the latitude dimension".

Output files (restart.nc, diagfi.nc, etc.) are the same as when running in serial, but standard output messages are written by each process. If using chained simulations (run_mcd/run0 scripts), the command line that runs the gcm in run0 must be adapted to the local settings.

NB: The LMDZ.COMMON dynamics are set to run in double precision, so keep the NC_DOUBLE declaration (and the real-to-double-precision promotion) in the arch files.


  • Mixed MPI_OMP:
export OMP_NUM_THREADS=2
export OMP_STACKSIZE=2500MB
mpirun -np 2 gcm.e > gcm.out 2>&1

In this example, each of the 2 MPI processes runs 2 OpenMP threads, each with a 2500 MB stack.
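
For a larger run on the 64x48x54 grid used as an example above, one could (resources permitting) push this to the limits derived earlier, i.e. 24 MPI processes with 5 OpenMP threads each (120 cores in total), keeping the same (illustrative) stack size:

export OMP_NUM_THREADS=5
export OMP_STACKSIZE=2500MB
mpirun -np 24 gcm.e > gcm.out 2>&1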

Run with a job scheduler

This will be different for each machine. Some examples are provided here but will need to be adapted to each configuration and machine; see also the pages dedicated to some of the clusters we use, such as Using the MESOIPSL cluster, Using Irene Rome or Using Adastra. A generic SLURM sketch is also given at the end of this section.

  • MPI only:
PBS example (on Ciclad):
#PBS -S /bin/bash
# job name
#PBS -N job_mpi08
# queue
#PBS -q short
# merge standard error and standard output
#PBS -j eo
# request 1 node with 8 processors
#PBS -l "nodes=1:ppn=8"
# go to directory where the job was launched
cd $PBS_O_WORKDIR
mpirun gcm_64x48x29_phymars_para.e > gcm.out 2>&1
LoadLeveler example (on Gnome):
# @ job_name = job_mpi8
# standard output file
# @ output = job_mpi8.out.$(jobid)
# standard error file
# @ error = job_mpi8.err.$(jobid)
# job type
# @ job_type = mpich
# @ blocking = unlimited
# time
# @ class = AP
# Number of procs
# @ total_tasks = 8
# @ resources=ConsumableCpus(1) ConsumableMemory(2500 mb)
# @ queue
set -vx
mpirun gcm_32x24x11_phymars_para.e > gcm.out 2>&1
LoadLeveler example (on Ada):
module load intel/2012.0
# @ output = output.$(jobid)
# @ error = $(output)
# @ job_type = parallel
## Number of MPI process
# @ total_tasks = 8
## Memory used by each MPI process
# @ as_limit = 2500mb
# @ wall_clock_limit=01:00:00
# @ core_limit = 0
# @ queue
set -x
poe ./gcm.e -labelio yes > LOG 2>&1
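
Whatever the scheduler, the job script is then submitted with the scheduler's own submission command; for instance (assuming the script above was saved in a file named job_mpi8):

qsub job_mpi8        # PBS (e.g. on Ciclad)
llsubmit job_mpi8    # LoadLeveler (e.g. on Gnome or Ada)
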
  • Mixed MPI_OMP:
LoadLeveler example (on Gnome):
# @ job_name = job_mpi8
# standard output file
# @ output = job_mpi8.out.$(jobid)
# standard error file
# @ error = job_mpi8.err.$(jobid)
# job type
# @ job_type = mpich
# @ blocking = unlimited
# time
# @ class = AP
# Number of procs
# @ total_tasks = 8
# @ resources=ConsumableCpus(1) ConsumableMemory(5000 mb)
# @ queue
set -vx
export OMP_NUM_THREADS=2 # otherwise 8 OpenMP threads are launched by default
export OMP_STACKSIZE=2500MB
mpirun gcm_32x24x11_phymars_para.e > gcm.out 2>&1

IMPORTANT: ConsumableMemory must be equal to OMP_NUM_THREADS × OMP_STACKSIZE (here 2 × 2500 MB = 5000 MB). In this case, we are using 8 × 2 = 16 cores.

LoadLeveler example (on Ada):
module load intel/2012.0
# @ output = output.$(jobid)
# @ error = $(output)
# @ job_type = parallel
## Number of MPI process
# @ total_tasks = 8
## Number of OpenMP tasks attached to each MPI process
# @ parallel_threads = 2
## Memory used by each MPI process
# @ as_limit = 5gb
# @ wall_clock_limit=01:00:00
# @ core_limit = 0
# @ queue
set -x
export OMP_STACKSIZE=2500MB
poe ./gcm.e -labelio yes > LOG 2>&1

IMPORTANT: In this case, each core needs 2.5 GB and we are using 2 OpenMP threads for each MPI process, so as_limit = 2 × 2.5 GB = 5 GB.
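
The more recent clusters mentioned above (MESOIPSL, Irene Rome, Adastra) use SLURM. As a rough, generic sketch only (partition, memory options and launcher must be adapted following the local documentation; the values below simply mirror the 8 MPI × 2 OpenMP example), a mixed MPI_OMP job could look like:

#!/bin/bash
#SBATCH --job-name=gcm_mpi_omp
#SBATCH --nodes=1
#SBATCH --ntasks=8             # number of MPI processes
#SBATCH --cpus-per-task=2      # number of OpenMP threads per MPI process
#SBATCH --time=01:00:00
#SBATCH --output=gcm.%j.out
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_STACKSIZE=2500MB
mpirun -np $SLURM_NTASKS gcm.e > gcm.out 2>&1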