Running the Venus PCM in parallel


Level of parallelism available in the Venus PCM

Currently, the Venus physics package is only adapted to MPI parallelism.

Compiling the Venus PCM - LMDZ model with MPI enabled

This requires, as prerequisites, that:

  • the Venus PCM is already installed and working in serial (i.e. as discussed in Quick Install and Run Venus PCM)
  • an MPI library is installed and available (if none is already available you'll have to build one)
  • The relevant architecture "arch" files (more about these here) have been adapted as needed. In practice this usually means specifying in the arch.fcm file that the compiler is mpif90, and adding the paths to the MPI library include and lib directories to the %MPI_FFLAGS and %MPI_LD lines, e.g.:
%COMPILER            mpif90
%LINK                mpif90
%AR                  ar
%MAKE                make
%FPP_FLAGS           -P -traditional
%FPP_DEF             NC_DOUBLE
%BASE_FFLAGS         -c -fdefault-real-8 -fdefault-double-8 -ffree-line-length-none -fno-align-commons 
%PROD_FFLAGS         -O3
%DEV_FFLAGS          -O
%DEBUG_FFLAGS        -ffpe-trap=invalid,zero,overflow -fbounds-check -g3 -O0 -fstack-protector-all -finit-real=snan -fbacktrace
%MPI_FFLAGS          -I/usr/openmpi/include
%OMP_FFLAGS         
%BASE_LD     
%MPI_LD              -L/usr/openmpi/lib -lmpi
%OMP_LD

In the example above, the MPI library is installed in /usr/openmpi.
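
Before compiling, it may be worth checking that the MPI wrappers and launcher referred to above are actually reachable, and recovering the include/lib paths to put on the %MPI_FFLAGS and %MPI_LD lines. A minimal sketch, assuming an OpenMPI installation (other MPI libraries provide similar but differently named options):

# check that the compiler wrapper and the launcher are in the PATH
which mpif90 && mpif90 --version
which mpirun && mpirun --version
# OpenMPI only: print the include and library directories to use in
# the %MPI_FFLAGS and %MPI_LD lines of the arch.fcm file
mpif90 --showme:incdirs
mpif90 --showme:libdirs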

Once all the prerequisites are met, one merely needs to compile the PCM with the usual makelmdz_fcm script (located in LMDZ.COMMON) with the -parallel mpi option, e.g.:

./makelmdz_fcm -arch local -p venus -d 48x32x50 -parallel mpi -j 8 gcm

If the compilation went well, the executable will be generated in LMDZ.COMMON/bin and, with the options from the example above, will be called gcm_48x32x50_phyvenus_para.e.
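
In practice it can also be convenient to redirect the compilation messages to a log file for later inspection and to check that the expected executable was indeed produced; a sketch, reusing the arch name and grid from the example above:

cd LMDZ.COMMON
./makelmdz_fcm -arch local -p venus -d 48x32x50 -parallel mpi -j 8 gcm > makelmdz.log 2>&1
ls -l bin/gcm_48x32x50_phyvenus_para.e   # present only if the compilation succeeded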

Running in parallel

For the sake of simplicity, we assume here that we are following up on the case described in Quick Install and Run Venus PCM, which we now want to run in parallel with MPI. For that, once gcm_48x32x50_phyvenus_para.e has been created, one simply needs to copy it from LMDZ.COMMON/bin to the directory containing the initial conditions and parameter files, e.g. bench_48x32x50, and then:

1. (optional but recommended step) source the environment architecture file (the very same that was used to compile the model), e.g.:

source ../LMDZ.COMMON/arch.env

2. execute the model on a given number of processors, for instance 4:

mpirun -np 4 gcm_48x32x50_phyvenus_para.e > gcm.out 2>&1

With this command line, the (text) output messages are redirected to a text file, gcm.out. It is convenient to keep this file for later inspection (e.g. to track down a bug). Without the redirection (i.e. just mpirun -np 4 gcm_48x32x50_phyvenus_para.e), the output is displayed directly on the screen.
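
Putting the above steps together, a minimal launch script could look as follows. This is only a sketch: the script name run_venus_mpi.sh, the bench_48x32x50 directory layout and the number of processes are assumptions to adapt to your own setup:

#!/bin/bash
# run_venus_mpi.sh -- illustrative helper gathering the steps described above
set -e
cd bench_48x32x50                                    # directory with the start and parameter files
cp ../LMDZ.COMMON/bin/gcm_48x32x50_phyvenus_para.e .
source ../LMDZ.COMMON/arch.env                       # same environment as used for compilation
mpirun -np 4 ./gcm_48x32x50_phyvenus_para.e > gcm.out 2>&1   # keep the text output in gcm.out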

Illustration of model speedup with MPI

For what it's worth, here are some performance results when running 0.05 Vday (96x96x78 grid, with chemistry) with different numbers of processes on the Irene-Rome supercomputer:

Number of processes | Run time       | Speedup wrt 8 cores           | Speedup wrt 16 cores
        8           |  6h15 = 375mn  |                               | 
       16           |  4h05 = 245mn  | ~0.765 (=375.*(8./16.)/245.)  | 
       24           |  3h02 = 182mn  | ~0.686 (=375.*(8./24.)/182.)  | ~0.897 (=245.*(16./24.)/182.)
       32           |  2h29 = 149mn  | ~0.629 (=375.*(8./32.)/149.)  | ~0.822 (=245.*(16./32.)/149.)
       48           |  1h56 = 116mn  | ~0.539 (=375.*(8./48.)/116.)  | ~0.704 (=245.*(16./48.)/116.)

Note the reasonable speedup up to ~32 cores, but the limited improvement when going to the maximum of 48 usable cores (since the model grid includes 96 latitude intervals and each MPI process must handle at least 2 of them; see the comments below).
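
The "speedup" columns in the table above are in fact parallel efficiencies: the gain in run time divided by the increase in core count, E = T_ref*(N_ref/N)/T_N, so a value of 1 would mean perfect scaling. For instance, the 48-core figure relative to 16 cores can be recomputed with a one-line check (any calculator will do):

echo "245*(16/48)/116" | bc -l    # ~0.704, as in the table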

Miscellaneous comments

  • There is a limitation to the number of MPI processes one may run with: each process needs to handle at least 2 bands of latitude. Put another way, for a given number of latitude intervals jjm, one may use at most jjm/2 processes (in the example above the grid is 48x32 in lon x lat, so one could use at most 32/2=16 processes).
  • If running with the IOIPSL library, there will be as many output (histmnth and histins) files as there are processes. One will need to recombine them into single global files using the dedicated rebuild tool (see the sketch after this list).
  • PCM start and restart files are insensitive to whether the model was run in serial or in parallel. In fact, results should be identical (at least in debug mode, where there are no optimisations) between a serial run and a parallel one (regardless of the number of cores used).
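
For the IOIPSL case mentioned in the list above, the recombination step might look like the following. This is a hedged sketch: the per-process file naming and the exact options can differ between installations, so check rebuild -h first:

# recombine the per-process IOIPSL outputs into single global files
rebuild -o histmnth.nc histmnth_00*.nc
rebuild -o histins.nc  histins_00*.nc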