Running the Venus PCM in parallel
Level of parallelism available in the Venus PCM
Currently the Venus physics package is only adapted to MPI parallelism.
Compiling the Venus PCM - LMDZ model with MPI enabled
This requires, as prerequisites, that:
- the Venus PCM is already installed and working in serial (i.e. as discussed in Quick Install and Run Venus PCM)
- an MPI library is installed and available (if none is already available you'll have to build one)
- the relevant architecture "arch" files (more about these here) have been adapted as needed. In practice this usually means specifying in the arch.fcm file that the compiler is mpif90, and adding the paths to the MPI library's include and lib directories to the %MPI_FFLAGS and %MPI_LD lines, e.g.:
%COMPILER            mpif90
%LINK                mpif90
%AR                  ar
%MAKE                make
%FPP_FLAGS           -P -traditional
%FPP_DEF             NC_DOUBLE
%BASE_FFLAGS         -c -fdefault-real-8 -fdefault-double-8 -ffree-line-length-none -fno-align-commons
%PROD_FFLAGS         -O3
%DEV_FFLAGS          -O
%DEBUG_FFLAGS        -ffpe-trap=invalid,zero,overflow -fbounds-check -g3 -O0 -fstack-protector-all -finit-real=snan -fbacktrace
%MPI_FFLAGS          -I/usr/openmpi/include
%OMP_FFLAGS
%BASE_LD
%MPI_LD              -L/usr/openmpi/lib -lmpi
%OMP_LD
In the example above, the MPI library is installed in /usr/openmpi.
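If you are unsure which include and lib paths to put on the %MPI_FFLAGS and %MPI_LD lines, the MPI compiler wrapper can usually report them. As a sketch (the --showme options below are Open MPI wrapper syntax; other MPI implementations use different flags, e.g. -show for MPICH):

which mpif90
mpif90 --version
mpif90 --showme:compile    # prints the compile flags the wrapper uses, including the include path (Open MPI)
mpif90 --showme:link       # prints the link flags the wrapper uses, including the lib path (Open MPI)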
Once all the prerequisites are met, one merely needs to compile the PCM with the usual makelmdz_fcm script (located in LMDZ.COMMON) with the -parallel mpi option, e.g.:
./makelmdz_fcm -arch local -p venus -d 48x32x50 -parallel mpi -j 8 gcm
If the compilation went well, the executable will be generated in LMDZ.COMMON/bin and, with the options from the example above, will be called gcm_48x32x50_phyvenus_para.e.
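A quick sanity check that the parallel executable was indeed produced (the path simply follows the compilation example above):

ls -l LMDZ.COMMON/bin/gcm_48x32x50_phyvenus_para.e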
Running in parallel
For the sake of simplicity, we assume here that we are following up on the case described in Quick Install and Run Venus PCM, which we now want to run in parallel with MPI. For that, once gcm_48x32x50_phyvenus_para.e has been created, one simply needs to copy it from LMDZ.COMMON/bin to the directory containing the initial conditions and parameter files, e.g. bench_48x32x50, and then:
1. (optional but recommended) source the environment architecture file (the very same one that was used to compile the model), e.g.:
source ../LMDZ.COMMON/arch.env
2. execute the model on a given number of processors, for instance 4:
mpirun -np 4 gcm_48x32x50_phyvenus_para.e > gcm.out 2>&1
With this command line, the (text) output messages are redirected to a text file, gcm.out. It is convenient to keep this file for later inspection (e.g., to track down a bug). If there is no redirection (only mpirun -np 4 gcm_48x32x50_phyvenus_para.e), then the output is written directly to the screen.
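If you want to keep gcm.out and still watch the messages scroll by, one common alternative (plain shell usage, not specific to the PCM) is to pipe the output through tee:

mpirun -np 4 gcm_48x32x50_phyvenus_para.e 2>&1 | tee gcm.out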
Illustration of model speedup with MPI
For what it's worth, here are some performance results when running 0.05 Vday (96x96x78 grid, with chemistry) with different numbers of processes on the Irene-Rome supercomputer:
Number of processes | Run time     | Speedup wrt 4 cores          | Speedup wrt 8 cores          | Speedup wrt 16 cores
                  4 | 9h53 = 593mn |                              |                              |
                  8 | 6h15 = 375mn | ~0.791 (=593.*(4./8.)/375.)  |                              |
                 16 | 4h05 = 245mn | ~0.605 (=593.*(4./16.)/245.) | ~0.765 (=375.*(8./16.)/245.) |
                 24 | 3h02 = 182mn | ~0.543 (=593.*(4./24.)/182.) | ~0.686 (=375.*(8./24.)/182.) | ~0.897 (=245.*(16./24.)/182.)
                 32 | 2h29 = 149mn | ~0.497 (=593.*(4./32.)/149.) | ~0.629 (=375.*(8./32.)/149.) | ~0.822 (=245.*(16./32.)/149.)
                 48 | 1h56 = 116mn | ~0.426 (=593.*(4./48.)/116.) | ~0.539 (=375.*(8./48.)/116.) | ~0.704 (=245.*(16./48.)/116.)
Note the reasonable speedup up to ~32 cores, but limited improvement when going to the maximum of 48 usable cores (since the model includes 96 latitude intervals).
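For clarity, each "Speedup wrt N cores" entry in the table is the measured speedup normalized by the increase in core count, i.e. (N-core run time) x (N / number of processes) / (run time); a value of 1 would mean perfect scaling. For example, the 32-core entry relative to the 16-core run can be checked with a shell calculator (bc is used here purely as an illustration):

echo "245 * (16/32) / 149" | bc -l    # ~0.822, as in the table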
Miscellaneous comments
- There is a limitation to the number of MPI processes one may run with. Each process needs to handle at least 2 bands of latitude. Put another way, for a given number of latitude intervals jjm one may use at most jjm/2 processes (in the example above the grid is 48x32 in lonxlat so one could use at most 32/2=16 processes).
- If running with the IOIPSL library, there will be as many output (histmnth and histins) files as there are processes. One will need to recombine them into single global files, using the dedicated rebuild tool (see the sketch after this list).
- PCM start and restart files are insensitive to whether the model was run in serial or parallel. In fact results should be identical (at least in debug mode, where there are no optimisations) between a serial run or a parallel one (regardless of the number of cores used).
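As a sketch of the recombination step mentioned above: merging the per-process history files with the IOIPSL rebuild tool typically looks like the lines below. The exact per-process file names (e.g. histins_0000.nc, histins_0001.nc, ...) and the -o output option reflect common IOIPSL usage; check the documentation of your IOIPSL installation if yours differs.

rebuild -o histins.nc histins_00*.nc      # merge the per-process instantaneous output files
rebuild -o histmnth.nc histmnth_00*.nc    # merge the per-process monthly-mean output files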