Running Mars mesoscale model
This is a page where I (Jorge) keep some notes and reminders that could be useful for other users.
The user manual can be found at https://gitlab.in2p3.fr/la-communaut-des-mod-les-atmosph-riques-plan-taires/git-trunk/-/blob/master/MESOSCALE/MANUAL/SRC/user_manual.pdf?ref_type=heads
==Regarding NaNs and parallel running==
At the moment the mesoscale model cannot really run in parallel: it runs, but it outputs NaNs everywhere.

When running on a single core, at least in my case, it produces NaNs with a low output frequency (an interval of 37), but increasing the output frequency (an interval of 6 works for me) solves the problem. This is weird, but it seems to be the case.

Increasing the output frequency does not help in parallel runs: even an interval of 1 does not work.
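
For reference, a minimal sketch of what this single-core workaround looks like in namelist.input, assuming that the "output frequency" above refers to the history_interval entry of the &time_control section (my reading of these notes, not something confirmed here); the value is in the same "minutes" used by restart_interval below:

<syntaxhighlight>
&time_control
 ! write history output every 6 model "minutes" instead of
 ! every Martian hour (37); this avoids the NaNs in my single-core runs
 history_interval = 6,
/
</syntaxhighlight>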
==Restarting simulations==
In the file namelist.input there is the option "restart". The comment for this option says "Output restart files?", but it actually controls whether or not the simulation should try to start from a previous simulation. Restart files are saved periodically; how often is configured with restart_interval. Its value is an integer corresponding to the number of "minutes" (with a Martian hour having 37 minutes) between restart files. By default this is very long (8880, corresponding to 10 sols). If we write 37, a restart file is saved every Martian hour; with 37*6 = 222 it is every 6 hours, and so on.
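
For example, a &time_control block that saves a restart file every 6 Martian hours could look like this (a minimal sketch; only the entries discussed here are shown):

<syntaxhighlight>
&time_control
 restart          = .false.,   ! this run does not start from a restart file
 restart_interval = 222,       ! 37 * 6 "minutes" = one restart file every 6 Martian hours
/
</syntaxhighlight>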
To restart from an existing restart file, we must change the "restart" option to .true., and the start_month,start_day, and start_hour need to be updated to coincide with the time of the restart file. Since we can tune the starting of the simulation with a resolution of hours (and not minutes), this means that the restart files need to be saved at integer hours, and therefore the restart_interval option should have values 37*N, with N the number of hours of interval between restart files.
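
For instance, restarting from a restart file written at month 10, day 1, 06:00 would look like this (again a sketch; the date values are illustrative):

<syntaxhighlight>
&time_control
 restart          = .true.,    ! start from the matching wrfrst_d01_* file
 start_month      = 10,
 start_day        = 01,
 start_hour       = 06,
 restart_interval = 222,       ! keep saving restart files every 6 Martian hours
/
</syntaxhighlight>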
==Troubleshooting restart files in big simulations (ONGOING)==
In big simulations the restart files fail to be saved. This does not cause any other problem, and the simulation keeps running. When increasing the verbosity to one, the rsl.error file contains these error messages (note that for this debug run we set restart_interval to 1 to trigger the error quickly):
<syntaxhighlight>
d01 2024-10-01_06:01:00 med_restart_out: opening wrfrst_d01_2024-10-01_06:01:00 for writing
d01 2024-10-01_06:01:00 NetCDF error: NetCDF: One or more variable sizes violate format constraints
d01 2024-10-01_06:01:00 NetCDF error in ext_ncd_open_for_write_commit wrf_io.F90, line 1279
med_restart_out: opening wrfrst_d01_2024-10-01_06:01:00 for writing
d01 2024-10-01_06:01:00 Warning 2 DRYRUNS 1 VARIABLE in wrf_io.F90, line 2303

[This line repeats many times]

d01 2024-10-01_06:01:00 Warning 2 DRYRUNS 1 VARIABLE in wrf_io.F90, line 2303
Timing for Writing restart for domain 1: 0.06920 elapsed seconds.
d01 2024-10-01_06:01:00 Warning TRY TO CLOSE DRYRUN in ext_ncd_ioclose wrf_io.F90, line 1311
d01 2024-10-01_06:01:00 NetCDF error: NetCDF: One or more variable sizes violate format constraints
d01 2024-10-01_06:01:00 NetCDF error in ext_ncd_ioclose wrf_io.F90, line 1329
</syntaxhighlight>
The problem is in the file wrf_io.F90. Searching for this file in the code tree turns up 4 different copies of it:
<syntaxhighlight>
./code/MESOSCALE/LMD_MM_MARS/mars_lmd_new_real_mpifort_64/WRFV2/external/io_netcdf/wrf_io.F90
./code/MESOSCALE/LMD_MM_MARS/mars_lmd_new_real_mpifort_64/WRFV2/external/io_pnetcdf/wrf_io.F90
./code/MESOSCALE/LMD_MM_MARS/SRC/WRFV2/external/io_netcdf/wrf_io.F90
./code/MESOSCALE/LMD_MM_MARS/SRC/WRFV2/external/io_pnetcdf/wrf_io.F90
</syntaxhighlight>
Apparently, the problem is related to the large size of the restart files. At line 1188 (1195 in some versions of the file) we find:
<syntaxhighlight>
stat = NF_CREATE(FileName, NF_CLOBBER, DH%NCID)
</syntaxhighlight>
This is documented in https://docs.unidata.ucar.edu/netcdf-fortran/current/nc_f77_interface_guide.html
Apparently, these parameters only allow writing files smaller than 2 GB: NF_CLOBBER alone creates a classic-format netCDF file, whose variable-size limits are what triggers the "variable sizes violate format constraints" error above. We can fix that by replacing this line with:
<syntaxhighlight>
stat = NF_CREATE(FileName, IOR(NF_CLOBBER,NF_64BIT_OFFSET), DH%NCID)
</syntaxhighlight>