Using Irene Rome
Latest revision as of 11:19, 19 July 2023

This page provides a summary of examples and tools designed to help you get used to the Irene Rome environment (as of July 2022).

How to access the cluster

For people on the "Atmosphères Planétaires" GENCI project who need to open an account on Irene-Rome, here is the procedure:

A few tips:

- choose TGCC

- give your PROFESSIONAL phone number (and not your personal cell phone number)

- name of the project: Atmosphères Planétaires Numéro du Dossier: A0120110391

- Responsable scientifique du projet: M. Ehouarn MILLOUR , ehouarn.millour@lmd.ipsl.fr, 0144275286, Nationalité: Fr

- Responsable sécurité: M. Franck Guyon, franck.guyon@lmd.ipsl.fr, 0144275277, Nationalité: Fr. Note that this is assuming you have an account on LMD computers and/or the MESOIPSL cluster. Otherwise you must provide the information for the relevant person from your lab.

- IPs & machine names to connect to Irene: 134.157.47.46 (ssh-out.lmd.jussieu.fr) and 134.157.176.129 (ciclad.ipsl.upmc.fr). Note that this is assuming you have an account on LMD computers and/or the MESOIPSL cluster. Otherwise you must provide name & IP of your institute's gateway machine.

- Choose anything you want for the 8 character password

  • And then get Ehouarn to sign the form and forward it to Franck for him to sign as well.
  • Send the signed form to hotline.tgcc@cea.fr

Some useful commands

  • To access the disks of our project on Irene ("Atmosphères Planétaires" GENCI project), add the following line in your .bash_profile file:

(.bash_profile rather than .bashrc, because otherwise scp may not find your files when fetching them from $CCCWORKDIR on irene to your machine)

module switch dfldatadir/gen10391
  • To access your work directory (to run your simulations)
cd /ccc/work/cont003/gen10391/

you can also access the work directory with:

cd $CCCWORKDIR
  • To access your store directory (for big data files: the quota there limits the number of inodes, not the file size, so it is recommended to store files of at least 50M, preferably more, e.g. big tar files of 10G or more)
cd /ccc/store/cont003/gen10391/

you can also access the store directory with:

cd $CCCSTOREDIR
  • To access the scratch directory
cd $CCCSCRATCHDIR

IMPORTANT: the scratchdir is fast to access and very big, BUT it is regularly and automatically purged! If you use it, do remember to back up your data on the WORKDIR or STOREDIR.
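As an illustration of such a backup, here is a minimal sketch that bundles a run directory into a single tar file, which also suits the store directory's preference for a few big files. The function name archive_to_store is hypothetical (it is not a TGCC command), and the default destination assumes $CCCSTOREDIR is set in your environment.

```shell
# Hypothetical helper (not a TGCC command): pack a directory into one
# tar file placed in a destination directory (default: $CCCSTOREDIR).
archive_to_store() {
    local src="$1"
    local dest_dir="${2:-$CCCSTOREDIR}"
    local name
    name="$(basename "$src")"
    # -C makes the paths inside the archive relative to the parent of src
    tar -cf "$dest_dir/$name.tar" -C "$(dirname "$src")" "$name"
}
```

For instance, archive_to_store $CCCSCRATCHDIR/run1 would create run1.tar in the store directory (run1 being a placeholder name).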

The scratch purge policy (from machine.info):

* Files not accessed for 60 days are automatically purged
* Symbolic links are not purged
* Directories that have been empty for more than 30 days are removed

There is a dedicated command

ccc_will_purge

which should return information about upcoming purges on your scratch space. Note that the database this command queries is only refreshed nightly, so it will not reflect changes you made the same day.

One nice idea is to put this command in your .bashrc, so that it runs every time you connect to Irene (you can add the -s option to avoid having a long list of files displayed on your screen). If you do so, you have to wrap it in a conditional test that checks whether the shell is interactive; otherwise it will break any rsync/scp between Irene and other machines (these sessions are not interactive and will thus block on the ccc_will_purge command):

if [ -n "${PS1:-}" ]; then
    ccc_will_purge -s
fi
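An alternative way to write the same guard, given here only as a sketch: bash lists its active option flags in the special parameter $-, which contains the letter "i" for interactive shells. The function name shell_is_interactive is made up for this example, and the extra command -v check simply skips the call on machines where ccc_will_purge is not installed.

```shell
# Return 0 when the given option flags (default: those of the current
# shell, "$-") contain "i", i.e. when the shell is interactive.
shell_is_interactive() {
    case "${1:-$-}" in
        *i*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Only call ccc_will_purge in interactive shells where the command exists
if shell_is_interactive && command -v ccc_will_purge >/dev/null 2>&1; then
    ccc_will_purge -s
fi
```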

And a nice command (copyleft Antoine Martinez and Olivier from LATMOS) to update all the "Access" and "Change" stats of the files in your current directory (and sub-directories), without changing the "Modify" stat (so that you can still sort them by date of modification with 'ls -lt') :

cd $CCCSCRATCHDIR
find . -type f -exec touch -a {} \;

This will prevent your files on the scratchdir from being purged for another 60 days.
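If the directory tree is large, touching every file can take a while. A possible variant, sketched below, only touches files whose access time is older than a chosen number of days. The function name refresh_old_atime and the default threshold of 50 days are assumptions made for this example; the 60-day limit itself comes from the purge policy quoted above.

```shell
# Hypothetical helper: refresh the access time of files that have not
# been accessed for more than a given number of days (default: 50).
refresh_old_atime() {
    local dir="${1:-.}"
    local days="${2:-50}"
    # -atime +N selects files last accessed more than N days ago;
    # touch -a updates the access time and leaves "Modify" unchanged
    find "$dir" -type f -atime +"$days" -exec touch -a {} +
}
```

Running find $CCCSCRATCHDIR -type f -atime +50 on its own first lists the files that would be touched.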

  • To access Irene Interactive Documentation:
machine.info

NB: you can also access the online documentation here: http://www-hpc.cea.fr/tgcc-public/en/html/toc/fulldoc/Introduction.html

  • To display infos about project accounting:
ccc_myproject
  • To know about user and group disk quota
ccc_quota -a
  • To know about how long your passwd will be active:
ccc_password_expiration
  • To change passwd:
passwd

Example of a job to run a GCM simulation

Mixed openMP / MPI

#!/bin/bash
# Partition to run on:
#MSUB -q rome
# project to run on 
#MSUB -A  gen10391
# disks to use
#MSUB -m  scratch,work,store
# Job name
#MSUB -r run_gcm
# Job standard output:
#MSUB -o run_gcm.%I
# Job standard error:
#MSUB -e run_gcm.%I
# number of OpenMP threads c
#MSUB -c 2
# number of MPI tasks n
#MSUB -n 16
# number of nodes to use N
#MSUB -N 1
# max job run time T (in seconds)
#MSUB -T 3600
# request exclusive use of the node (128 cores)
##MSUB -x

source ../trunk/LMDZ.COMMON/arch.env
export OMP_STACKSIZE=400M
export OMP_NUM_THREADS=2

ccc_mprun -l gcm_32x32x15_phystd_para.e > gcm.out 2>&1
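To check that a given combination of -n and -c fits on the requested number of nodes, the arithmetic can be sketched as follows: the job above uses 16 MPI tasks with 2 OpenMP threads each, i.e. 32 cores, which fits on one 128-core node. The function name nodes_needed is hypothetical and the default of 128 cores per node is taken from the comment in the script above.

```shell
# Hypothetical helper: number of nodes needed for a given number of MPI
# tasks and OpenMP threads per task, assuming 128 cores per node.
nodes_needed() {
    local ntasks="$1"
    local threads="${2:-1}"
    local cores_per_node="${3:-128}"
    # integer ceiling of (ntasks * threads) / cores_per_node
    echo $(( (ntasks * threads + cores_per_node - 1) / cores_per_node ))
}
```

For example, nodes_needed 16 2 prints 1, and nodes_needed 256 prints 2.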

Pure MPI + long run

The most important parameter for MPI-only runs is -n, which gives the total number of MPI tasks, one per core (ideally a multiple of 128, which is the number of cores per node).

Runs longer than 1 day are not possible unless you select the "long" quality-of-service (QoS).

#!/bin/bash

# Partition to run on:
#MSUB -q rome
# project to run on 
#MSUB -A  gen10391
# disks to use
#MSUB -m  scratch,work,store
# Job name
#MSUB -r run_wrf
# Job standard output:
#MSUB -o run_wrf.%I
# Job standard error:
#MSUB -e run_wrf.%I
# number of MPI tasks n (total)
#MSUB -n 256
# max job run time T (in seconds)
#MSUB -T 345600
# select quality-of-service
# - test < 30min
# - normal < 1d (default)
# - long < 3d 
#MSUB -Q long

#############################

# load the modules used to compile
source arch.env

# clean the logs
rm -rf rsl.*

# create initial state
# -- this is done on a single proc
ideal.exe
mv rsl.error.0000 ideal_rsl.error.0000
mv rsl.out.0000 ideal_rsl.out.0000

# main launch
ccc_mprun -l wrf.exe


Priority Short Job

If you have a job that requires less than 30 minutes of user time (#MSUB -T 1800 or less) and no more than 2 nodes, you can launch your job with a "test" quality-of-service (QoS). This will double the priority of your job in the queue compared to the "normal" or "long" QoS.

#!/bin/bash
...
# select quality-of-service
# - test < 30min
# - normal < 1d (default)
# - long < 3d
#MSUB -Q test

...
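The choice between the three QoS can be sketched as a small helper. The thresholds below are copied from the comments in the job scripts on this page (test < 30 min and at most 2 nodes, normal < 1 day, long < 3 days); treating them as inclusive limits is an assumption, and ccc_mqinfo reports the limits actually enforced. Both function names are hypothetical.

```shell
# Convert hours, minutes and seconds into the number of seconds
# expected by "#MSUB -T"
hms_to_seconds() {
    echo $(( $1 * 3600 + ${2:-0} * 60 + ${3:-0} ))
}

# Hypothetical helper: suggest a QoS for a walltime (in seconds) and a
# number of nodes, using the limits listed in the comments above.
suggest_qos() {
    local seconds="$1"
    local nodes="${2:-1}"
    if [ "$seconds" -le 1800 ] && [ "$nodes" -le 2 ]; then
        echo test
    elif [ "$seconds" -le 86400 ]; then
        echo normal
    elif [ "$seconds" -le 259200 ]; then
        echo long
    else
        echo "no QoS listed here allows ${seconds}s" >&2
        return 1
    fi
}
```

For example, suggest_qos "$(hms_to_seconds 0 30)" 1 prints test, while the same walltime on 4 nodes gives normal.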

Main submission commands

  • To launch the job script run_gcm.job:
ccc_msub run_gcm.job
  • To display information about your jobs:
ccc_mpp -u $USER
  • To kill job number jobid
ccc_mdel jobid
  • To display infos about a job (works both while it is running or after it has finished):
ccc_macct jobid
  • To display infos about project accounting:
ccc_myproject
  • To display infos about limits (quality-of-service QoS):
ccc_mqinfo
  • To display infos about partitions:
ccc_mpinfo

File transfer from Occigen

One should use ccfr:

module load ccfr

A list of available machines is given by

ccfr_ssh -v

To log on to Occigen (from Irene):

ccfr_ssh occigenlogin@cines

To copy a file (typically a big tar file):

ccfr_cp occigenlogin@cines:remote_dir local_dir

This command will fail if you try to copy a directory, with a confusing error message such as:

rsync: write failed on "/ccc/work/...." Disk quota exceeded (122)

If copying over directories (via rsync), you should set the related permissions (the setgid bit "s" must be set for the group on Irene) using dedicated options:

rsync --chmod=Dg+s --chown=:gen10391 source_on_distant_machine target_on_Irene

Worth knowing about

  • The command wget is disabled on Irene, scripts using it will fail...
  • Only "https" is allowed. This is important for instance when downloading code (via svn, git). Moreover for svn checkouts one has to provide their svn username and password.
  • If you encounter a quota issue on Irene, first check whether your (or the group's) quota is indeed exceeded:
ccc_quota

If you get an error message about your quota when copying data over to Irene, it might be because the permissions on directories are not set correctly; please use the following command on your directory before sending the data to Irene:

chmod -R g+s NAME_OF_DIR
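Note that chmod -R g+s also sets the bit on regular files, where it is not needed. A possible directory-only variant, which mirrors what the Dg+s rsync option does, is sketched below; the function name set_setgid_dirs is hypothetical.

```shell
# Hypothetical helper: set the setgid bit on a directory and all of its
# sub-directories, leaving regular files untouched.
set_setgid_dirs() {
    local dir="$1"
    find "$dir" -type d -exec chmod g+s {} +
}
```

You can check the result with ls -ld NAME_OF_DIR: the group permissions should show an "s" (e.g. drwxr-sr-x).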


Are you being disconnected when inactive?

If you are regularly disconnected after a short period of inactivity on the supercomputer, adding these few lines to the config file in the .ssh/ directory of the machine you log in from (e.g. ssh-out/ciclad) may help:

Host *
...
KeepAlive yes
TCPKeepAlive yes
ServerAliveInterval 15
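If you prefer to script this change, here is a minimal sketch that appends the settings only when no ServerAliveInterval line is present yet. The function name add_ssh_keepalive is hypothetical; the sketch writes only TCPKeepAlive and ServerAliveInterval, on the assumption that KeepAlive is an older name for TCPKeepAlive in OpenSSH.

```shell
# Hypothetical helper: append keep-alive settings to an ssh client
# config file, unless a ServerAliveInterval line is already there.
add_ssh_keepalive() {
    local config="${1:-$HOME/.ssh/config}"
    mkdir -p "$(dirname "$config")"
    touch "$config"
    if grep -qi '^[[:space:]]*ServerAliveInterval' "$config"; then
        return 0   # already configured, leave the file untouched
    fi
    cat >> "$config" <<'EOF'

Host *
    TCPKeepAlive yes
    ServerAliveInterval 15
EOF
    # ssh expects the config file not to be writable by others
    chmod 600 "$config"
}
```

Calling add_ssh_keepalive without argument edits ~/.ssh/config; calling it a second time changes nothing.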