Frequently Asked Questions about Using the HPC Cluster
- How do I log into the HPC cluster?
- How do I avoid getting logged out of HPC due to a bad Wi-Fi connection?
- What shell am I using? Can I use a different shell?
- How do I run jobs on the cluster?
- How can I tell when my job will run?
- How do I tell if my job is running on multiple cores?
- How do I create/edit a text file?
- How do I create a SLURM file?
- Can I use the local storage on a compute node?
- How do I specify which account to submit a job to?
- How do I report a problem with a job submission?
- How do I make my HPC program run faster?
- Why is my job stuck in the submission queue?
- How do I know which allocation I should use to submit a job if I am in multiple HPC allocations? How do I specify a certain allocation?
- I accidentally deleted a file, is it possible to recover it?
- Which file system should I store my project data in?
- How do I share my project data with another user?
- How do I check if I have enough disk space?
- I’m running out of space, how do I check the size of my files and directories?
General Cluster Questions
How do I log into the HPC cluster?
To log in to the Linux cluster resource, you will need to use ssh to access either hpc-login2 or hpc-login3.
These head nodes should only be used for editing and compiling programs; any computing should be done on the compute nodes. Computing jobs submitted to the head nodes may be terminated before they complete. To submit jobs to the compute nodes, use the SLURM resource manager.
How do I avoid getting logged out of HPC due to a bad Wi-Fi connection?
HPC will log you off a head node after 20 minutes of inactivity, but you may also be logged off because of an unstable Wi-Fi connection. Adding the lines below to ~/.ssh/config may help with an unstable connection. (You will need to create the config file if it doesn’t exist.)
Host *
    ServerAliveInterval 60
    ServerAliveCountMax 2
These lines tell your computer to send an “alive” signal every 60 seconds and to close the connection only after two consecutive signals go unanswered. This change must be made on your laptop or client computer. If you are connecting from a Windows computer, check the documentation of your ssh client to set the ‘KeepAlive’ interval.
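For those who prefer to script the change, the sketch below appends the same settings to an ssh config file only when no ServerAliveInterval entry exists yet. The function name add_keepalive and its optional path argument are illustrative assumptions, not part of any HPC tooling; run it on your laptop or client computer, not on the cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: add the keepalive settings to an ssh config file once.
# Usage: add_keepalive [path-to-config]   (defaults to ~/.ssh/config)
add_keepalive() {
    local config="${1:-$HOME/.ssh/config}"

    # Create the directory and file if they do not exist yet.
    mkdir -p "$(dirname "$config")"
    touch "$config"
    chmod 600 "$config"

    # Skip the append if a ServerAliveInterval entry is already present.
    if ! grep -q '^[[:space:]]*ServerAliveInterval' "$config"; then
        printf '\nHost *\n    ServerAliveInterval 60\n    ServerAliveCountMax 2\n' >> "$config"
    fi
}
```

Calling add_keepalive with no argument edits ~/.ssh/config. Because ssh uses the first value it finds for each option, appending the Host * block at the end leaves any earlier host-specific settings in effect.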
What shell am I using? Can I use a different shell?
The default shell for new accounts is bash. You can check what your current shell is by typing the command [user@hpc-login3 ~]$ echo $0.
If you’d like to change the shell you are using, type ‘bash’ or ‘csh’ at the prompt to temporarily use a different shell. If you’d like to permanently change your default shell, send an email to firstname.lastname@example.org.
Running Jobs on the Cluster
How do I run jobs on the cluster?
Jobs can be run on the cluster in batch mode or in interactive mode. Batch processing is performed remotely and without manual intervention. Interactive mode enables you to test your program and environment setup interactively using the salloc command. When your job is running interactively as expected, you should then submit it for batch processing. This is done by creating a simple text file, called a SLURM script, that specifies the cluster resources you need and the commands necessary to run your program.
For details and examples on how to run jobs, see the Running a Job on HPC using Slurm page.
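As a rough illustration of the batch workflow, the sketch below writes a small SLURM script to disk and checks it for shell syntax errors. The file name myjob.slurm matches the example used elsewhere on this page, but the program name my_program and the resource values are placeholders, so adjust them to your own job.

```shell
#!/usr/bin/env bash
# Write a minimal batch script. All resource values are placeholders.
cat > myjob.slurm <<'EOF'
#!/bin/bash
# Number of tasks (processes) to start.
#SBATCH --ntasks=8
# Wall-clock limit, hh:mm:ss.
#SBATCH --time=01:00:00
# Output file; %j expands to the job ID.
#SBATCH --output=myjob-%j.out

# Run from the directory the job was submitted from.
cd "$SLURM_SUBMIT_DIR"

# my_program is a hypothetical executable; replace it with your own.
srun ./my_program
EOF

# Check the script for shell syntax errors without running it.
bash -n myjob.slurm

# On a login node you would then submit it with:
#   sbatch myjob.slurm
```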
How can I tell when my job will run?
After submitting a job to the queue, you can use the command squeue -j <job_id> --start where <job_id> is the reference number the job scheduler uses to keep track of your job. The squeue command will give you an estimate based on historical usage and availability of resources. Please note that there is no way to know in advance what the exact wait time will be and the expected start time may change over time.
How can I tell if my job is running?
You can check the status of your job using the myqueue command or the squeue -u <username> command. If your job is running but you are still unsure whether your program is working, you can ssh into your compute nodes and use the command top to see what is running.
In general, we recommend that users request an interactive session to test out their jobs. This will give you immediate feedback if there are errors in your program or syntax. Once you are confident that your job can complete without your intervention, you are ready to submit a batch job using a SLURM script.
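If you would rather have a script wait for a job than check by hand, one possible approach is to poll squeue until the job no longer appears in the queue. The function name wait_for_job and the 30-second default interval are illustrative choices; -h (no header), -j (job ID), and -o '%T' (job state) are standard squeue options.

```shell
#!/usr/bin/env bash
# Hypothetical helper: block until a job has left the queue.
# Usage: wait_for_job <job_id> [poll_interval_seconds]
wait_for_job() {
    local job_id="$1" interval="${2:-30}"

    # squeue prints the job state while the job is pending or running,
    # and prints nothing on stdout once the job is no longer queued.
    while [ -n "$(squeue -h -j "$job_id" -o '%T' 2>/dev/null)" ]; do
        sleep "$interval"
    done
}

# Example (on a login node):
#   wait_for_job 767 && echo "job 767 has left the queue"
```

Keep the interval generous, since frequent polling adds load to the scheduler that all users share.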
How do I tell if my job is running on multiple cores?
You can check the resources your program is consuming using the ‘top’ process manager:
- Request an interactive compute node using the salloc command.
- Run your job.
- Open a second terminal window, ssh to your compute node, and run the top command. This will display the processes running on that node.
- The number of your job processes should match the number of tasks (ntasks) requested in your salloc command.
[ttrojan@hpc-login3 ~]$ salloc --ntasks=8
----------------------------------------
Begin SLURM Prolog Wed 21 Feb 2018 02:34:35 PM PST
Job ID:        767
Username:      ttrojan
Accountname:   lc_usc1
Name:          bash
Partition:     quick
Nodes:         hpc3264
TasksPerNode:  15(x2)
CPUSPerTask:   Default
TMPDIR:        /tmp/767.quick
Cluster:       uschpc
HSDA Account:  false
End SLURM Prolog
----------------------------------------
[ttrojan@hpc3264 ~]$ mpirun find
[ttrojan@hpc-login3 ~]$ ssh hpc3264
[ttrojan@hpc3264]$ top
top - 15:37:36 up 21:50, 1 user, load average: 0.00, 0.01, 0.05
Tasks: 285 total, 1 running, 284 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 65766384 total, 64225800 free, 970788 used, 569796 buff/cache
KiB Swap: 8388604 total, 8388604 free, 0 used. 64535076 avail Mem

  PID USER     PR  NI  VIRT   RES   SHR  S  %CPU  %MEM  TIME+    COMMAND
15191 ttrojan  20   0  139m   5684  1500 R   2.7   0.0  0:00.04  mpirun
15195 ttrojan  20   0  15344  996   768  S   1.3   0.0  0:00.08  find
15196 ttrojan  20   0  15344  996   768  S   1.3   0.0  0:00.08  find
15199 ttrojan  20   0  15344  996   768  S   1.3   0.0  0:00.08  find
15203 ttrojan  20   0  15344  996   768  S   1.3   0.0  0:00.08  find
15204 ttrojan  20   0  15344  996   768  S   1.3   0.0  0:00.08  find
15205 ttrojan  20   0  15344  996   768  S   1.3   0.0  0:00.08  find
15206 ttrojan  20   0  15344  996   768  S   1.3   0.0  0:00.08  find
15207 ttrojan  20   0  15344  996   768  S   1.3   0.0  0:00.08  find
If you see only one process, your job is using only one core.
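To avoid counting rows by eye, the output of top in batch mode can be piped through a small filter. The sketch below assumes the default top column layout shown above (user in the second column, command in the last); count_procs is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count processes owned by USER whose command is NAME.
# Reads `top -b -n 1` output on standard input.
count_procs() {
    local user="$1" name="$2"
    # Match rows whose second field is the user and last field is the command.
    awk -v user="$user" -v name="$name" \
        '$2 == user && $NF == name { n++ } END { print n + 0 }'
}

# On the compute node (top shortens long user names, so this suits short ones):
#   top -b -n 1 | count_procs "$USER" find
```

For the session above, the count for find should equal the --ntasks value passed to salloc.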
How do I create/edit a text file?
HPC supports the following UNIX editors: vim (vi), nano, and emacs. Nano is the editor we teach in our workshops because of its ease of use.
Additional information on UNIX editors can be found in the UNIX Overview section of the ITS website.
How do I create a SLURM file?
A SLURM file, or script, is a simple text file that contains your cluster resource requests and the commands necessary to run your program. See the Running a Job on the HPC Cluster page for instructions on how to create and use SLURM scripts.
Can I use the local storage on a compute node?
Yes, you can temporarily access the local disk space on a single node using the $TMPDIR environment variable in your job scripts. For multi-node jobs, you can use the /scratch file system as your working directory.
For more information, see the Temporary Disk Space page.
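A common pattern for node-local storage is to copy input into $TMPDIR, work there, and copy the results back before the job ends, since the temporary space is deleted at the end of the job. The sketch below follows that pattern; stage_and_run is a hypothetical name, and wc -l stands in for your real program.

```shell
#!/usr/bin/env bash
# Hypothetical helper: process INPUT on node-local disk, save output in OUTDIR.
# Usage: stage_and_run <input_file> <output_dir>
stage_and_run() {
    local input="$1" outdir="$2"
    # Fall back to /tmp when TMPDIR is unset, e.g. when testing off the cluster.
    local work="${TMPDIR:-/tmp}"
    local name
    name="$(basename "$input")"

    # 1. Stage the input onto the local disk.
    cp "$input" "$work/$name"

    # 2. Run the computation there (wc -l is a placeholder for a real program).
    ( cd "$work" && wc -l < "$name" > result.txt )

    # 3. Copy the result back to shared storage before the job ends.
    cp "$work/result.txt" "$outdir/"
}
```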
How do I specify which account to submit a job to?
If you are only part of a single account, there is no need to specify which account to use. If you are part of multiple accounts, the resource manager will consume the core-hours allocation of your default group unless you specify a different one. To avoid confusion, it is best to specify which group’s allocation to use in your SLURM script like so:
#SBATCH --account=<account_id>
where <account_id> is your account name of the form lc_xxx.
How do I report a problem with a job submission?
If a job submission results in an error, please send an email to email@example.com. Be sure to include the job ID, error message, and any additional information you can provide.
How do I make my HPC program run faster?
Each program is unique so it is hard to give advice that will be useful for every situation. For some suggestions that you can use as a starting point, see Optimizing Your HPC Programs.
If you need more detailed help, request a consultation with our research computing facilitators by emailing firstname.lastname@example.org.
Why is my job stuck in the submission queue?
When running jobs on HPC, you may experience jobs that get stuck in the submission queue. For information on why that may be happening and how to get your jobs “unstuck,” see the HPC Jobs Stuck in the Queue page.
How do I know which allocation I should use to submit a job if I am in multiple HPC allocations? How do I specify a certain allocation?
To see a listing of all of your available accounts and your current core hour allocations in these accounts, use the following command:
The default HPC allocation is used to run a job when no allocation is specified in the salloc, srun and sbatch commands.
You can override the default account by using the -A (--account) option:
sbatch --account=<account_id> myjob.slurm
For further details on salloc, srun, and sbatch, please read the official man pages available by typing the following on any HPC login node:
$ man salloc
$ man srun
$ man sbatch
Data Files and Disk Space
I accidentally deleted a file, is it possible to recover it?
Your project and home directories are backed up every day as well as once a week. Daily backups are saved for up to a week, while weekly backups are saved for up to a month. In order to be a candidate for archiving, files must be closed and idle for at least 20 minutes. If you know the name and path of the file you deleted, we can search your backup directory and attempt to retrieve it. We’re more likely to recover your file from a daily backup than a weekly one, so contact us as soon as possible.
Which file system should I store my project data in?
HPC has several different file systems, as summarized in the table below.
| Name | Path | Amount of Space | Backed up? | Purpose |
|---|---|---|---|---|
| Home | ~, /home/rcf-40/<user> | 1GB | Yes | Configuration files, personal scripts |
| Project | /home/rcf-proj/<proj_name> | Up to 5TB, shared amongst group members | Yes | Medium-term data storage while running HPC jobs |
| Staging | /staging/<proj_name> | 10TB per account | No | Short-term, high-performance data storage |
| Temporary | $TMPDIR (single node), $SCRATCHDIR (multi-node) | Varies, depends on resources requested | No, deleted at end of job | Short-term (per job) high-performance data storage. Not shared with other researchers |
The home, project, and staging file systems are shared, which means that your usage can impact and be impacted by the activities of other users.
For more details on the staging and local scratch file systems, see the Temporary Disk Space page.
How do I share my project data with another user?
If the user is already a member of your project, the project PI can create a shared folder in the top level directory of the project and set its permissions to be writable by the group by using the commands below. NOTE: Only the project PI can create a directory.
$ mkdir shared
$ chmod g+wxs shared
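The s in g+wxs sets the setgid bit on the directory, so files created inside it inherit the directory's group rather than the creator's primary group. A minimal sketch that creates the folder and prints its resulting permissions follows; make_shared_dir is a hypothetical name, and stat -c assumes GNU coreutils, as found on most Linux systems.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create a group-writable, setgid directory and show it.
# Usage: make_shared_dir <directory>
make_shared_dir() {
    local dir="$1"
    mkdir -p "$dir"

    # g+w: group may create and delete files; g+x: group may enter the
    # directory; g+s: new files inherit the directory's group.
    chmod g+wxs "$dir"

    # Print permissions, owning group, and name for a quick visual check.
    stat -c '%A %G %n' "$dir"
}
```

With a typical umask, the group portion of the printed permission string should read rws once the change has taken effect.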
If you would like to consistently share data with a user who is not in your group, it is best to either have the PI add them to your project group or apply for a second account together. If this is not possible and you still need to share storage, send an email to email@example.com to explore other options.
How do I check if I have enough disk space?
Before you submit a large job or install new software, you should check that you have sufficient disk space. Your home directory has limited space so you should use your project directory for research and software installation.
To check your quota, use the myquota command. Compare the results of both the Files Used and Files Soft sections and Bytes Used and Bytes Soft sections for each directory. If the value of Used is close to the value of Soft in either case, you will need to delete files or request an increase in disk space from the account application site.
$ myquota
---------------------------------------------
Disk Quota for /home/rcf-40/ttrojan
          Used      Soft      Hard
Files     1897      100000    101000
Bytes     651.15M   1.00G     1.00G
---------------------------------------------
Disk Quota for /home/rcf-proj2/tt1
          Used      Soft      Hard
Files     273680    1000000   1001000
Bytes     55.98G    1.00T     1.02T
---------------------------------------------
I’m running out of space, how do I check the size of my files and directories?
To check your disk usage, use the du command.
To list the largest files and directories in the current directory, use the following command:
$ du -s * | sort -nr | head -n10
- du -s *: Summarizes the disk usage of each file and directory in the current directory
- sort -nr: Sorts numerically, in reverse order
- head -n10: Shows the first ten lines of the sorted output
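To list the 10 largest directories at any depth beneath the current directory, use the following command (du reports directories only unless it is given the -a option):
$ du . | sort -nr | head -n10
- du .: Shows the disk usage of the current directory and of every subdirectory beneath it
- sort -nr: Sorts numerically, in reverse order
- head -n10: Shows the first ten lines of the sorted output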
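To rank individual files rather than directories, one option is to combine find with du. The sketch below is one way to do it, assuming a hypothetical helper name largest_files; sizes are reported in KiB.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the largest regular files under a directory.
# Usage: largest_files [directory] [count]   (defaults: current directory, 10)
largest_files() {
    local dir="${1:-.}" count="${2:-10}"
    # find selects regular files only; du -k reports each one's size in KiB.
    find "$dir" -type f -exec du -k {} + | sort -nr | head -n "$count"
}

# Example:
#   largest_files /home/rcf-proj/<proj_name> 10
```

Unlike the * glob, find also picks up hidden files, which can account for a surprising share of a quota.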
To see all other options for ‘du’, use the following command:
$ man du