Frequently Asked Questions about Using the HPC Cluster

General Cluster Questions

Running Jobs on the Cluster

Data Files and Disk Space

Additional Questions?

General Cluster Questions

How do I log into the HPC cluster?

To log in to the Linux cluster, use ssh to connect to either the hpc-login2 or hpc-login3 head node.
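For example, from a terminal on a Linux or Mac machine (ttrojan is a placeholder username, and the fully qualified host name is assumed to follow the usual <node>.usc.edu pattern):

ssh ttrojan@hpc-login3.usc.edu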

These head nodes should only be used for editing and compiling programs; any computing should be done on the compute nodes. Computing jobs run on the head nodes may be terminated before they complete. To submit jobs to the compute nodes, use the SLURM resource manager.

How do I avoid getting logged out of HPC due to a bad Wi-Fi connection?

HPC will log you off a head node after 20 minutes of inactivity, but sometimes you are logged off because of an unstable Wi-Fi connection. Adding the lines below to ~/.ssh/config may help with an unstable connection. (You will need to create the config file if one doesn’t exist.)

Host *
  ServerAliveInterval 60
  ServerAliveCountMax 2

These settings tell your ssh client to send a keepalive message every 60 seconds of inactivity and to allow up to two unanswered messages before the connection is closed. This change must be made on your laptop or client computer. If you are connecting from a Windows computer, check the documentation of your ssh client for its ‘keep alive’ setting.

What shell am I using? Can I use a different shell?

The default shell for new accounts is bash. You can check your current shell by typing echo $0 at the prompt.
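For example (the output reflects your own default shell; -bash indicates a bash login shell):

[user@hpc-login3 ~]$ echo $0
-bash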

To temporarily switch shells, type bash or csh at the prompt to start a new shell session; type exit to return to your previous shell. If you’d like to permanently change your default shell, send email to hpc@usc.edu.

Running Jobs on the Cluster

How do I run jobs on the cluster?

Jobs can be run on the cluster in batch mode or in interactive mode. Batch processing is performed remotely and without manual intervention. Interactive mode, started with the salloc command, lets you test your program and environment setup in real time. Once your job runs as expected interactively, you should submit it for batch processing. This is done by creating a simple text file, called a SLURM script, that specifies the cluster resources you need and the commands necessary to run your program.

For details and examples on how to run jobs, see the Running a Job on HPC using Slurm page.
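As a quick illustration, a minimal SLURM script might look like the sketch below. The task count, time limit, account name, and program name are placeholders; adjust them for your own job.

#!/bin/bash
#SBATCH --ntasks=8          # number of tasks to run
#SBATCH --time=01:00:00     # walltime limit (hh:mm:ss)
#SBATCH --account=lc_usc1   # example account name of the form lc_xxx

cd "$SLURM_SUBMIT_DIR"      # start in the directory the job was submitted from
srun ./my_program           # my_program stands in for your own executable

Save the script as, for example, myjob.slurm and submit it with sbatch myjob.slurm.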

How can I tell when my job will run?

After submitting a job to the queue, you can use the command squeue -j <job_id> --start, where <job_id> is the reference number the job scheduler uses to keep track of your job. The squeue command will give you an estimate based on historical usage and availability of resources. Please note that there is no way to know the exact wait time in advance, and the expected start time may change over time.
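For example, for a job with ID 767 (a hypothetical job ID):

[ttrojan@hpc-login3 ~]$ squeue -j 767 --start

The estimated start time appears in the START_TIME column of the output.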

How can I tell if my job is running?

You can check the status of your job using the myqueue command or the squeue -u <username> command. If your job is running but you are still unsure whether your program is working, you can ssh into your compute nodes and use the top command to see what is running.
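For example (ttrojan is a placeholder username):

[ttrojan@hpc-login3 ~]$ squeue -u ttrojan

In the resulting list, an R in the ST (state) column means the job is running, while PD means it is still pending in the queue.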

In general, we recommend that users request an interactive session to test out their jobs. This will give you immediate feedback if there are errors in your program or syntax. Once you are confident that your job can complete without your intervention, you are ready to submit a batch job using a SLURM script.

How do I tell if my job is running on multiple cores?

You can check the resources your program is consuming using the ‘top’ process manager:

    1. Request an interactive compute node using the salloc command:

       [ttrojan@hpc-login3 ~]$ salloc --ntasks=8
       ----------------------------------------
       Begin SLURM Prolog Wed 21 Feb 2018 02:34:35 PM PST
       Job ID:        767
       Username:      ttrojan
       Accountname:   lc_usc1
       Name:          bash
       Partition:     quick
       Nodes:         hpc3264
       TasksPerNode:  15(x2)
       CPUSPerTask:   Default[1]
       TMPDIR:        /tmp/767.quick
       Cluster:       uschpc
       HSDA Account:  false
       End SLURM Prolog
       ----------------------------------------

    2. Run your job:

       [ttrojan@hpc3264 ~]$ mpirun find

    3. Open a second terminal window, ssh to your compute node, and run the top command. This will display the processes running on that node:

       [ttrojan@hpc-login3 ~]$ ssh hpc3264
       [ttrojan@hpc3264 ~]$ top
       top - 15:37:36 up 21:50,  1 user,  load average: 0.00, 0.01, 0.05
       Tasks: 285 total,   1 running, 284 sleeping,   0 stopped,   0 zombie
       %Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
       KiB Mem : 65766384 total, 64225800 free,   970788 used,   569796 buff/cache
       KiB Swap:  8388604 total,  8388604 free,        0 used. 64535076 avail Mem

         PID USER         PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
        15191 ttrojan     20   0    139m   5684   1500 R   2.7  0.0   0:00.04 mpirun
        15195 ttrojan     20   0   15344    996    768 S   1.3  0.0   0:00.08 find
        15196 ttrojan     20   0   15344    996    768 S   1.3  0.0   0:00.08 find
        15199 ttrojan     20   0   15344    996    768 S   1.3  0.0   0:00.08 find
        15203 ttrojan     20   0   15344    996    768 S   1.3  0.0   0:00.08 find
        15204 ttrojan     20   0   15344    996    768 S   1.3  0.0   0:00.08 find
        15205 ttrojan     20   0   15344    996    768 S   1.3  0.0   0:00.08 find
        15206 ttrojan     20   0   15344    996    768 S   1.3  0.0   0:00.08 find
        15207 ttrojan     20   0   15344    996    768 S   1.3  0.0   0:00.08 find

    4. The number of your job processes should match the number of tasks (ntasks) requested in your salloc command.

If you see only one process, your job is using only one core.

How do I create/edit a text file?

HPC supports the following UNIX editors: vim (vi), nano, and emacs. Nano is the editor we teach in our workshops because of its ease of use.
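For example, to create or edit a SLURM script with nano (myjob.slurm is a placeholder file name; nano creates the file if it does not already exist):

[ttrojan@hpc-login3 ~]$ nano myjob.slurm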

Additional information on UNIX editors can be found in the UNIX Overview section of the ITS website.

How do I create a SLURM file?

A SLURM file, or script, is a simple text file that contains your cluster resource requests and the commands necessary to run your program. See the Running a Job on the HPC Cluster page for instructions on how to create and use SLURM scripts.

Can I use the local storage on a compute node?

Yes, you can temporarily use the local disk space on a single node through the $TMPDIR environment variable in your job scripts. For multi-node jobs, you can use the /scratch file system as your working directory instead.
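For example, a single-node job script might stage data through $TMPDIR like this (the file and program names are placeholders):

cp input.dat "$TMPDIR"/                # copy input to node-local disk
cd "$TMPDIR"
./my_program input.dat > results.out   # run against the local copy
cp results.out "$SLURM_SUBMIT_DIR"/    # copy results back before the job ends and $TMPDIR is removed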

For more information, see the Temporary Disk Space page.

How do I specify which account to submit a job to?

If you are part of only a single account, there is no need to specify which account to use. If you are part of multiple accounts, the resource manager will consume the core-hours allocation of your default group unless you specify a different one. To avoid confusion, it is best to specify which group’s allocation to use in your SLURM script like so:

#SBATCH --account=<account_id>

where <account_id> is your account name, of the form lc_xxx.

How do I report a problem with a job submission?

If a job submission results in an error, please send an email to hpc@usc.edu. Be sure to include the job ID, error message, and any additional information you can provide.

How do I make my HPC program run faster?

Each program is unique so it is hard to give advice that will be useful for every situation. For some suggestions that you can use as a starting point, see Optimizing Your HPC Programs.

If you need more detailed help, request a consultation with our research computing facilitators by emailing hpc@usc.edu.

Why is my job stuck in the submission queue?

When running jobs on HPC, you may experience jobs that get stuck in the submission queue. For information on why that may be happening and how to get your jobs “unstuck,” see the HPC Jobs Stuck in the Queue page.

How do I know which allocation I should use to submit a job if I am in multiple HPC allocations? How do I specify a certain allocation?

To see a listing of all of your available accounts and your current core hour allocations in these accounts, use the following command:

mybalance -h

The default HPC allocation is used when no account is specified in the salloc, srun, or sbatch command.

You can override the default account by using the --account (or -A) option:

sbatch --account=<account_id> myjob.slurm
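The same option works with salloc and srun; for example (lc_usc1 is an example account name and my_program a placeholder executable):

salloc --account=lc_usc1 --ntasks=8
srun --account=lc_usc1 ./my_program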

For further details on salloc, srun, and sbatch, please read the official man pages available by typing the following on any HPC login node:

man salloc
man srun
man sbatch

Data Files and Disk Space

I accidentally deleted a file, is it possible to recover it?

Your project and home directories are backed up every day as well as once a week. Daily backups are saved for up to a week, while weekly backups are saved for up to a month. To be included in a backup, files must be closed and idle for at least 20 minutes. If you know the name and path of the file you deleted, we can search your backup directory and attempt to retrieve it. We’re more likely to recover your file from a daily backup than a weekly one, so contact us as soon as possible.

Which file system should I store my project data in?

HPC has several different file systems, summarized below.

Home
  Path: ~ (/home/rcf-40/<user>)
  Space: 1 GB
  Backed up? Yes
  Purpose: Configuration files, personal scripts

Project
  Path: /home/rcf-proj/<proj_name>
  Space: Up to 5 TB, shared among group members
  Backed up? Yes
  Purpose: Medium-term data storage while running HPC jobs

Staging
  Path: /staging/<proj_name>
  Space: 10 TB per account
  Backed up? No
  Purpose: Short-term, high-performance data storage

Temporary
  Path: $TMPDIR (single node), $SCRATCHDIR (multi-node)
  Space: Varies, depends on resources requested
  Backed up? No; deleted at the end of the job
  Purpose: Short-term (per-job), high-performance data storage, not shared with other researchers

The home, project, and staging file systems are shared, which means that your usage can impact and be impacted by the activities of other users.

For more details on the staging and local scratch file systems, see the Temporary Disk Space page.

How do I share my project data with another user?

If the user is already a member of your project, the project PI can create a shared folder in the top-level directory of the project and make it writable by the group using the commands below. NOTE: Only the project PI can create directories at the top level of the project directory.

$ mkdir shared          # create the shared directory (run by the project PI)
$ chmod g+wxs shared    # give the group write and execute permission and set the setgid bit so new files inherit the group
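Members of the project group can then copy data into that folder; for example (the file name is a placeholder):

$ cp results.csv /home/rcf-proj/<proj_name>/shared/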

If you would like to consistently share data with a user who is not in your group, it is best to either have the PI add them to your project group or apply for a second account together. If this is not possible and you still need to share storage, send an email to hpc@usc.edu to explore other options.

How do I check if I have enough disk space?

Before you submit a large job or install new software, you should check that you have sufficient disk space. Your home directory has limited space so you should use your project directory for research and software installation.

To check your quota, use the myquota command. For each directory, compare the Used and Soft values in both the Files and Bytes rows. If Used is close to Soft in either row, you will need to delete files or request an increase in disk space from the account application site.

$ myquota
---------------------------------------------
Disk Quota for /home/rcf-40/ttrojan 
            Used     Soft    Hard
     Files  1897     100000  101000
     Bytes  651.15M  1.00G   1.00G
---------------------------------------------
Disk Quota for /home/rcf-proj2/tt1 
            Used    Soft     Hard
     Files  273680  1000000  1001000
     Bytes  55.98G  1.00T    1.02T
---------------------------------------------

I’m running out of space, how do I check the size of my files and directories?

To check your disk usage, use the du command.

To list the largest files and directories located directly in the current directory, use the following command:

$ du -s * | sort -nr | head -n10

  • du -s *: Summarizes the disk usage of each file and directory in the current directory
  • sort -nr: Sorts numerically, in reverse order
  • head -n10: Shows only the first ten lines of output

To list the ten largest directories anywhere beneath the current directory, including nested subdirectories, use the following command:

$ du . | sort -nr | head -n10

  • du .: Recursively reports the disk usage of the current directory and every directory beneath it
  • sort -nr: Sorts numerically, in reverse order
  • head -n10: Shows only the first ten lines of output
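To include individual files in the recursive listing as well, add du’s -a option:

$ du -a . | sort -nr | head -n10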

To see all other options for ‘du’, use the following command:

$ man du

Additional Questions?

For questions regarding HPC accounts, software on HPC, or education and training on HPC resources, click one of the following links: