Fall HPC Downtime (September 23 – September 30)
Beginning at 9:00 a.m. on Monday, September 23, the entire HPC cluster (including all head nodes, filesystems, almaak machines, and compute nodes) will be unavailable for our fall 2013 maintenance. We will work on the head nodes, filesystems, and Myrinet-based compute nodes first and expect to have them back online by 9:00 a.m. on Tuesday, September 24. The GPU/Infiniband-based cluster will have an extended downtime, until 9:00 a.m. on Monday, September 30, because of a hardware expansion. We will also be running stress tests to ensure system integrity.
During this downtime, we will be applying security and operating system patches to all file servers and firmware patches to various storage devices. We will also be updating the default symlinks for software installed under /usr/usc so that they point to newer versions. For example, the current default for Matlab is version 2011a; after the downtime, the default Matlab will be 2013a. A list of all updated default symlinks will be provided.
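If your scripts rely on a default symlink, it may help to check which version it resolves to before and after the downtime. The following is a minimal sketch of one way to do that in Python. The function name resolved_version and the path /usr/usc/matlab/default are hypothetical, used only for illustration; the actual directory layout under /usr/usc may differ.

```python
import os


def resolved_version(default_link):
    """Return the name of the directory that a default symlink points to.

    Returns None if the path does not exist at all.
    """
    if not os.path.lexists(default_link):
        return None
    # realpath follows the symlink chain to the final target directory.
    return os.path.basename(os.path.realpath(default_link))


if __name__ == "__main__":
    # Hypothetical path: the real layout under /usr/usc may differ.
    print(resolved_version("/usr/usc/matlab/default"))
```

If a job depends on a particular version, referring to that version's directory explicitly rather than to the default symlink should keep its behavior unchanged when the default moves.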
As for new hardware, we will be adding 208 HP SL230s non-GPU compute nodes to the Infiniband cluster. Each node will have two eight-core Intel E5-2665 2.4 GHz CPUs, 128 GB of memory, a 1 TB internal drive, and a 56.6 Gbps Infiniband connection. In addition, we will be unveiling a new 328 TB OrangeFS-based distributed filesystem called “/staging”. Use of “/staging” will be similar to the “/scratch” filesystem, but data on it will remain available before and after cluster jobs run. This filesystem will give users the ability to temporarily “stage” a very large dataset, run several different jobs on it, and later copy any relevant results to their project filesystem for long-term storage. The “/staging” filesystem will not be backed up and will be wiped during each downtime. Additional information regarding “/staging” will be provided soon.
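The stage, run, and copy-back pattern described above could be sketched in Python as follows. This is only an illustration under stated assumptions: the helper name copy_into and every path in the usage example are hypothetical, and the actual directory layout and usage policy for “/staging” have not been announced yet.

```python
import shutil
from pathlib import Path


def copy_into(src_dir, dest_root):
    """Copy the directory src_dir into dest_root and return the new path.

    Existing files with the same name at the destination are overwritten.
    Requires Python 3.8 or later for dirs_exist_ok.
    """
    src_dir = Path(src_dir)
    dest = Path(dest_root) / src_dir.name
    shutil.copytree(src_dir, dest, dirs_exist_ok=True)
    return dest


if __name__ == "__main__":
    # All paths below are hypothetical placeholders.
    dataset = Path("/path/to/project/big_dataset")
    if dataset.is_dir():
        # 1. Stage the dataset once; several jobs can then read it.
        staged = copy_into(dataset, "/staging/your_username")
        # 2. ... submit and run jobs that write to staged / "results" ...
        # 3. Copy results back to the project filesystem for long-term storage.
        if (staged / "results").is_dir():
            copy_into(staged / "results", "/path/to/project")
```

Because “/staging” will not be backed up and will be wiped during each downtime, the final copy-back step should be completed before the next scheduled maintenance.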
In addition, we will be enabling auto-logout after 20 minutes of inactivity on all of the head nodes. This year, we are working toward HIPAA compliance on the HPC cluster, and one requirement is that connections to the system must use an automatic logoff mechanism that takes effect after a predetermined time of inactivity. The auto-logout mechanism alone is not enough for HIPAA compliance, but it is one of the many steps necessary to make HPC compliant in the future.
If you have any questions or concerns about this downtime, please contact us at firstname.lastname@example.org.