Research Computing, a division of Information Technology, has been established to promote the availability of high performance computing resources essential to effective research at the University of South Florida. Research Computing supports software tools, high performance computer hardware, and training for both faculty and students.

Our resources are freely available to faculty and students involved in research projects. We ask that you kindly acknowledge us in all publications for which our resources have been useful. Our preferred acknowledgment statement is available here. We also ask that you send an email to publications@rc.usf.edu to let us know of publications or grant requests in which our facilities are mentioned. We would also appreciate being kept informed of any grant requests that are fulfilled.

System Updates and Changes, Thurs. Dec. 15th - Fri. Dec. 16th

Following up on some work that was originally planned for October, but was postponed to allow for more testing, we will be performing several system changes and updates.

On Thursday, December 15th, at 9am, we will be taking the system down in order to do the following:

  1. Re-base and re-install all compute resources to Scientific Linux 6.  This brings with it the following benefits:
    • Numerous bug fixes and performance improvements
    • Updated InfiniBand stack for additional performance improvements
    • cgroup support, which lets administrators better isolate applications from one another and allocate system resources to users more efficiently
  2. Re-base and re-install login resources to Red Hat Enterprise Linux 6.  The same benefits that apply to the computational resources will apply here as well.
  3. Point all systems to new server locations for
    • /home
    • /opt
    • /home/shares

This change will involve

    1. Migrating those paths from their current NFS mounts to new GlusterFS mount points
    2. Moving /home/shares to /shares
    3. Moving /opt/apps to /apps
    4. Corresponding changes to module files so that, hopefully, most users don’t even notice the changes

The reasons for these changes are as follows:
    • A GlusterFS configuration for /home, /apps, and /shares will be significantly less complex and more scalable than the current NFS/DRBD configuration, resulting in fewer outages, easier supportability for our staff, and enhanced performance
    • Providing a seamless, single view of the filesystems for our Tampa resources and our soon-to-be-released Winter Haven resources requires a different approach to making files available, which necessitates this change
    • The mount-point location changes (/home/shares -> /shares, /opt/apps -> /apps) also simplify configuration and are more in line with how other centers configure their systems
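
For users with hard-coded paths in job scripts, the relocations listed above amount to a prefix substitution. The following is a minimal Python sketch of that mapping. The function name translate_path and the example paths are hypothetical and are not part of any Research Computing tooling. Plain /home paths are left unchanged, since only their backing storage moves.

```python
# Hypothetical helper: maps pre-migration paths to their new locations.
# Only the two relocations named in the announcement are covered.
RELOCATIONS = [
    ("/home/shares", "/shares"),
    ("/opt/apps", "/apps"),
]


def translate_path(path: str) -> str:
    """Return the post-migration equivalent of an old absolute path."""
    for old, new in RELOCATIONS:
        # Match the prefix only at a path-component boundary, so that
        # a path such as /opt/appsfoo is not rewritten by mistake.
        if path == old or path.startswith(old + "/"):
            return new + path[len(old):]
    return path  # e.g. a user's own /home directory keeps its name


if __name__ == "__main__":
    for example in ("/home/shares/lab/data.csv", "/opt/apps/gcc", "/home/jdoe"):
        print(example, "->", translate_path(example))
```

The updated module files are intended to make this kind of manual rewriting unnecessary for most users; the sketch only illustrates the mapping itself.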

The system should be available for use by end of day Friday, after which we will be working to make a new set of computational resources available over the following week.  Please let us know if you have any questions or concerns regarding this change.

System Outage Update Dec. 3rd, 2011

As of 11:30pm, access to the login nodes has been re-enabled: we have successfully pulled a daily snapshot and can tolerate some additional load on the storage system while the array continues to rebuild.  The rebuild should be around 50% complete, so we estimate that it will finish sometime tomorrow.  Job submission remains disabled until the rebuild completes.

System off-line, Dec. 3rd, 2011

Due to the failure of two disk drives in the storage system for /home and various other directories, we will be disabling access to the queue until the array rebuild completes.  Any jobs that were running from the /home directory have been deleted. Jobs running from /work have been left to complete as they are on other storage.  Queued jobs are still waiting in the queue and should be left until the drive rebuild process completes.

The odds of another drive failure during a rebuild are generally high enough that we use storage technology that can sustain the loss of two drives on a single volume.  This normally provides breathing room for a rebuild to complete.  In our case, two drives were lost simultaneously, so that margin is gone and we face an unacceptable risk that a third drive may fail during the rebuild.  If a third drive fails, we will need to rebuild the entire array and restore data from a backup, a process that may take several days.
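
To give a rough sense of the risk described above, the chance of at least one further failure during a rebuild can be estimated under simplifying assumptions: drives fail independently and at a constant rate. The sketch below is illustrative only. The drive count, annualized failure rate, and rebuild durations are made-up inputs, not measurements from our array, and drives from the same batch under rebuild load may fail at higher, correlated rates than this model suggests.

```python
import math

HOURS_PER_YEAR = 24 * 365


def rebuild_failure_probability(surviving_drives: int,
                                annual_failure_rate: float,
                                rebuild_hours: float) -> float:
    """Probability that at least one surviving drive fails during the rebuild.

    Assumes independent drives with a constant (exponential) failure rate,
    which is a simplification.
    """
    # Convert the annualized failure probability to a per-hour hazard rate.
    hourly_rate = -math.log(1.0 - annual_failure_rate) / HOURS_PER_YEAR
    # Every surviving drive must last the whole rebuild window.
    survival = math.exp(-surviving_drives * hourly_rate * rebuild_hours)
    return 1.0 - survival


if __name__ == "__main__":
    # Hypothetical inputs: 10 surviving drives, 5% annualized failure rate.
    for hours in (24, 48):
        p = rebuild_failure_probability(10, 0.05, hours)
        print(f"{hours} h rebuild: {p:.4%} chance of another failure")
```

Whatever inputs are used, the estimate grows with the length of the rebuild window, so a shorter rebuild means less exposure.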

To expedite the rebuild, we have decided to reduce disk activity as much as possible, which lowers the risk of another drive failure occurring before it completes.  This is why running jobs on /home have been deleted and access to the queues has been disabled.  We will also be disabling access to the login nodes so that users do not kick off any disk-intensive workloads.

We apologize for the inconvenience and are taking steps to ensure that the rebuild process proceeds as quickly and as safely as possible.

Be advised that Research Computing keeps at least three copies of all data on /home.  The disk drive failures pose very little risk in terms of permanent data loss.  All data is backed up nightly to a secondary storage array and from there to a tape backup system.  Data is also replicated to a secondary array in our secondary datacenter.

Thursday, Nov. 3rd Maintenance

Starting at 5:00pm on Thursday, Nov. 3rd, we’ll be performing some system changes to prepare for our storage migration to GlusterFS.  This was put off from last month in order to do some more stress testing and verification.  This window should last no more than 2 hours.  Only intermittent connectivity issues are expected and jobs should not be affected by the work.

System Changes on Thursday, Oct. 6th

On Thursday, October 6th, at 5pm, we will be taking the system down in order to point all systems to new server locations for

  • /home
  • /opt
  • /home/shares

This change will involve

  1. Migrating those paths from their current NFS mounts to new GlusterFS mount points
  2. Moving /home/shares to /shares
  3. Moving /opt/apps to /apps
  4. Corresponding changes to module files so that, hopefully, most users don’t even notice the changes

The reasons for these changes are as follows:
  • A GlusterFS configuration for /home, /apps, and /shares will be significantly less complex and more scalable than the current NFS/DRBD configuration, resulting in fewer outages, easier supportability for our staff, and enhanced performance
  • Providing a seamless, single view of the filesystems for our Tampa resources and our soon-to-be-released Winter Haven resources requires a different approach to making files available, which necessitates this change now so that we are ready for the new system deployment in October.
  • The mount-point location changes (/home/shares -> /shares, /opt/apps -> /apps) also simplify configuration and are more in line with how other centers configure their systems

The work will take 2 hours to complete, meaning that the system should be released for service around 7pm that same evening.

Please let us know if you have any questions or concerns regarding this change.
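
After a change like this, one way to confirm which filesystem backs a given path is to look at the mount table. The sketch below parses text in the format of Linux's /proc/mounts and reports the filesystem type for each mount point. It assumes the volumes are mounted with the native GlusterFS FUSE client, which typically appears as type fuse.glusterfs. The server and volume names in the sample are invented for illustration.

```python
def mount_types(mounts_text: str) -> dict:
    """Map each mount point to its filesystem type.

    Expects lines in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            # If a mount point appears twice, the later entry wins.
            result[fields[1]] = fields[2]
    return result


# Invented sample; on a real Linux node, read /proc/mounts instead.
SAMPLE = """\
gluster1:/home /home fuse.glusterfs rw,relatime 0 0
gluster1:/apps /apps fuse.glusterfs rw,relatime 0 0
gluster1:/shares /shares fuse.glusterfs rw,relatime 0 0
/dev/sda1 / ext4 rw,relatime 0 0
"""

if __name__ == "__main__":
    types = mount_types(SAMPLE)
    for point in ("/home", "/apps", "/shares"):
        print(point, "->", types.get(point, "not mounted"))
```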