System off-line, Dec. 3rd, 2011

Because two disk drives have failed in the storage system that serves /home and various other directories, we are disabling access to the queue until the array rebuild completes.  Any jobs that were running from /home have been deleted.  Jobs running from /work have been left to complete, since /work is on other storage.  Queued jobs are still waiting in the queue and should be left there until the rebuild completes.

The odds of another drive failing during a rebuild are high enough that we use storage technology that can sustain the loss of two drives on a single volume.  This normally provides breathing room for a rebuild to complete.  In this case, two drives were lost simultaneously, so that margin is gone and we face an unacceptably high risk that a third drive may fail during the rebuild.  If a third drive fails, we will need to rebuild the entire array and restore data from a backup, a process that may take several days.

To expedite the rebuild, we have decided to reduce disk activity as much as possible, which shortens the window in which another drive failure could occur.  This is why running jobs on /home have been deleted and access to the queues has been disabled.  We will also be disabling access to the login nodes so that users do not start any disk-intensive workloads.

We apologize for the inconvenience and are taking steps to ensure that the rebuild process proceeds as quickly and as safely as possible.

Be advised that research computing keeps at least three copies of all data on /home, so the disk drive failures pose very little risk of permanent data loss.  All data is backed up nightly to a secondary storage array and from there to a tape backup system.  Data is also replicated to a secondary array in our secondary datacenter.
