Over the last few weeks, system utilization has gone up considerably, but it is still not optimal. Load averages, CPU usage, and memory usage have all increased, yet a considerable share of the system's resources remains under-utilized. Despite this spare capacity, job wait times have increased greatly. What is the problem here?
One of the big issues is that users tend to avoid volatile queues (formerly known as low-priority queues) for a few reasons:
- They aren’t aware that they can use them (e.g. by specifying p_low=true or t_volatile=true)
- They do not want to risk having their job restarted
- Their job is not able to be restarted gracefully (i.e. checkpointed and re-launched)
The third point cannot be solved unless the code being used is re-implemented to support checkpointing. Checkpointing is the ability of an application to be stopped at some point during a simulation and restarted later without having to re-do the computations that were completed up to that point.
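To make the idea concrete, here is a toy sketch of checkpoint and restart written as a shell function. It is not how any particular Circe application implements checkpointing; the names CKPT_FILE and run_steps are made up for this illustration, and the "computation" is just a running sum.

```shell
#!/bin/bash
# Toy illustration of checkpoint/restart. CKPT_FILE and run_steps are
# hypothetical names for this sketch, not part of GridEngine or Circe.

CKPT_FILE="${CKPT_FILE:-./toy_sim.ckpt}"

# run_steps TOTAL [STOP_AT]
# Runs steps up to TOTAL, saving progress after every step.
# If STOP_AT is given, the run halts after that step to simulate pre-emption.
run_steps() {
    local total=$1 stop_at=${2:-0} step=0 sum=0

    # Resume from the checkpoint if an earlier run left one behind.
    if [ -f "$CKPT_FILE" ]; then
        read -r step sum < "$CKPT_FILE"
        echo "resuming after step $step"
    fi

    while [ "$step" -lt "$total" ]; do
        step=$((step + 1))
        sum=$((sum + step))    # stand-in for the real computation

        # Write to a temporary file and rename it, so an interruption
        # during the write cannot leave a partial checkpoint.
        echo "$step $sum" > "$CKPT_FILE.tmp"
        mv "$CKPT_FILE.tmp" "$CKPT_FILE"

        if [ "$step" -eq "$stop_at" ]; then
            echo "interrupted at step $step"
            return 1
        fi
    done

    rm -f "$CKPT_FILE"         # finished, so the checkpoint is no longer needed
    echo "done: sum=$sum"
}
```

Calling run_steps 10 4 stops after step 4 and leaves a checkpoint behind; a second call to run_steps 10 picks up from step 5 instead of starting over, which is the behavior a checkpointing application provides at much larger scale.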
After conducting an audit of our application base, we’ve found that a number of the applications used on Circe do in fact support checkpointing. We are building checkpoint environments within the GridEngine configuration for each of these applications and will make this a standard practice for application deployments from now on. Combined with volatile queues, checkpointing on Circe can greatly increase system utilization while also reducing job wait times and increasing overall throughput.
To better understand how this works, it is necessary to know the difference between a volatile resource and a non-volatile resource. On Circe, a non-volatile resource is a queue that, barring any hardware failures or node outages, is guaranteed to run a job to completion. Other users’ jobs are unable to affect the execution of your application, and once your job lands on a set of nodes, it will run on that set until it is finished (or until your maximum defined run-time is reached).
A volatile resource, on the other hand, is a queue where your job may be periodically restarted and moved to a different set of nodes. This can happen because the nodes your job was assigned to are actually owned by a particular resource group that wishes to use them at some point during your job’s execution. When that happens, your job is pre-empted and restarted on a different set of resources. Applications that support checkpointing can do this seamlessly and automatically restart from the last known point in execution. This allows a user to make much greater use of group-owned hardware, avoiding the need to wait in long lines for centrally-owned hardware. For long-running applications that do not support checkpointing, you’ll have to rely on non-volatile resources.
So, in theory, how would we do this? Once we’ve enabled a checkpointing environment for your application, you can do the following with a GridEngine submit script:
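# Job name
#$ -N myjobname
# Run from the submission directory
#$ -cwd
# Merge stderr into stdout and write both to output.<job id>
#$ -j y
#$ -o output.$JOB_ID
# Request 8 slots from any parallel environment matching ompi*
#$ -pe ompi* 8
# 40-hour run-time limit, volatile resources, ib resource
#$ -l h_rt=40:00:00,volatile=true,ib=true
# Use the application's checkpoint environment and mark the job as re-startable
#$ -ckpt myapplication.ckpt
#$ -r y
sge_mpirun myapplication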
The key settings here are volatile=true to enable use of volatile resources, -ckpt myapplication.ckpt to specify the use of a checkpoint environment, and -r y to tell the scheduler that this job is re-startable. This allows your job to use idle research group-owned resources (which tend to be very high-end hardware) and to recover gracefully if it is pre-empted by that group’s own jobs. The result is a much larger pool of hardware for users to run their jobs on, increasing their job throughput and overall productivity.
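For applications whose restart is not handled entirely by the checkpoint environment, a job script can check for itself whether it has been re-run. The sketch below relies on the RESTARTED variable, which GridEngine sets to 1 in the job environment when a job has been restarted. Everything else is an assumption made for illustration: the launch_mode helper, the state file name, and the --restart-from flag are hypothetical and will differ per application.

```shell
# launch_mode CKPT_FILE
# Prints "resume" when the scheduler has restarted this job and a non-empty
# checkpoint file exists; prints "fresh" otherwise.
# (launch_mode is a hypothetical helper, not a GridEngine command.)
launch_mode() {
    local ckpt_file=$1
    if [ "${RESTARTED:-0}" = "1" ] && [ -s "$ckpt_file" ]; then
        echo resume
    else
        echo fresh
    fi
}

# Possible use inside a submit script. The state file name and the
# --restart-from flag are invented for this example:
#
# if [ "$(launch_mode myapplication.state)" = "resume" ]; then
#     sge_mpirun myapplication --restart-from myapplication.state
# else
#     sge_mpirun myapplication
# fi
```

Checking for the checkpoint file as well as RESTARTED means a job that was pre-empted before it wrote its first checkpoint simply starts from the beginning.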
For applications that are integrated with the ‘run’ command, checkpointing will be enabled by default (if your application supports it), and your job will automatically use volatile resources. If you wish to override this default behavior and run on non-volatile resources, add the -F flag (it is important that this is capitalized) or --finish to your command line, e.g.:
[user@host ~]$ run -n 8 --finish -t 40:00:00 -f ib myapp/1.0
Checkpoint environments are not available on Circe just yet, but will be coming online within the next couple of weeks for preliminary testing. We’ll post updates here with more information as we progress. Check the documentation for your application at the Software Portal to see if it supports checkpointing. Be advised that this documentation may not reflect checkpointing support until we have enabled the checkpoint environment for your application; we will update each application’s page as we work through our list.

