12.4 Distributed Resource Management
If you're running a
lot of BLAST jobs, one problem to consider is how to manage them to
minimize idle time without overloading your computers. Being
organized is the simplest way to schedule jobs. If
you're the only user, you can use simple scripts to
iterate over the various searches and keep your computer comfortably
busy. The problem starts when you add multiple users. In a small
group, it's possible for users to cooperate with one
another without adding extra software. Sending an email saying
"hey, stay off blast-server5 until I say so" works surprisingly
well. But if you have a large
group or irresponsible users, you'll want some kind
of distributed resource management (DRM) software.
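For the single-user case described above, a simple script that iterates over the searches while capping how many run at once might look like the following minimal sketch. It assumes the legacy NCBI `blastall` command line is installed and on the path. The function names, the `.out` output naming, and the use of exit code 127 for a missing executable are illustrative choices, not part of BLAST or any DRM.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_blast_command(query, database, program="blastn"):
    """Build a legacy NCBI blastall command line for one query file.

    The output file name (query + ".out") is a hypothetical convention.
    """
    return ["blastall", "-p", program, "-d", database,
            "-i", query, "-o", query + ".out"]


def run_jobs(commands, max_workers=2):
    """Run each command, at most max_workers at a time.

    Returns the exit codes in the same order as the commands.
    """
    def run_one(cmd):
        try:
            return subprocess.run(cmd).returncode
        except FileNotFoundError:
            # Assumption: report a missing executable the way a shell
            # would (127) instead of aborting the whole batch.
            return 127

    # Threads are enough here: each worker only waits on a child process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, commands))
```

As a usage example, with hypothetical query files `query1.fa` through `query3.fa` and a database named `nt`, calling `run_jobs([build_blast_command(q, "nt") for q in queries], max_workers=2)` keeps two searches running until the list is exhausted. Setting `max_workers` to roughly the number of CPUs you are willing to dedicate is one way to keep the machine busy without overloading it. This only helps a lone user; it does nothing to coordinate with other people on the same machine.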
There are a
number of DRM software packages, both free and commercial. But even
the free ones will cost you time to install and maintain, and users
need training to use the system. Table 12-3 lists
some of the most popular packages in the bioinformatics community.
Condor is an established DRM that is downloadable for free. It is
unusual in that it supports both Windows and Unix. LSF is a mature
product with many bioinformatics users. It is expensive, but for large
groups its robustness makes the cost justifiable. Parasol
is purpose-built for the UCSC kilocluster and trades some
generality for increased performance. PBS and ProPBS are popular
DRMs, and if you're an academic user, you can get
ProPBS for free. SGE is a relative newcomer but has a strong
following, partly because it's an open
source project.
Table 12-3. DRM software

| Package | Description | URL |
|---|---|---|
| Condor | Condor is a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, Condor provides a job-queuing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Users submit their serial or parallel jobs to Condor; Condor then places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion. | http://www.cs.wisc.edu/condor |
| LSF | Platform LSF 5 is built on a grid-enabled, robust architecture for open, scalable, and modular environments. Platform LSF 5 is engineered for enterprise deployment. It provides unlimited scalability with support for over 100 clusters, more than 200,000 CPUs, and 500,000 active jobs. With more than 250,000 licenses spanning 1,500 customer sites, Platform LSF 5 has industrial-strength reliability to process mission-critical jobs reliably and on time. A web-based interface puts the convenience and simplicity of global access to resources into the hands of your administrators and users. Platform LSF 5, with its open, plug-in architecture, seamlessly integrates with third-party applications and heterogeneous technology platforms. | http://www.platform.com |
| Parasol | Parasol provides a convenient way for multiple users to run large batches of jobs on computer clusters of up to thousands of CPUs. Parasol was developed initially by Jim Kent, and extended by other members of the Genome Bioinformatics Group at the University of California Santa Cruz. Parasol is currently a fairly minimal system, but what it does, it does well. It can start up 500 jobs per second. It restarts jobs in response to the inevitable systems failures that occur on large clusters. If some of your jobs die because of your program bugs, Parasol can also help manage restarting the crashed jobs after you fix your program. | http://www.soe.ucsc.edu/~donnak/eng/parasol.htm |
| PBS | The Portable Batch System (PBS) is a flexible batch queuing and workload management system originally developed by Veridian Systems for NASA. It operates on networked, multiplatform UNIX environments, including heterogeneous clusters of workstations, supercomputers, and massively parallel systems. Development of PBS is provided by the PBS Products Department of Veridian Systems. | http://www.openpbs.org |
| ProPBS | The PBS Pro Version 5.2 workload management solution is the professional version of the Portable Batch System. Built on the success of OpenPBS, PBS Pro goes well beyond it with the features and support you expect in a mission-critical commercial product, such as: shrink-wrapped, easy-to-install binary distributions; support on every major version of Unix and Linux; enhanced fault tolerance and scalability; enhanced scheduling algorithms; computational grid support; direct support from the team that created PBS; new, rewritten documentation; source code availability. | http://www.propbs.com |
| SGE | The Grid Engine project is an open source community effort to facilitate the adoption of distributed computing solutions. Sponsored by Sun Microsystems and hosted by CollabNet, the Grid Engine project provides enabling distributed resource management software for wide-ranging requirements from compute farms to grid computing. | http://gridengine.sunsource.net |