Chapter 3. MRG Grid Benefits and Features

Benefits

MRG Grid provides significant benefits and value for enterprises, including:

Power

MRG Grid can process the largest computational workloads, from massively parallel High Performance Computing jobs to long-running High Throughput Computing jobs

Peak Workload Handling

MRG Grid can add on-demand computational power for handling peak loads, through capabilities ranging from cycle-stealing on Linux or Windows desktop computers to scheduling jobs on remote grids

Flexibility

MRG Grid provides complete flexibility: it runs both high-burst and lengthy computations, in a centralized or distributed grid, on various platforms including Linux and Windows. Furthermore, MRG Grid can schedule virtualized environments and workloads for the utmost flexibility in utilizing infrastructure.

Powerful Management Tools

Managing MRG Grid is simplified by leveraging the Red Hat Enterprise MRG unified, browser-based management console. The Red Hat Enterprise MRG integrated management tools enable administrators to manage, configure, provision, deploy, and monitor their grid deployments using the same tools they use for MRG Messaging and MRG Realtime.

Features

MRG Grid provides a broad set of features across both High Throughput Computing and High Performance Computing, including:

Virtualization

Allows for submission of a virtual machine (VM) as a user job, supporting migration of the VM

Dedicated and Undedicated Node Management (Cycle-Stealing)

Allows for dedicated resources (clusters) to be augmented with otherwise undedicated resources (desktops) using flexible policies

Multiple Standards-Based APIs

Web Service interface provides job submission and management functionality; CLI provides a highly scriptable interface, with consistent output, to all functionality

Security

Authentication using multiple mechanisms

Privacy provided by network encryption

Integrity of network traffic

Authorization through flexible configuration policies

Federated Grids/Clusters

A mechanism known as flocking allows independent pools to use each other's resources, controllable by customizable policies

Management Tools

Powerful browser-based management tools for managing daemons and machines, security, compute jobs, scalability settings, priorities, and more. Also provides sophisticated monitoring capabilities.

Workflow Management

The ability to specify job dependencies, via DAGMan, allows for construction and execution of complex workflows

The ability to schedule data placement, via Stork, assists in the creation of workflows that intelligently handle data
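
The idea behind job dependencies can be illustrated with a minimal Python sketch. This is not DAGMan or its input format; it only shows how a set of dependencies determines which jobs may run together. The job names and the function `execution_waves` are hypothetical, and the sketch assumes Python 3.9 or later for the standard `graphlib` module.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies):
    """Group jobs into waves; every job in a wave has all of its parents finished.

    `dependencies` maps each job name to the set of jobs it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose parents are all done
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as completed
    return waves


# Hypothetical diamond-shaped workflow: B and C need A, and D needs both B and C.
workflow = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
waves = execution_waves(workflow)
print(waves)
```

In this sketch, jobs in the same wave have no dependencies on one another, so a scheduler would be free to run them in parallel.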

Accounting

User and group resource utilization is tracked and accessible to administrators

ClassAds

A flexible language for policy and meta-data description

Policies

Flexible, customizable policies specified by jobs and resources via ClassAds
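
As a rough illustration of policies specified by both jobs and resources, the following Python sketch models each side as a dictionary of attributes plus a requirements function, and declares a match only when both sides accept each other. This is not ClassAd syntax, and every attribute name here (`memory_mb`, `keyboard_idle_s`, and so on) is a hypothetical stand-in chosen for the example.

```python
def matches(job, machine):
    """Two-sided match: the job must accept the machine and the machine the job."""
    return (job["requirements"](job, machine)
            and machine["requirements"](machine, job))


# Hypothetical job description: needs Linux and enough memory.
job = {
    "owner": "alice",
    "memory_needed_mb": 2048,
    "requirements": lambda my, target: (
        target["os"] == "LINUX" and target["memory_mb"] >= my["memory_needed_mb"]
    ),
}

# Hypothetical dedicated cluster node: accepts any job.
cluster_node = {
    "os": "LINUX",
    "memory_mb": 8192,
    "requirements": lambda my, target: True,
}

# Hypothetical desktop: only accepts jobs after 15 minutes without keyboard use.
desktop = {
    "os": "LINUX",
    "memory_mb": 4096,
    "keyboard_idle_s": 30,
    "requirements": lambda my, target: my["keyboard_idle_s"] >= 900,
}

print(matches(job, cluster_node))
print(matches(job, desktop))
```

The desktop's own policy, not the job's, is what rejects the match in the second case, which is the kind of resource-side control that cycle-stealing relies on.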

High Availability

The Negotiator and Collector, via HAD, and the Schedd, via Schedd Fail-over, can have their state replicated to allow for graceful fail-over upon service disruption

Disk Space Management

Through a multi-protocol storage management system called NeST, the ability to manage (allocate, free, reserve, etc.) disk space is exposed to a user's jobs

Database Support

All data about jobs and resources can be stored in a database via Quill

Compute On-Demand (COD)

The ability for a node or set of nodes to be claimed by a user in such a way that others may use the claimed nodes until the user needs them

Dynamic Pool Creation

Through a technology known as Glide-ins, nodes can be dynamically added to a pool to service user jobs

Priority Based Scheduling

Priority scheduling is performed at the granularity of a user

Fair-share scheduling can be performed on groups of users

Priority management is controllable by administrators
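
To make the notion of fair-share scheduling across groups concrete, here is a small Python sketch that divides a fixed number of slots among groups in proportion to administrator-assigned weights. It is an assumption-laden illustration, not the algorithm MRG Grid uses: the function `fair_share`, the group names, and the proportional-weight rule are all hypothetical.

```python
def fair_share(total_slots, weights):
    """Split total_slots across groups in proportion to their weights."""
    total_weight = sum(weights.values())
    exact = {g: total_slots * w / total_weight for g, w in weights.items()}
    shares = {g: int(x) for g, x in exact.items()}  # round each share down
    leftover = total_slots - sum(shares.values())
    # Hand the remaining slots to the groups with the largest fractional parts.
    by_remainder = sorted(weights, key=lambda g: exact[g] - shares[g], reverse=True)
    for g in by_remainder[:leftover]:
        shares[g] += 1
    return shares


# Hypothetical groups and weights set by an administrator.
shares = fair_share(10, {"physics": 3, "chemistry": 2, "biology": 1})
print(shares)
```

Changing a group's weight changes its share on the next allocation, which is one simple way to picture priority management being under administrator control.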

Account Remapping

Allows for execution across administrative domains

Enhances security by using a restricted pool of users to run jobs on execute machines

Privilege Separation

Only a single, specialized, audited component requires root/administrator permissions on execute nodes

Parallel Universe

Provides an extensible framework for running parallel (including MPI) jobs

Co-allocation of compute nodes is done automatically

Framework implementation for MPICH1, MPICH2, and LAM provided

Java Universe

Explicit support of jobs written in Java

Time Scheduling for Job Execution (Cron)

Allows a job or multiple jobs to be started at specific times, with customizable policy for failures such as missed deadlines
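
A policy for missed deadlines can be pictured with a short Python sketch: a job is started only if the current time falls within a grace window after its scheduled time, and is otherwise reported as missed. The function `start_decision`, its return values, and the five-minute window are hypothetical and do not correspond to MRG Grid configuration settings.

```python
from datetime import datetime, timedelta


def start_decision(scheduled, now, window):
    """Decide what to do with a job scheduled for a specific time.

    Returns "wait" before the scheduled time, "start" within the grace
    window after it, and "missed" once the window has passed.
    """
    if now < scheduled:
        return "wait"
    if now - scheduled <= window:
        return "start"
    return "missed"


# Hypothetical schedule: run at noon, tolerate up to five minutes of delay.
scheduled = datetime(2024, 1, 1, 12, 0)
window = timedelta(minutes=5)
print(start_decision(scheduled, datetime(2024, 1, 1, 12, 3), window))
```

What happens after a "missed" result (skip the run, retry later, or notify someone) is exactly the part that a customizable failure policy would decide.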

Backfill

Allows otherwise unused nodes to run jobs provided by BOINC

File Staging

Support for automatic file staging (e.g. job input) and online file I/O (i.e. file streaming from submit to execute nodes) via Chirp and remote syscalls, in the absence of a shared filesystem

Master-Worker (MW)

A C++ framework allowing a single master process to allocate and manage multiple worker processes, which process data based on master-specified policies

Condor-C

Allows for jobs in one queue to be moved to another queue

Hawkeye

Allows for automated monitoring of one or more pools