IT Division Scientific Cluster Support - Service Level Agreement
I. Background
This document describes the IT Division support model for Linux-based
computational clusters. It is intended to outline the expectations and
limitations of this service.
Clusters require a high level of expertise to build and maintain, and
cluster systems have a large number of inherent failure points. We
leverage economies of scale and our experience supporting a variety of
cluster systems to offer a service that is both valuable and
inexpensive compared to other cluster support offerings. To do this
successfully, we have standardized on an implementation model that
provides scalability from the administrator's perspective and enough
customizability to meet many needs.
II. Introduction
A. Parties Involved
1) IT Division HPC Services Group
2) End-User
B. Purpose
The purpose of this Service Level Agreement (SLA) is to specify
the services and commitments of the Service Provider as well as the
expectations and obligations of the Customer.
III. Responsibilities and Metrics of Service Provider
A. The Service Provider agrees it will provide:
Basic Scientific Linux installation on the master node
Warewulf cluster implementation toolkit
MPI2 compatibility provided by OpenMPI
SLURM scheduler
Computer Room Space*
Purchase and procurement consulting support
Initial cluster build and setup
Cluster debugging and testing
Normal Operating System maintenance
Assistance with running user application code on the cluster
Basic training on how to use the cluster
CPPM Security compliance
System and network monitoring
Hardware monitoring using NHC (Node Health Check)
Faulty hardware replacement and troubleshooting
Cluster and related subsystem upgrades (as needed)
Crash recovery
* Note: Computer room space is provided to clusters in the HPCS Program
on a monthly recharge basis. Because of limited data center resources,
hardware must be removed five years after the date of purchase.
B. Hours of Operation
Business Hours: Monday through Friday 8am-6pm PST
IV. Responsibilities of the Customer
A. The customer agrees it will:
1. Select a point of contact (POC) and describe to the end users the
process for obtaining help or reporting problems.
2. Coordinate with the Service Provider on any major configuration
changes (e.g., network installation, changes in topology,
relocations, etc.).
3. Maintain site conditions within the recommended environmental
range for all systems, devices, and media covered.
4. Provide feedback to improve the service.
5. Develop end-user contingency operations plans and capabilities.
6. Identify what resources will be matrixed or transferred to the
Service Provider, if applicable.
7. Provide the Service Provider with access to equipment both
electronically (passwords) and physically (cardkey access, room keys),
as needed to provide service.
8. Provide authorization of Service Provider activities
(system upgrades, reboots, etc.).
9. Maintain final authority over the system(s) covered under this
agreement and remain aware of its responsibilities concerning the
operation of those system(s) under Laboratory RPM policy. This
includes computing security and backups.
B. To submit a request for help, the customer will:
1. Contact the IT Division Help Desk at x4357 or send email to
hpcshelp@lbl.gov to submit a request for help.
2. Include relevant contact information (e.g., name, organization,
location, system hostname).
3. Provide a description of the problem, its urgency, and
potential mission impact.
4. Be available to provide the Service Provider with additional
information as needed.
V. General Maintenance Responsibilities
A. The following areas of concern need to be resolved before a Service
Level Agreement can take effect.
1. Verification and setup of both the software and hardware of the
customer system(s).
2. Electronic and physical access to systems
B. Customer will be responsible for all expenses incurred for
hardware and peripheral maintenance.
C. Customer will be responsible for all expenses incurred for any
application-oriented software maintenance and licenses installed on
the system(s).
D. Customer with root access will void all service guarantees if their
actions are the direct cause of a system failure or security breach.
VI. Attachments
A. Definitions and Terminology
B. Lists of supported hardware and software
1. Cluster Hardware Requirements:
* All nodes utilize Intel x86-type architecture
* Minimum of 10 nodes
* Conforms to the standard Beowulf specification (one master node,
with slave nodes residing on a private subnet behind the master)
* Slave nodes do not support console logins, nor can they be
used as general workstations/servers
* All slave nodes are reachable only from the master node
* All slave nodes must support PXE boot using Warewulf
2. Cluster Software Requirements:
* Scientific Linux 6 operating system
* Warewulf cluster implementation toolkit
* SLURM job scheduler
* Intel compilers
* MPI2 compatibility provided by OpenMPI (see the sample program
sketched below)
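The sketch below is offered only to illustrate what running user
application code under this software stack typically looks like; it is
not part of the agreement. The file name hello_mpi.c and the mpicc and
srun commands in the comments are illustrative assumptions, and the
actual module names, partitions, and paths will be site-specific.

    /*
     * hello_mpi.c - minimal MPI test program (illustrative only).
     *
     * Compile with the OpenMPI wrapper, e.g.:  mpicc hello_mpi.c -o hello_mpi
     * Launch through SLURM, e.g.:              srun -n 4 ./hello_mpi
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, name_len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);                  /* start the MPI runtime */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's rank   */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of ranks */
        MPI_Get_processor_name(name, &name_len); /* compute node hostname */

        printf("Hello from rank %d of %d on %s\n", rank, size, name);

        MPI_Finalize();
        return 0;
    }

Under SLURM, a binary like this would normally be launched through the
scheduler (srun, or an sbatch batch script) rather than run directly on
the master node, so that the job is placed on the compute nodes.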
3. Cluster Storage Hardware
* Low cost: Linux server with LSI RAID controller and SATA disks
* Recommended: Bluearc or Network Appliance file server
* High performance parallel: IBM GPFS storage, or Lustre parallel filesystem on Data Direct
Networks storage hardware.
4. Clusters that will be located in the 50B-1275 computer room must
meet the following additional requirements:
* Rack-mounted hardware is required.
* Equipment must be installed in APC NetShelter 42U computer racks.
* Equipment cooling must be front (intake) to back (exhaust).
* Switched and metered 208V APC Rack PDUs are required.
Prospective cluster owners should include the cost of these racks
in their budget.
* Physical and root access is limited to HPCS staff
C. Exclusions
The HPCS program only provides support directly related to the
cluster. Additional support for other aspects of the user computing
environment is available on a Time and Materials basis.
No direct support is provided for application source debugging or
engineering.
Reinstallation of the cluster to an earlier OS release is not covered
by the SLA and will be done on a Time and Materials basis.
Backups are the responsibility of the cluster owner. Backups can be
provided by the IT Division at additional cost.
D. Service and Fees
1. Clusters are only managed under a monthly Service Level Agreement.
2. Costs depend on the cluster design. If all standards are
followed, the basic cost will be $300/mo. for the master node and
$15/mo. for each additional compute node (e.g., a master node plus
20 compute nodes = $600/month). There is an additional charge of
$300/mo. for clusters with a high-performance network fabric such
as InfiniBand. Storage servers are also charged at $300/mo. (The
standard-rate arithmetic is sketched at the end of this section.)
Important Note: For configurations outside the standard, there will be
either a time-and-materials charge for the difference or an increased
monthly premium. These costs can usually be identified and explained
during initial consultations. Please note that these are direct costs.
LBNL burdens depend on the type of project.
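For standard configurations, the monthly recharge described in item 2
above reduces to simple arithmetic. The sketch below works through the
example from the text (a master node plus 20 compute nodes, with no
high-performance fabric and no storage server). The program and
variable names are illustrative only; non-standard premiums,
time-and-materials work, and LBNL burdens are not modeled.

    /*
     * sla_cost.c - illustrative sketch of the standard monthly recharge.
     * Rates are those quoted in section VI.D of this SLA.
     */
    #include <stdio.h>

    #define MASTER_RATE   300  /* $/mo. for the master node                    */
    #define COMPUTE_RATE   15  /* $/mo. for each additional compute node       */
    #define FABRIC_RATE   300  /* $/mo. if a fabric such as InfiniBand is used */
    #define STORAGE_RATE  300  /* $/mo. for each storage server                */

    int main(void)
    {
        int compute_nodes   = 20; /* example from the text above           */
        int has_hpc_fabric  = 0;  /* 1 if InfiniBand or similar is present */
        int storage_servers = 0;

        int monthly = MASTER_RATE
                    + COMPUTE_RATE * compute_nodes
                    + FABRIC_RATE  * has_hpc_fabric
                    + STORAGE_RATE * storage_servers;

        /* Master node + 20 compute nodes = $600/month, matching the example. */
        printf("Estimated monthly recharge: $%d\n", monthly);
        return 0;
    }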