IT Division Scientific Cluster Support - Service Level Agreement

I. Background

This document describes the IT Division support model for Linux-based
computational clusters. It is intended to outline the expectations and
the limitations of this service.

Clusters require a high level of expertise to build and maintain. Cluster
systems also have a high number of inherent failure points. We are
leveraging the economy of scale and experience supporting various cluster
systems to offer a service that is both valuable and inexpensive compared
to other cluster support offerings. In order to successfully do this,
we have standardized on an implementation model that allows scalability
from the perspective of the administrator and customizability so it can
meet many needs.

II. Introduction
  A. Parties Involved
	1) IT Division HPC Services Group
	2) End-User
  B. Purpose
The purpose of this Service Level Agreement (SLA) is to specify
the services and commitments of the Service Provider as well as the
expectations and obligations of the Customer.

III. Responsibilities and Metrics of Service Provider

  A. The Service Provider agrees it will provide:
	Basic Scientific Linux installation on the master node 
	Warewulf cluster implementation toolkit 
	MPI2 compatibility provided by OpenMPI 
	SLURM scheduler 
	Computer Room Space*
	Purchase and procurement consulting support 
	Initial cluster build and setup 
	Cluster debugging and testing 
	Normal Operating System maintenance 
	Assistance with running user application code on cluster 
	Basic training on how to use cluster 
	CPPM Security compliance 
	System and network monitoring 
	Hardware monitoring using NHC
	Faulty hardware replacement and troubleshooting 
	Cluster and related subsystem upgrades (as needed) 
	Crash recovery

  * Note: Computer room space is provided to clusters in the HPCS Program
  on a monthly recharge basis. Because of limited data center resources,
  hardware must be removed 5 years after the date of purchase.
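
  For planning purposes, the removal deadline in the note above can be
  estimated with a short Python sketch. The function name is hypothetical,
  and the handling of a February 29 purchase date (rolled back to
  February 28) is an assumption, not something this agreement specifies.

```python
from datetime import date

HARDWARE_LIFETIME_YEARS = 5  # from the note: removal 5 years after purchase


def removal_deadline(purchase_date, years=HARDWARE_LIFETIME_YEARS):
    """Return the date by which hardware must leave the computer room."""
    try:
        return purchase_date.replace(year=purchase_date.year + years)
    except ValueError:
        # Assumption: a Feb 29 purchase maps to Feb 28 in a non-leap year.
        return purchase_date.replace(year=purchase_date.year + years, day=28)


if __name__ == "__main__":
    print(removal_deadline(date(2013, 3, 15)))
```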

  B. Hours of Operation
	Business Hours:  Monday through Friday 8am-6pm PST

IV. Responsibilities of the Customer
  A. The customer agrees it will:
	1. Select a POC and describe the process of obtaining help or
	reporting problems to the end users.
	2. Coordinate with the Service Provider on any major configuration
	changes (e.g. network installation, changes in topology,
	relocations, etc.).
	3. Maintain site conditions within the recommended
	environment range of all systems, devices, and media covered.
	4. Provide feedback to improve the service.
	5. Develop end-user contingency operations plans and capabilities.
	6. Identify what resources will be matrixed or transferred to the
	Service Provider, if applicable.
	7. Provide the Service Provider with access to equipment both
	electronically (passwords) and physically (cardkey access, room keys),
	as needed to provide service.
	8. Provide authorization of Service Provider activities
	(system upgrades, reboots, etc.).
	9. Customer maintains final authority over the system(s) covered
	under this agreement and will maintain awareness of their
	responsibilities concerning the operation of system(s) under
	Laboratory RPM policy. This includes computing security and backups.

  B. To submit a request for help, the customer will:
	1. Contact the IT Division Help Desk x4357 to submit a request for
	help or send email to
	2. Include relevant contact information (e.g. name, organization,
	location, system hostname).
	3. Provide a description of the problem, its urgency, and
	potential mission impact.
	4. Be available to provide the Service Provider with additional
	information as needed.

V. General Maintenance Responsibilities
  A. The following areas of concern need to be resolved before a Service
     Level Agreement can take effect.
	1. Verification and setup of customer system(s) of both software
	and hardware.
	2. Electronic and physical access to systems
  B. Customer will be responsible for all expenses incurred for all
     hardware and peripheral maintenance.
  C. Customer will be responsible for all expenses incurred for any
     application oriented software maintenance and licenses installed on
     the system(s).
  D. A Customer with root access will void all service guarantees if their
     actions are the direct cause of a system failure or security breach.

VI. Attachments
  A. Definitions and Terminology
  B. Lists of supported hardware and software
	1. Cluster Hardware Requirements: 
	* All nodes utilize Intel x86 type architecture 
	* Minimum of 10 nodes 
	* Conforms to the standard Beowulf spec (one master node, with
	  slave nodes residing on a private subnet behind the master) 
	* Slave nodes do not support console logins, nor can they be
	  used as general workstations/servers 
	* All slave nodes are reachable only from the master node
	* All slave nodes must support PXE boot using Warewulf

	2. Cluster Software Requirements
	* Scientific Linux 6 operating system 
	* Warewulf cluster implementation toolkit 
	* SLURM job scheduler 
	* Intel compilers
	* MPI2 compatibility provided by OpenMPI

	3. Cluster Storage Hardware
	* Low cost: Linux server with LSI RAID controller and SATA disks
	* Recommended:  Bluearc or Network Appliance file server
	* High performance parallel:  IBM GPFS storage, or Lustre parallel filesystem on Data Direct
	  Networks storage hardware.
	4. Clusters that will be located in the 50B-1275 Computer room must
	meet the following additional requirements:
	* Rack mounted hardware required.
	* Equipment is to be installed into APC NetShelter 42U computer racks.
	* Equipment cooling is front (intake) to back (exhaust)
	* Switched and metered 208V APC Rack PDUs 
	  Prospective cluster owners should include the cost of these racks
	  in their budget.
	* Physical and root access is limited to HPCS staff 

  C. Exclusions
	The HPCS program provides only support directly related to the
	cluster. Additional support for other aspects of the user computing
	environment is available on a Time and Materials basis.
	No direct support is provided for application source debugging/engineering.
	Reinstallation of the cluster to an earlier OS release is not covered
	by the SLA and will be done on a Time and Materials basis.
	Backups are the responsibility of the cluster owner.
	Backups can be provided by IT Division at additional cost.

  D. Service and Fees
	1. Clusters are only managed under a monthly Service Level Agreement.

	2. Cost factors can be dependent on the cluster design. If
	all standards are followed, the basic cost will be $300/mo.
	for the master node and $15/mo. for each additional compute node
	(e.g. Master node + 20 compute nodes = $600/month). There is an
	additional charge of $300/mo. for clusters with a high performance
	network fabric such as InfiniBand. Storage servers are also charged
	at $300/mo.
	Important Note: For configurations outside the standard, there will
	be either a time and materials charge for the difference or an
	increased monthly premium. These costs can usually be identified and
	explained during initial consultations. Please note that these are
	direct costs. LBNL burdens depend on the type of project.
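
	As an illustration of the fee schedule above, the following Python
	sketch computes the monthly direct cost for a standard configuration.
	The function and parameter names are hypothetical. It assumes the
	InfiniBand charge is a flat per-cluster fee and that the storage charge
	applies per storage server; this agreement does not state either detail
	explicitly. Non-standard configurations and LBNL burdens are not modeled.

```python
# Rates taken from section D.2 (direct costs, dollars per month).
MASTER_NODE_FEE = 300
COMPUTE_NODE_FEE = 15
INFINIBAND_FEE = 300       # assumption: flat fee per cluster
STORAGE_SERVER_FEE = 300   # assumption: charged per storage server


def monthly_fee(compute_nodes, infiniband=False, storage_servers=0):
    """Estimate the monthly direct cost of a standard cluster."""
    if compute_nodes < 0 or storage_servers < 0:
        raise ValueError("node and server counts must be non-negative")
    fee = MASTER_NODE_FEE + COMPUTE_NODE_FEE * compute_nodes
    if infiniband:
        fee += INFINIBAND_FEE
    fee += STORAGE_SERVER_FEE * storage_servers
    return fee


if __name__ == "__main__":
    # The example from the text: master node + 20 compute nodes.
    print(monthly_fee(20))  # 600
```

	Under the same assumptions, a 20-node cluster with InfiniBand and one
	storage server would come to monthly_fee(20, infiniband=True,
	storage_servers=1), or $1200/month.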