Critical factors

Five best practices are most critical to the optimal performance of a data center.

Operating and maintaining an enterprise-class data center is a serious and complex undertaking. Facilities must be secure and operate in a reliable and efficient manner. Several important factors contribute to the successful operation of a data center and to achieving these objectives on an ongoing basis. A few key industry best practices are critical to optimal performance; followed correctly, they keep an enterprise-class data center operating securely, reliably and efficiently.
 

Security

Use of monitoring systems. Monitoring systems are essential tools for data center personnel responsible for day-to-day operations. They help prevent or mitigate environmental risks that may arise in a facility. These tools monitor critical assets, including fire detection, closed-circuit television (CCTV), ingress/egress door activity, electrical systems, cooling, uninterruptible power supply (UPS) plants, generators and temperature detection, among others. It is impossible to have too much visibility into the health of a data center.

Highly trained technical personnel. While automated monitoring systems are essential, their effectiveness hinges on the capabilities of trained personnel who observe the data center at all times. Personnel responsible for overseeing the data center must have a thorough understanding of operational, safety and security requirements and of the procedures to follow should an event occur.

In addition to the automated monitoring systems, periodic physical walkthroughs of the facility should be conducted every few hours. This visual inspection is meant to ensure optimal operation and security of the data center. It includes manually checking all secure entrances, CRAC (computer room air conditioning) units and key infrastructure elements, such as the electrical room and telecommunications room, and walking through the raised access floor.

Many times a major incident can be averted by proactively responding to an initial alarm. All activities and events that occur within a data center should be logged and archived for trending purposes.
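As an illustration of logging events for trending, the following minimal Python sketch appends each event to a file and counts events per source. The one-JSON-object-per-line file format, the field names and the function names are assumptions made for this example, not features of any particular monitoring product.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def make_event(source, severity, message, when=None):
    """Build one log record; field names are illustrative only."""
    when = when or datetime.now(timezone.utc)
    return {
        "time": when.isoformat(),
        "source": source,
        "severity": severity,
        "message": message,
    }


def append_event(path, event):
    """Archive the event by appending it as one JSON line."""
    with open(path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(event) + "\n")


def events_per_source(path):
    """Count archived events by source to show which assets alarm most often."""
    counts = Counter()
    with open(path, encoding="utf-8") as log_file:
        for line in log_file:
            if line.strip():
                counts[json.loads(line)["source"]] += 1
    return counts
```

A source that repeatedly tops the count is a candidate for closer inspection on the next walkthrough.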
 

Preventive maintenance

Preventive maintenance is paramount to managing a reliable data center. Operators must be proactive in mitigating risks and in detecting problems to ensure data center operation is not adversely affected.

Every system within the center should be documented and supported with a preventive maintenance regimen. For example, generators should undergo quarterly load tests for at least four hours; this should be done in a manner that simulates the loss of the utility. Generator fuel should be tested by a qualified laboratory and filtered at least once per year. Load banking should be done at least once per year to make sure the generators will operate at a full load for at least six hours. These tests are in conjunction with generator manufacturer maintenance programs that are based on hours of operation.
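The generator regimen above can be sketched as a small schedule check. This is a minimal Python illustration, not a maintenance system: the class name, the 91-day approximation of a quarter and the sample dates are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class MaintenanceTask:
    name: str
    interval_days: int  # maximum days allowed between completions
    last_done: date

    def next_due(self) -> date:
        return self.last_done + timedelta(days=self.interval_days)


def overdue_tasks(tasks, today):
    """Return the names of tasks due before `today`, earliest due date first."""
    late = [task for task in tasks if task.next_due() < today]
    return [task.name for task in sorted(late, key=MaintenanceTask.next_due)]


# Hypothetical sample data: intervals follow the text, dates are invented.
GENERATOR_TASKS = [
    MaintenanceTask("load test, 4 h, simulated utility loss", 91, date(2014, 1, 6)),
    MaintenanceTask("fuel lab test and filtering", 365, date(2013, 3, 1)),
    MaintenanceTask("load bank test, full load, 6 h", 365, date(2013, 9, 15)),
]
```

Manufacturer programs based on hours of operation would need a separate run-hour counter, which this sketch does not model.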

UPS systems are essential to maintaining the critical load in a data center prior to the start of the generator. Typically, UPS systems will be full-load tested when the generators are tested. UPS components, including batteries, capacitors, fans and filters, should be replaced at the manufacturers’ recommended intervals. Batteries, where applicable, can have a significant impact on the reliability of the UPS itself. Batteries do fail and, if the failure goes undetected, a single bad battery can weaken the entire string and cause an outage. Battery voltages and temperatures should be observed before, during and after testing.
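One simple way to spot a weak battery in recorded readings is to compare each voltage with the median of its string. The Python sketch below assumes that approach; the function name and the 5 percent tolerance are hypothetical and should be replaced by the battery manufacturer's limits.

```python
from statistics import median


def weak_batteries(voltages, tolerance=0.05):
    """Return positions (0-based) of batteries reading more than
    `tolerance` (a fraction) below the median voltage of the string."""
    if not voltages:
        return []
    floor = median(voltages) * (1 - tolerance)
    return [index for index, volts in enumerate(voltages) if volts < floor]
```

For example, in a string reading 13.5, 13.4, 13.5, 12.1 and 13.6 volts, only the fourth battery falls below the 5 percent floor of 12.825 volts.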

All electrical switchgear, including automatic transfer switches (ATS), should be subject to testing. The ATS will be tested during the full-load test; however, quarterly preventive maintenance practices also should apply to these units. At a minimum, thermographic scans should be done for all breakers annually. This will identify any resistive heat problems caused by loose connections or potentially defective breakers. On a periodic basis, main breakers and switchgear should be bench tested to ensure proper operation. These tests include high-potential (hipot) testing and breaker injection testing. All breaker trip settings should be verified against the electrical coordination study that is applicable to that location.

Additional systems, such as smoke detection systems and fire suppression systems, should be included in the preventive maintenance schedule. The testing of these systems is subject to local code. As each of these systems is tested, the corresponding alarms and alarm names should be verified within the monitoring system.

Depending on the type of cooling systems used within the data center, preventive maintenance practices can vary. Common across most systems is the replacement of filters in the CRAC units. Other preventive maintenance activities are specific to the cooling method, equipment type and manufacturer.
 

24X7X365 staff and support

Full-time staff. The ideal scenario is to have full-time personnel working within the data center. Problems do occur in data centers from time to time; when this happens, it is essential to have access to well-trained technical personnel who can quickly isolate and resolve any problem. This ensures continuous uptime and reliability of the data center.

A strong relationship with key service providers and suppliers. Supplier response times should be well-documented, and personnel should be familiar with the infrastructure equipment being supported. Where feasible, the data center should keep emergency spare equipment in inventory or close by within the supplier’s inventory. This can help to reduce downtime in the event of a failure.

Documentation is sometimes overlooked when operating a data center. Current electrical and mechanical diagrams should be present in the electrical and mechanical rooms. Personnel should regularly review diagrams to remain familiar with the design. This can reduce the preparation time for all planned and unplanned activities. Manuals and building blueprints should be maintained in a central area. A duplicate set of all site documentation should be kept off-site in the event of a disaster. A consolidated contact and escalation list should be accessible in the event of an emergency.

All preventive and corrective activities that occur within the data center should be covered by a documented change control process. When performing any work within the data center, a detailed method of procedure (MOP) should be written to cover the entire process, including estimated times and roll-back procedures.
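The elements of a MOP named above (steps, estimated times and roll-back procedures) can be captured in a simple record. The Python sketch below is one hypothetical layout; the class and field names are invented for illustration and do not reflect any standard MOP template.

```python
from dataclasses import dataclass, field


@dataclass
class MopStep:
    description: str
    estimated_minutes: int
    rollback: str  # how to undo this step if it fails


@dataclass
class Mop:
    title: str
    steps: list = field(default_factory=list)

    def total_minutes(self):
        """Estimated duration of the whole procedure."""
        return sum(step.estimated_minutes for step in self.steps)

    def steps_missing_rollback(self):
        """1-based numbers of steps with no roll-back procedure written."""
        return [
            number
            for number, step in enumerate(self.steps, start=1)
            if not step.rollback.strip()
        ]
```

A change control review could refuse any MOP for which steps_missing_rollback() returns a non-empty list.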
 

Reliability

Efficiency is in a delicate balance with reliability. Efficiency can be achieved only if the reliability criterion is met.

Temperature control. Temperature control has the biggest impact on the efficiency of the data center. How far efficiency can be improved may be limited by the pre-existing building and cooling infrastructure. Wireless temperature and humidity sensors should be placed throughout the facility to obtain better visibility into this area. A computational fluid dynamics (CFD) study can provide additional information on the airflow within the data center.

It is important to understand the cooling watts density of the data center and take corrective action to reduce the load if the density design capacities are being exceeded. All power provisioning should take the watts density into account to avoid future heat problems and cooling inefficiencies.
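As a worked example of the density check, the Python sketch below assumes that watts density means equipment load in watts divided by floor area in square feet. The function names and the figures are hypothetical.

```python
def watts_density(load_watts, floor_area_sqft):
    """Load per unit of floor area, in watts per square foot."""
    if floor_area_sqft <= 0:
        raise ValueError("floor area must be positive")
    return load_watts / floor_area_sqft


def load_to_shed(load_watts, floor_area_sqft, design_density):
    """Watts that must be removed to return to the design density;
    0 when the room is within its design capacity."""
    capacity_watts = design_density * floor_area_sqft
    return max(0.0, load_watts - capacity_watts)
```

With these assumptions, a 5,000-square-foot room carrying 600,000 watts runs at 120 watts per square foot. If the room was designed for 100 watts per square foot, its capacity is 500,000 watts, so 100,000 watts must be shed or moved.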

The CRAC units need to be synchronized to ensure they are not in conflict with each other. Various manufacturers offer synchronization within their systems. Improved perforated floor tiles are available for raised floors to direct more air to areas where it is most needed.

Historical tracking of cooling efficiencies and heat load. Tracking can provide a means to document the tangible impact of various changes. Server workloads constantly change within data centers. To maintain an efficient cooling system, the ambient temperature must be monitored regularly and adjusted as required.
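One minimal way to document the impact of a change is to compare average readings before and after it. The Python sketch below assumes a simple list of temperature readings in time order; the function name and the sample values are hypothetical.

```python
from statistics import mean


def change_impact(readings, change_index):
    """Split `readings` at `change_index` and return
    (mean_before, mean_after, difference)."""
    before = readings[:change_index]
    after = readings[change_index:]
    if not before or not after:
        raise ValueError("need readings on both sides of the change")
    mean_before, mean_after = mean(before), mean(after)
    return mean_before, mean_after, mean_after - mean_before
```

For readings of 24.0, 24.5 and 25.0 degrees before a change and 23.0, 22.5 and 22.0 degrees after it, the averages are 24.5 and 22.5, a drop of 2.0 degrees. A simple before-and-after average does not account for workload changes over the same period, so heat load should be tracked alongside temperature.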
 

Third-party audits

The successful operation of a data center starts with applying these best practices and internal checks and balances. However, it is important to bring in independent third-party auditors to verify and attest that the appropriate practices and procedures are being followed.

With these best practices in place, data centers can maintain efficient and reliable operations.

 


Colleen Plank is a technical writer with Expedient Data Centers, Garfield Heights, Ohio. She can be contacted at colleen.plank@expedient.com.

April 2014