Tuesday, June 17, 2008

Dedicated Hardware is Wasted Hardware (and Power)

When mapping non-trivial software systems onto physical machines, there is a strong temptation to dedicate (usually over-powered) hardware to each distinct component within the system. This sort of segmented topology is convenient because each machine's purpose is easy to understand and each machine's full capacity is devoted to that purpose.

While this approach has its benefits, it can be extremely wasteful of physical resources (CPU, storage, and memory capacity) and of the power used to keep them running. Most applications are expected to be available around the clock, but in reality they are seldom heavily used all of the time. During relatively idle periods, the machines are effectively reduced to space heaters, a double-whammy waste of power: first to run the machine, then to cool it.

The right way to solve the wasted-resource problem differs from one situation to the next, depending on factors such as load pattern, number of distinct components, and size of the overall Data Center; there's no one-size-fits-all answer. The blade server/virtualization combination, which allows virtual machines to be migrated between physical servers on the fly, can play a role in some Data Centers. It is critical that the overall solution be properly designed and periodically reviewed against real metrics to maximize cost savings.


Let's look at an example company to see how the blade server/virtualization combination can save power (and therefore money) while simultaneously increasing application availability.

TeenBlabby.com (a fictitious company) is a live chat website targeted towards teenagers. Their production website currently runs on a two-node cluster of moderately powerful servers. They also have a two-node development cluster on similar hardware used entirely by employees. Here is a CPU utilization graph for the four servers:

In TeenBlabby.com's current configuration there are four servers heating the Data Center 24 hours a day.

The production servers peak at roughly 80% CPU utilization at around 8:30 PM. While the cluster can easily handle the peak load under normal conditions, if one of the servers were to experience a failure during the period from 2:00 PM to 1:00 AM, the remaining server could not effectively handle the full load on the production website.

The development servers peak at roughly 45% at around 12:30 PM. In contrast to the production cluster, this one can just about handle the full load in the event of a single-node failure.
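The failover arithmetic behind those two observations can be sketched in a few lines of Python. This is only a rough model: it assumes the failed node's load shifts entirely and evenly onto the survivors and that CPU demand scales linearly, and the function name `surviving_load` is made up for illustration.

```python
def surviving_load(peak_per_node_pct, nodes=2, failed=1):
    """Return the CPU load (as a percent of one machine) that each
    surviving node must carry after `failed` nodes drop out at peak.

    Assumes the total load is spread evenly across the survivors.
    """
    survivors = nodes - failed
    if survivors < 1:
        raise ValueError("at least one node must survive")
    total_pct = peak_per_node_pct * nodes
    return total_pct / survivors


# Peak figures taken from the utilization description above.
production = surviving_load(80)   # two nodes at ~80% each
development = surviving_load(45)  # two nodes at ~45% each

print(f"production survivor:  {production:.0f}% of one machine")
print(f"development survivor: {development:.0f}% of one machine")
```

Under these assumptions the lone production survivor would need about 160% of a machine, which it does not have, while the lone development survivor would sit at about 90%, which is tight but workable.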

Now let's look at how these applications would look in a blade server/virtualization configuration. Here's a graph of the aggregate CPU utilization for the four virtual servers:

The coincident load peak is just above 200% at roughly 7:00 PM. This means that at peak aggregate usage, all four virtual servers together are using just over two machines' worth of CPU. Instead of running four full servers drawing full power all of the time, this new configuration with its migrating virtual servers never runs more than three servers and usually runs two or fewer. Not only does this save power because unused hardware is turned off, it also adds fault tolerance to the production cluster even during peak usage periods, because the virtual servers on a failed machine can be moved to other hardware.
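As a rough illustration of how the number of powered-on servers follows from the aggregate graph, here is a small Python sketch. It assumes each physical server contributes 100% of CPU capacity and ignores virtualization overhead and headroom; the function name `hosts_needed` and every sample reading except the 7:00 PM peak are hypothetical.

```python
import math


def hosts_needed(aggregate_pct, capacity_pct=100.0):
    """Smallest number of physical servers whose combined CPU capacity
    covers the aggregate load. At least one server always stays on."""
    return max(1, math.ceil(aggregate_pct / capacity_pct))


# Aggregate CPU readings in percent of one machine. Only the 19:00
# value ("just above 200%") comes from the text; the others are
# made-up placeholders for illustration.
samples = {"04:00": 40, "12:30": 150, "19:00": 205}

for time_of_day, pct in samples.items():
    print(f"{time_of_day}: {pct}% aggregate -> {hosts_needed(pct)} server(s) powered on")
```

A real deployment would want some headroom on top of this bare minimum, so the thresholds used to power servers on and off should come from measured data rather than from a simple ceiling.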
