cloud server != cloud server
One of the more interesting facets of my work at Viximo is tracking EC2 cluster performance. Below you’ll see a graph of data collected and displayed through New Relic.

The graph shows throughput and average response time for our cluster over a three-hour period. The vertical bar in the center marks a feature hotfix that most certainly should not have affected performance. So why would response time drop so dramatically, from 33ms to 27ms, a (33 - 27) / 33 = 18% difference?
Spinning up instances in EC2 is cheap, and restarting Passenger starves requests while Rails initializes and is then forked into workers. So instead of restarting Apache, we spin up a new batch of servers, wait until they’ve settled, then re-point the load balancer. What the graph is really illustrating is the difference in cloud hardware from one batch of instances to the next.

The AWS zone we’re in appears to have three different classes of hardware for small/medium instances: fast, slow and screwy, as illustrated above. Slow as in 40% slower than the fast instance. Screwy as in "what’s up with that reported CPU?" (a known Ubuntu kernel bug). At first we thought the performance differences between instances could be explained by something obvious like multi-tenancy. After lengthy capacity testing, however, we found that the hardware reported in /proc/cpuinfo mattered significantly more than other factors such as multi-tenancy. The variation between a fast and a slow server could be 25ms vs 40ms average response time. Within the fast hardware class, multi-tenancy and other factors explain the observed range of 23-28ms average response times.
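To make the /proc/cpuinfo comparison concrete, here is a minimal sketch of how instances could be bucketed by the CPU they report. The function names and the idea of collecting each host’s cpuinfo text into a dict are assumptions for illustration, not the tooling we actually used.

```python
from collections import defaultdict


def cpu_model(cpuinfo_text):
    """Return the first 'model name' value in /proc/cpuinfo text, or None."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            return value.strip()
    return None


def local_cpu_model(path="/proc/cpuinfo"):
    """Read the CPU model of the machine this runs on (Linux only)."""
    with open(path) as f:
        return cpu_model(f.read())


def group_by_cpu(cpuinfo_by_host):
    """Map each reported CPU model to the sorted list of hosts that report it.

    cpuinfo_by_host is a hypothetical dict of {hostname: cpuinfo text},
    gathered however you normally run commands across the cluster.
    """
    groups = defaultdict(list)
    for host, text in cpuinfo_by_host.items():
        groups[cpu_model(text)].append(host)
    return {model: sorted(hosts) for model, hosts in groups.items()}
```

Lining up the resulting groups against per-instance response times is one way to see whether the hardware class, rather than a noisy neighbor, accounts for a slow box.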
Back to the initial graph: the average response time for the 6 instances is (25.2 + 24.3 + 41.4 + 25.1 + 26.5 + 29.9) / 6 = 28.7ms. Now change two of the fast instances into slow instances (e.g. 24.3 => 41.1 and 25.1 => 40.5), and the average becomes (25.2 + 41.1 + 41.4 + 40.5 + 26.5 + 29.9) / 6 = 34.1ms. Boom, we’ve got a (34.1 - 28.7) / 34.1 = 16% difference.
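The same arithmetic as a short script, using only the per-instance averages quoted above (the variable names are mine):

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Per-instance average response times in ms, as quoted in the post.
before = [25.2, 24.3, 41.4, 25.1, 26.5, 29.9]
# Same cluster with two fast instances swapped for slow ones.
after = [25.2, 41.1, 41.4, 40.5, 26.5, 29.9]

avg_before = mean(before)
avg_after = mean(after)
# Relative to the slower average, matching the convention used in the post.
relative_change = (avg_after - avg_before) / avg_after

print(f"{avg_before:.1f} ms -> {avg_after:.1f} ms, {relative_change:.0%} difference")
# prints: 28.7 ms -> 34.1 ms, 16% difference
```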
So what does this mean for Viximo’s day-to-day operations? Most of the time the differences are ignored. Occasionally we find it worth the effort to weed out slow servers for long-running, rarely modified services. And sometimes it really does matter, such as in a Ruby app tier where CPU is the limiting factor and the CPU differences can either degrade performance or, conversely, cost us money. For those tiers we must capacity test each class of hardware, then factor the observed ratios into cluster sizing.
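As a rough sketch of what factoring in observed ratios can look like: the code below assumes a CPU-bound tier where an instance’s capacity scales as 1 / average response time, and it reuses the 25ms and 40ms figures from above as stand-ins for real capacity-test results. The function names and the "fast-instance equivalents" unit are illustrative assumptions, not our actual sizing tool.

```python
import math

# Stand-in capacity-test results: average response time (ms) per hardware class.
RESPONSE_MS = {"fast": 25.0, "slow": 40.0}


def capacity_ratio(hw_class, baseline="fast"):
    """Capacity of one instance relative to one baseline instance.

    Assumes throughput is inversely proportional to response time,
    which only holds when CPU is the limiting factor.
    """
    return RESPONSE_MS[baseline] / RESPONSE_MS[hw_class]


def effective_capacity(counts):
    """Cluster capacity in fast-instance equivalents, given {class: count}."""
    return sum(n * capacity_ratio(hw_class) for hw_class, n in counts.items())


def extra_instances_needed(counts, required, fill_with="slow"):
    """Instances of class fill_with to add to reach `required` equivalents."""
    shortfall = required - effective_capacity(counts)
    if shortfall <= 0:
        return 0
    return math.ceil(shortfall / capacity_ratio(fill_with))
```

With these numbers a slow instance counts as 0.625 of a fast one, so a batch of 4 fast and 2 slow instances is worth 5.25 fast instances rather than 6.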
