AWS Monitoring, Understanding the Tools

Wed, 25th September 2013, 18:37

Almost every web host seems to be touting some form of cloud hosting service. New marketing ploys and falling prices, more reminiscent of the shared hosting marketplace, raise questions about whether you are getting what you think you are paying for in the virtual world. There is a nagging feeling that someone has sold more virtual slices of the pie than there is real pie. The way to know what you are getting is to monitor your slice. This is good practice whether you are in a virtual environment or on dedicated hardware. Otherwise, you are probably wasting resources or underserving your clients without knowing it.

A quick primer on AWS virtual resources: if you purchase an instance type like t1.micro or m1.small, some of the documentation describes what comparable dedicated hardware would look like. That comparison is for performance estimation only. It does not mean that you are getting dedicated performance. You are buying a virtual slice that will behave similarly as long as everyone sharing the physical host is using theirs normally.

Not everyone will be using their slice normally, including you, sometimes. To understand what you are looking for, here are a couple of monitoring views to consider.

Amazon markets itself as 'a snap to set-up and our Amazon EC2/AWS monitoring tool adjusts automatically as your configurations changes. In essence, we do the work for you.' 

AWS does provide a basic monitoring tool called CloudWatch. It gives you basic status monitoring of your virtual instance for stats such as average CPU utilization percentage, disk reads and writes, network bytes in and out, and summary counts of disk operations and status checks. Most admins are interested in CPU usage, to know whether they need to scale their operations up or down. In some specialized cases disk or network usage is the bottleneck, but if you are a special case you already know what you are looking for.
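As a rough illustration of pulling that average CPU figure programmatically, here is a minimal sketch. It assumes the boto3 library and configured AWS credentials, neither of which this article covers. The helper names, the default region, and the instance ID in the usage note are placeholders for illustration only.

```python
from datetime import datetime, timedelta, timezone


def build_cpu_query(instance_id, hours=1, period_seconds=300):
    """Build keyword arguments for CloudWatch's GetMetricStatistics call."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": start,
        "EndTime": end,
        "Period": period_seconds,  # 300 s matches basic (5-minute) monitoring
        "Statistics": ["Average", "Maximum"],
    }


def summarize(datapoints):
    """Return (mean of the averages, highest maximum), or None if empty."""
    if not datapoints:
        return None
    averages = [point["Average"] for point in datapoints]
    peaks = [point["Maximum"] for point in datapoints]
    return sum(averages) / len(averages), max(peaks)


def fetch_cpu_summary(instance_id, region="us-east-1"):
    """Query CloudWatch. Assumes boto3 is installed and credentials are set."""
    import boto3  # imported lazily so the helpers above work without it

    client = boto3.client("cloudwatch", region_name=region)
    response = client.get_metric_statistics(**build_cpu_query(instance_id))
    return summarize(response["Datapoints"])
```

Calling fetch_cpu_summary("i-0123456789abcdef0") with a real instance ID in place of that placeholder would return the mean and peak CPU utilization for the last hour. Looking at the maximum alongside the average helps show short spikes that an average alone hides.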

Looking closer, average CPU utilization may not tell the whole story. Digging deeper on a Linux instance, you can run the “top” command and see a richer set of CPU utilization percentages:
  • User – time spent running your own processes and applications
  • System – time spent in the operating system kernel on behalf of those processes
  • Interrupt – hardware interrupts beyond your control
  • Wait – time your instance spent waiting on input or output jobs to end
  • Steal – time your virtual machine spent waiting because the hardware was otherwise occupied
  • Idle – everyone is happy and there is nothing to do 
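On Linux, figures like these come from the cumulative counters on the first line of /proc/stat. The sketch below shows one way to turn two snapshots of that line into percentages, assuming a kernel recent enough to report the steal field. The function names are made up for this example, and top additionally splits out "nice" and software interrupts, which are kept as separate entries here.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn the aggregate 'cpu' line of /proc/stat into a dict of tick counts."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Guest time is already included in 'user', so the trailing fields are skipped.
    return dict(zip(FIELDS, (int(value) for value in parts[1:1 + len(FIELDS)])))


def cpu_percentages(before, after):
    """Percentage of time spent in each state between two snapshots."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("snapshots must be taken at different times")
    return {name: 100.0 * ticks / total for name, ticks in deltas.items()}


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat (Linux only)."""
    with open(path) as handle:
        return handle.readline()
```

To use it, call parse_cpu_line(read_cpu_line()) twice, a few seconds apart, and pass both results to cpu_percentages. The "steal" entry of the result is the number to watch on a shared host.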

This is a view of the real hardware from the perspective of your virtual instance, so your instance is unlikely ever to get close to 100% utilization here. It is sharing with everyone else and will only be allowed some of the processor time, based on the hypervisor's sharing algorithm.

So what's going on when your virtual monitor shows at or near 100% utilization but the hardware-level monitor reports that your instance is running at a low percentage like 30%? Remember, you only have a virtual slice, and it can max out if you are running a computationally heavy operation.

You have reached the limit of your slice according to the hypervisor, which is reported as 100% at the virtual level. Your instance now must give up some processing time and share with everyone else in the neighborhood. The hypervisor has decided that your hardware allocation, 30% in this example, is the fair solution to keep everything running.
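One way to reconcile the two numbers is to treat stolen time as time the instance was never offered, and to measure busy time only against what was left. The sketch below is an illustration of that arithmetic under this assumption. It is not a documented description of how CloudWatch computes its metric, and the function name is invented for the example.

```python
def slice_utilization(busy_pct, idle_pct):
    """Busy time as a share of the time the instance was actually given.

    busy_pct and idle_pct are hardware-level percentages as seen in top;
    stolen time is whatever remains and is deliberately left out.
    """
    available = busy_pct + idle_pct
    if available <= 0:
        raise ValueError("the instance was given no CPU time at all")
    return 100.0 * busy_pct / available


# Hardware view: 30% busy, 0% idle, so 70% of the time was stolen.
print(slice_utilization(30.0, 0.0))   # 100.0: the slice is maxed out
# Same 30% busy, but 70% idle and nothing stolen.
print(slice_utilization(30.0, 70.0))  # 30.0: the slice has room to spare
```

The same 30% at the hardware level can therefore mean either a saturated slice or a mostly idle one, which is why the steal figure matters.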

Steal is the important stat for understanding how you are getting along with the neighbors. It is the percentage of time your CPU access has been blocked because the CPU is being used for something else. Remember, this is a shared sandbox. High steal doesn't always mean that someone is taking the portion of the hardware that you are paying good money for. It could be blocked because you have maxed out your share. It could also be blocked because someone else is pushing the boundary in another instance, which is allowed in some cases for short bursts of activity.

To confirm whether it is you or the other guys when the steal is consistently running high, restart your instance on different hardware (for an EBS-backed EC2 instance, a stop and start, rather than a reboot, typically does this). If the steal is still running high, you likely need to consider bumping up your service plan to get more CPU resources. If the steal is no longer high, then you left a bad neighborhood and everything should be fine.
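That before-and-after check can be written down as a small rule of thumb. The sketch below is only an illustration: the 10% threshold and the 80% "consistently" cutoff are arbitrary assumptions, not AWS guidance, and the function names are hypothetical.

```python
def consistently_high(steal_samples, threshold_pct=10.0, min_fraction=0.8):
    """True if at least min_fraction of the samples reach threshold_pct."""
    if not steal_samples:
        raise ValueError("need at least one steal sample")
    high = sum(1 for sample in steal_samples if sample >= threshold_pct)
    return high / len(steal_samples) >= min_fraction


def diagnose(before_move, after_move, threshold_pct=10.0):
    """Compare steal samples taken before and after moving to new hardware."""
    if not consistently_high(before_move, threshold_pct):
        return "steal was not consistently high; no action suggested"
    if consistently_high(after_move, threshold_pct):
        return "still high after the move; consider a larger instance type"
    return "steal dropped after the move; the old host was likely contended"


print(diagnose([20.0, 25.0, 30.0], [1.0, 2.0, 0.0]))
# steal dropped after the move; the old host was likely contended
```

Collect the samples over a long enough window that a short burst from a neighbor does not dominate the result, and tune the thresholds to your own workload.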
