Building massive IaaS clouds

December 10, 2012 - 2:24 pm by sysadmin1138

David Nalley taught a class today on building massive IaaS clouds. “Massive” is a bit of a loose term, but the consensus is that it means “over 1000 physical nodes”. The session was fairly CloudStack-centric, but the overall issues facing such deployments are common no matter which framework is being used.

A key quote:

“Getting to thousands of physical hosts is complex -- getting to tens of thousands of physical hosts is a completely different magnitude of problem.”

When you’re working with large installations of this type you start running into scaling problems with systems we’ve all presumed to be good enough for anything. The 4096 VLAN limit is a rather significant problem for these types of systems, so providing network isolation to truly large installations takes creative thinking. Shared storage of any kind runs into scaling walls that have to be worked around.
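To put a number on the VLAN limit, here is a rough sketch of the arithmetic. The 12-bit ID field and the two reserved values come from IEEE 802.1Q; the function names, the tenant counts and the one-VLAN-per-tenant assumption are mine for illustration, not something from David's session.

```python
# Back-of-the-envelope VLAN capacity check.
# An 802.1Q tag carries a 12-bit VLAN ID, and IDs 0 and 4095 are reserved.
VLAN_ID_BITS = 12
RESERVED_VLAN_IDS = 2  # 0 and 4095


def usable_vlan_ids():
    """Number of VLAN IDs actually available for tenant isolation."""
    return 2 ** VLAN_ID_BITS - RESERVED_VLAN_IDS


def tenants_left_without_vlan(tenants, vlans_per_tenant=1):
    """How many tenants cannot get isolated networks (hypothetical model:
    every tenant needs the same fixed number of dedicated VLANs)."""
    capacity = usable_vlan_ids() // vlans_per_tenant
    return max(0, tenants - capacity)


if __name__ == "__main__":
    print("usable VLAN IDs:", usable_vlan_ids())
    print("unserved tenants out of 10,000:", tenants_left_without_vlan(10_000))
```

Even at one VLAN per tenant, a single layer-2 domain runs out well before a large cloud runs out of customers, which is why isolation at this scale needs something other than plain VLANs.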

When building an IaaS cloud, one of the most important decisions you will make is which cloud framework to use. CloudStack, to which David contributes, is one, but others such as OpenStack, vCloud and Eucalyptus are on the market. Due to its market penetration, the Amazon API is something that a lot of the industry is already familiar with, so some of these frameworks attempt to emulate the same API or at least follow a similar design philosophy.

“Amazon has customers with larger [AWS spend] than their nearest competitor.”

One controversial item David brought up was the use of local storage (a.k.a. Direct Attached Storage) in these very large deployments. It shows up a lot. Those of us who’ve been building internal virtualization platforms based on large centralized storage networks found this hard to believe, but David was firm on this.

Consider this, though. If you have a truly large IaaS system, 30,000 nodes (that’s nodes, not Virtual Machines) for example, what does one disk dying mean in the course of a day? Extremely little; it happens several times a day. At that scale you’re already going to have to handle random VM failures, and the same mechanism can deal with storage disappearing.
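The “several times a day” figure is easy to sanity-check. This is a minimal sketch, not numbers from the talk: the disks-per-node counts and the 3% annual failure rate are assumptions I picked for illustration, and the function name is made up.

```python
# Rough estimate of how many disks die per day in a large fleet.
# ASSUMPTIONS (not from the session): disks per node and annual failure rate.
DAYS_PER_YEAR = 365


def expected_daily_disk_failures(nodes, disks_per_node, annual_failure_rate):
    """Average disk failures per day, assuming failures are spread
    evenly across the year and every disk fails at the same rate."""
    total_disks = nodes * disks_per_node
    return total_disks * annual_failure_rate / DAYS_PER_YEAR


if __name__ == "__main__":
    for disks in (1, 4):
        per_day = expected_daily_disk_failures(30_000, disks, 0.03)
        print(f"{disks} disk(s) per node: about {per_day:.1f} failures per day")
```

Under those assumptions, 30,000 nodes lose somewhere between two and ten disks every day, so a dead disk is routine background noise rather than an incident.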

This is a radical departure for those of us who’ve only experienced “small” clouds, and an idea that will take some getting used to. Shared storage definitely does have a place (CloudStack uses it for holding Templates and Snapshots), but it is not the foundation many of us thought it would be for such a large system.

David did a fair job conveying the completely different mindset needed to handle scaling, but it is definitely a topic that will take repeated exposure for some of us to get. It truly is a Large Infrastructure, and things are different there.
