It seems that big data and the cloud go together like hand in glove. In fact, one of the major selling points of traditional virtualized cloud platforms is that they offer a convenient way to deploy the flexible infrastructure required for big data processing. Without elastic scaling, big data processing becomes much more expensive. And so, in the minds of many, big data has come to be seen as inseparable from cloud platforms like AWS: if you need to do big data analytics, you need the cloud.
And you do need to be doing big data analytics. It’s a much-hyped term and the big data nut is certainly not entirely cracked as yet, but the advantages are undeniable. We’re generating ever more data, and we now have the means to leverage that data for valuable insights that can lead to business benefits ranging from operational and logistical improvements to new products.
When we talk about big data, we usually think about applications like Hadoop and MongoDB that are built expressly for the purpose of storing and processing huge amounts of data over distributed infrastructure platforms. So we have three components: the data, cloud infrastructure, and platforms like Hadoop.
What’s not often mentioned by virtualized cloud cheerleaders is that traditional cloud infrastructure platforms are not the optimal solution for running big data applications like Hadoop. Obviously, big data requires masses of data to be stored and moved around quickly, which means big data analytics put a serious burden on infrastructure I/O.
One of the major deficiencies of multi-tenant virtual cloud platforms is that they are not optimized for high-performance I/O.
In an ideal situation, Hadoop and the like would run in an environment where there was no contention for disk access, and where the application has direct access to the buses and I/O ports of the server to maintain throughput.
Virtualized cloud platforms fail to provide optimal data throughput for two main reasons:
- They are multi-tenant environments — most public cloud platforms run dozens of virtual machines on a physical server, each of which has its own I/O requirements managed by the hypervisor. This is far from ideal for applications that require frequent reading and writing of large volumes of data.
- Cloud platforms use networked storage — almost all public cloud platforms use storage area networks to provide storage for virtual machines. The drives are not attached to the servers over fast local buses, they’re attached over network connections that are slower, less reliable, and suffer contention issues with other users of the cloud.
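The I/O gap described above is easy to observe for yourself. As a minimal sketch (file name and sizes are illustrative, not from any benchmark), the following Python snippet times a sequential write to whatever storage backs the instance; running it on a virtual machine with network-backed volumes and on a bare metal server with local disks will typically show a noticeable difference in MiB/s:

```python
import os
import time

# Illustrative settings — tune the size to your environment.
FILE_PATH = "throughput_test.bin"
BLOCK = b"\0" * (1024 * 1024)  # one 1 MiB block
BLOCKS = 128                   # 128 MiB written in total

def sequential_write_mibps(path: str) -> float:
    """Write BLOCKS 1 MiB blocks sequentially, fsync, and return MiB/s."""
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(BLOCKS):
            f.write(BLOCK)
        f.flush()
        os.fsync(f.fileno())  # force data to the storage device
    elapsed = time.perf_counter() - start
    return BLOCKS / elapsed  # each block is 1 MiB, so this is MiB/s

if __name__ == "__main__":
    try:
        print(f"sequential write: {sequential_write_mibps(FILE_PATH):.1f} MiB/s")
    finally:
        os.remove(FILE_PATH)
```

This is a crude measurement — purpose-built tools like fio give far more detailed numbers — but even a rough test like this tends to expose the penalty of hypervisor-mediated, network-attached storage.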
Bare metal clouds are the obvious solution. Bare metal clouds offer similar levels of flexibility to virtualized cloud platforms but dispense with the virtualization layer, the slow connections to data storage, and the resource contention issue — bare metal cloud servers are dedicated to one user. That makes them ideal for applications like big data analytics that require every ounce of performance and data throughput that servers can muster.
According to Charles Babcock of InformationWeek:
“Bare metal is well suited to tasks that require frequent imports of large amounts of data, such as applying inserts and updates to the database and doing quick analyses with export of results, such as analyzing activity on a social networking or large e-commerce site. In other words, bare metal shines on big data tasks associated with lots of I/O.”
Virtualized cloud platforms have long ridden the coattails of the boom in big data analytics simply because they offer a convenient and flexible way to deploy infrastructure, but they are far from the only game in town. Bare metal clouds are much better suited to the processing and data I/O requirements of big data.