Hadoop on the SAN? That is, a hot, disruptive technology demanding new architecture paired with core IT's pet investment? Don't do it!
Many of the articles I write are based on projects I'm currently engaged in. Recently, for example, I've found myself recruited in the war against the almighty SAN. You see, with big data projects involving Hadoop, when it's time to procure hardware, you have to do something that many IT organizations haven't done in years: Buy servers with local disks.
That's because the "locality" of resources is central to Hadoop's performance, while a SAN, by definition, consolidates storage on its own network. Yet buying servers with local disks flies in the face of IT organizations' nearly decade-old practice of purchasing only diskless blades and virtualized storage. Tradition dies hard, which is why some of us have reluctantly said, "Yes, you can run Hadoop with a SAN," then added, under our breath, "... but you shouldn't."
I've done this myself, figuring we'd kick off the project and show how we could "optimize" to local disks later. Let me say this unequivocally: You absolutely should not use a SAN or NAS with Hadoop. To understand why this is such a terrible idea, you have to understand a little about how MapReduce and HDFS work.
First off, HDFS is a distributed file system. Think of it as RAID over the network. What if you had a 10GB file spread across 10 servers, and each server's disk could burst 1GBps? Assume your network can sustain that throughput as well. If you read the whole file from one server at 1GBps, it would take 10 seconds. But if each server streams its own 1GB piece in parallel, the whole file arrives in roughly one second. In essence, that is what HDFS allows: a burst from your cluster that's bigger than the burst you could get from any individual node.
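The back-of-the-envelope arithmetic above can be sketched in a few lines (the figures are the illustrative numbers from the example, not benchmarks):

```python
# Illustrative arithmetic: sequential vs. parallel (HDFS-style) reads.
# Figures match the example above: 10GB file, 10 servers, 1GBps per disk.
file_gb = 10    # total file size in GB
servers = 10    # nodes, each holding a 1GB block of the file
disk_gbps = 1   # burst throughput of each server's local disk, GB/s

# Reading the whole file from a single server's disk:
sequential_seconds = file_gb / disk_gbps
# Reading all blocks at once, one from each server's disk:
parallel_seconds = (file_gb / servers) / disk_gbps

print(sequential_seconds)  # 10.0
print(parallel_seconds)    # 1.0
```

The moment you put those "local" disks on a SAN, all 10 streams contend for the same storage network instead of 10 independent spindles, and the parallel figure stops holding.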
Secondly, the principal idea behind MapReduce is that the problem is broken up into pieces and sent to each node. From there, the answers are calculated in parallel, then sent back and combined (reduced). If you add in network hops and latency, along with multiple nodes contending for the same resources, you sort of defeat the "high performance" reason for choosing Hadoop in the first place.
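The split/compute/combine flow can be sketched with a toy word count, the canonical MapReduce example. This is plain Python standing in for a real Hadoop job, and the function names are mine, not Hadoop's:

```python
# A minimal MapReduce-shaped sketch: split, map in parallel, then reduce.
from collections import Counter
from functools import reduce

def map_phase(chunk):
    """Map: each node counts words in its own local chunk of the data."""
    return Counter(chunk.split())

def reduce_phase(left, right):
    """Reduce: partial counts from two nodes are merged into one answer."""
    return left + right

# The problem is broken up into pieces, one per node...
chunks = ["big data is big", "data is data"]
# ...each node computes its piece (in parallel on a real cluster)...
partials = [map_phase(c) for c in chunks]
# ...then the partial answers are sent back and combined (reduced).
totals = reduce(reduce_phase, partials)
print(totals["data"])  # 3
```

The whole point is that `map_phase` runs where the chunk's bytes already live. Put the chunks on a SAN and every mapper has to pull its input across the storage network first, which is exactly the cost the architecture was designed to avoid.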
Sure, you can make it not so bad by tacking each server to a different vPath and so on, but you're still defiling your sports car with cheapo ethanol econo-gas, an automatic transmission, and $40 tires. It'll work, but why didn't you just buy a sensible file server or RDBMS and go home?
Unfortunately, core IT doesn't like special cases, so mark my words: conventional thinking about where storage should go presents a key opportunity for EMC and other vendors. Look for appliances that shove bits of Hadoop down to the hardware layer in a hybrid SAN/server setup to come out in the coming year or two. For now, however, stick to your guns and keep your HDFS distributed to local high-performance disks.
This article, "Never, ever do this to Hadoop," was originally published at InfoWorld.com. Keep up on the latest news in application development and read more of Andrew Oliver's Strategic Developer blog at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.


