It's a new world full of shiny toys, but some have sharp edges. Don't hurt yourself or others. Learn to play nice with them.
Yes, you can haz big data. However, you can haz it the right way or the wrong way. Here are the top 10 worst practices to avoid.
1. Choosing MongoDB as your big data platform. Why am I picking on MongoDB? I'm not, but for whatever reason, the NoSQL database most abused at this point is MongoDB. While MongoDB has an aggregation framework that tastes like MapReduce and even a (very poorly documented) Hadoop connector, its sweet spot is as an operational database, not an analytical system.
When your sentence begins, "We will use Mongo to analyze ...," stop right there and think about what you're doing. Sometimes you really mean "collect for later analysis," which might be OK, depending on what you're doing. However, if you really mean you're going to use MongoDB as some kind of sick data-warehousing technology, your project may be doomed at the start.
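To make that concrete, here's a minimal pymongo sketch; the shop database, orders collection, and field names are all made up for illustration. A bounded, index-friendly aggregation like this is Mongo doing what Mongo does well. A nightly full scan over years of history written in the same style is the data-warehouse trap.

```python
# A minimal sketch with pymongo. The database, collection, and fields are
# hypothetical. This is an operational rollup, not analytics at scale.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local instance
orders = client["shop"]["orders"]                  # hypothetical db/collection

# Aggregation framework: top customers by spend among completed orders.
# Bounded, sorted, limited -- MongoDB's sweet spot.
pipeline = [
    {"$match": {"status": "complete"}},            # ideally hits an index
    {"$group": {"_id": "$customer_id", "spend": {"$sum": "$total"}}},
    {"$sort": {"spend": -1}},
    {"$limit": 10},
]
for doc in orders.aggregate(pipeline):
    print(doc["_id"], doc["spend"])
```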
2. Using RDBMS schema as files. Yeah, you dumped each table from your RDBMS into a file. You plan to store that on HDFS. You plan to use Hive on it.
First off, you know Hive is slower than your RDBMS for anything normal, right? It's going to MapReduce even a simple select. Look at the "optimized" route for "table" joins. Next, let's look at row sizes: whaddaya know, you have flat files measured in single-digit kilobytes. Hadoop does best on large sets of relatively flat data. I'm sure you can create an extract that's more denormalized.
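Here's a sketch of what "more denormalized" can mean in practice: pre-join the normalized tables on the RDBMS side and land one wide, flat extract. The sqlite3 connection and the orders/customers/line_items schema below are stand-ins for whatever you actually run.

```python
# A sketch of pre-joining normalized tables into one flat extract before it
# lands on HDFS, so Hive scans one wide file instead of MapReducing joins.
# Table and column names here are hypothetical.
import csv
import sqlite3

conn = sqlite3.connect("erp.db")  # stand-in for your real RDBMS
rows = conn.execute("""
    SELECT o.order_id, o.order_date, c.customer_id, c.region,
           li.sku, li.quantity, li.unit_price
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    JOIN line_items li ON li.order_id = o.order_id
""")

# One wide, flat file: bigger rows, fewer files, no join at query time.
with open("orders_flat.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([d[0] for d in rows.description])  # header from cursor
    writer.writerows(rows)
```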
3. Creating data ponds. On your way to creating a data lake, you took a turn off a different overpass and created a series of data ponds. Conway's law has struck again and you've let each business group create not only its own analysis of the data but also its own mini-repositories. That doesn't sound bad at first, but with different extracts and ways of slicing and dicing the data, you end up with different views of the data. I don't mean flat versus cube; I mean different answers for some of the same questions. Schema-on-read doesn't mean "don't plan at all," but it means "don't plan for every question you might ask."
Nonetheless, you should plan for the big picture. If you sell widgets, there is a good chance someone's going to want to see how many, to whom, and how often you sold widgets. Go ahead and get that in the common formats and do a little up-front design to make sure you don't end up with data ponds and puddles owned by each individual business group.
4. Failing to develop plausible use cases. The idea of the data lake is being sold by vendors to substitute for real use cases. (It's also a way to escape the constraints of departmental funding.) The data-lake approach can be valid, but you should have actual use cases in mind. It isn't hard to come up with them in most midsize to large enterprises. Start by reviewing when someone last said, "No, we can't, because the database can't handle it." Then move on to "duh." For instance, "business development" isn't supposed to be just a titular promotion for your top salesperson; it's supposed to mean something.
What about, say, using Mahout to find customer orders that are common outliers? In most companies, most customer orders resemble each other. But what about the orders that happen often enough but don't match the common ones? These may be too small for salespeople to care about, but they may indicate a future line of business for your company (that is, actual business development). If you can't drum up at least a couple of good real-world uses for Hadoop, maybe you don't need it after all.
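Mahout lives on the JVM, but the idea fits in a few lines of any language. Purely as a stand-in illustration of the same shape, here's a Python sketch with scikit-learn instead of Mahout, assuming you've already encoded each order as a numeric feature vector:

```python
# A stand-in sketch (scikit-learn instead of Mahout) of finding "common
# outliers." order_vectors is placeholder data; in reality each row would be
# a real order encoded as numeric features.
import numpy as np
from sklearn.cluster import KMeans

order_vectors = np.random.rand(10_000, 8)  # placeholder for real features

# Cluster the bulk of orders, then measure each order's distance to its
# nearest cluster center. Most orders sit near a center; the interesting
# ones recur but sit far from every big cluster.
km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(order_vectors)
dist_to_center = km.transform(order_vectors).min(axis=1)

threshold = np.percentile(dist_to_center, 99)  # tune to taste
outliers = np.where(dist_to_center > threshold)[0]
print(f"{len(outliers)} candidate orders worth a human look")
```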
5. Thinking Hive is the be-all, end-all. You know SQL. You like SQL. You've been doing SQL. I get it, man, but maybe you can grow, too? Maybe you should reach deep down a decade or three and remember the young kid who learned SQL and saw the worlds it opened up for him. Now imagine him learning another thing at the same time.
You can be that kid again. You can learn Pig, at least. It won't hurt ... much. Think of it as PL/SQL on steroids with maybe a touch of acid. You can do this! I believe in you! To do a larger bit of analytics, you may need a bigger tool set that may include Hive, Pig, MapReduce, Oozie, and more. Never say, "Hive can't do it, so we can't do it." The whole point of big data is to expand beyond what you could do with one technology.
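Pig Latin is its own language, so to keep the examples here in one language, here's the Hadoop Streaming route to the same "life beyond SQL" point: two plain Python scripts acting as mapper and reducer over a hypothetical CSV of order lines (sku in column 5, quantity in column 6).

```python
# mapper.py -- reads raw CSV lines on stdin, emits key<TAB>value pairs.
# The column positions assume a hypothetical flat order extract.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    if len(fields) >= 6 and fields[4] != "sku":   # skip the header row
        print(f"{fields[4]}\t{fields[5]}")        # sku <TAB> quantity
```

```python
# reducer.py -- streaming hands us lines sorted by key, so we can sum runs.
import sys

current_sku, total = None, 0
for line in sys.stdin:
    sku, qty = line.rstrip("\n").split("\t")
    if sku != current_sku and current_sku is not None:
        print(f"{current_sku}\t{total}")
        total = 0
    current_sku = sku
    total += int(qty)
if current_sku is not None:
    print(f"{current_sku}\t{total}")
```

Submit both with the hadoop-streaming jar (roughly `hadoop jar hadoop-streaming*.jar -mapper mapper.py -reducer reducer.py -input ... -output ...`) and you've done MapReduce without writing a line of Java or SQL.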
6. Treating HBase like an RDBMS. You went nosing around Hadoop and realized indeed there was a database. Maybe you found Cassandra, but most likely you found HBase. Phew, a database: now I don't have to try so hard! Trust me, HDFS-plus-Hive will drain less glucose from your head muscle (IANAD).
The only real commonality between HBase and your RDBMS is that both have something resembling a table. You can do things with HBase that would make your RDBMS's head spin, but the reverse is also true. HBase is good for what HBase is good for, and it is terrible at nearly everything else. If you try to represent your whole RDBMS schema as-is in HBase, you will experience a searing hot migraine that will make your head explode.
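For flavor, here's a sketch with happybase (a Thrift-based Python client for HBase); the table name and key scheme are invented. Note what's doing the work: the row key is designed around one access pattern, which is about as far from "port my schema over" as it gets.

```python
# A sketch with happybase; table name, column family, and row-key scheme are
# hypothetical. In HBase you design the row key around your access pattern,
# not around a normalized schema.
import happybase

conn = happybase.Connection("hbase-thrift-host")  # assumed Thrift gateway
events = conn.table("customer_events")

# Row key = customer id + reversed timestamp, so one cheap prefix scan
# returns a customer's most recent events -- no join, no secondary index.
epoch_ms = 1700000000000           # stand-in for int(time.time() * 1000)
rev_ts = 10**13 - epoch_ms         # reversed timestamp: newest sorts first
events.put(f"cust42#{rev_ts:013d}".encode(),
           {b"e:type": b"order", b"e:total": b"129.99"})

for key, data in events.scan(row_prefix=b"cust42#", limit=10):
    print(key, data[b"e:type"])
```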
7. Installing 100 nodes by hand. Oh my gosh, really? You are going to hand-install Hadoop and all its moving parts on 100 nodes by Tuesday? Nothing beats those hand-rolled bits and bytes, huh? That is all fun and good until someone loses a node and you're hand-rolling those too. At the very least, use Puppet; actually, use Ambari (or your distribution's equivalent) first.
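Ambari will even let you script the whole thing over its REST API with blueprints. A rough sketch; the host name, credentials, and the two JSON files are placeholders you'd supply, and the endpoints follow Ambari's documented /api/v1 layout:

```python
# A sketch of scripting Ambari instead of hand-installing: register a
# blueprint, then instantiate a cluster from it. blueprint.json describes
# which services go on which host group; hostmapping.json binds real hosts.
import json
import requests

AMBARI = "http://ambari-server:8080/api/v1"     # placeholder host
auth = ("admin", "admin")                       # default creds; change them
headers = {"X-Requested-By": "ambari-script"}   # required by Ambari's API

with open("blueprint.json") as f:
    blueprint = json.load(f)
with open("hostmapping.json") as f:
    hostmapping = json.load(f)

# 1. Register the blueprint (the declarative cluster layout).
requests.post(f"{AMBARI}/blueprints/prod", auth=auth, headers=headers,
              data=json.dumps(blueprint)).raise_for_status()

# 2. Create the cluster from it -- Ambari installs and wires up every node.
requests.post(f"{AMBARI}/clusters/prod", auth=auth, headers=headers,
              data=json.dumps(hostmapping)).raise_for_status()
```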
8. RAID/LVM/SAN/VMing your data nodes. Hadoop stripes blocks of data across multiple nodes, and RAID stripes it across multiple disks. Put them together, what do you have? A roaring, low-performing, latent mess. This isn't even turducken; it's like roasting a turkey inside of a turkey. Likewise, LVM is great for internal file systems, but you're not really going to randomly decide all hundred of your data nodes need to be larger, instead of, like, adding a few more data nodes.
And your SAN, your holy SAN: loved by many, I/O bound, and latent to all. You're using HDFS for a higher burst rate, so now you're going to stick everything back in the box? The idea is to scale horizontally; how are you going to do that across the same network pipe to the same box o' disks?
Hey, EMC will sell you more SAN, but maybe you need to think outside the box. VMs are great. However, if you want high-end performance, I/O is king. Fine, you can virtualize the name node and much of the rest of Hadoop, but nothing beats bare metal for data nodes. You can achieve much of the same advantage as virtualization with devops tools. Even most of the cloud vendors are offering metal options.
9. Treating HDFS as just a file system. If you dump stuff onto HDFS, you haven't necessarily accomplished anything. The tooling around it is important, of course. Now you can Hive, Pig, and MapReduce it, but you have to think a bit about what, why, and where you're dumping things onto HDFS. You need to think about how you're going to secure all of this and for whom.
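At minimum, that means a deliberate directory layout with owners and permissions instead of a dumping ground. Here's a sketch driving the stock `hdfs dfs` CLI from Python; the paths, users, groups, and modes are all hypothetical:

```python
# A sketch of the "where, and for whom" part: a deliberate HDFS layout with
# owners and permissions, via the standard `hdfs dfs` commands. Every path,
# user, group, and mode here is a placeholder for your own plan.
import subprocess

def hdfs(*args):
    subprocess.run(["hdfs", "dfs", *args], check=True)

layout = [
    # (path,                 group,      mode)
    ("/data/raw/orders",     "ingest",   "750"),  # writers: ingest jobs only
    ("/data/curated/sales",  "analysts", "750"),  # readers: the analyst group
    ("/data/scratch",        "analysts", "770"),  # shared working space
]

for path, group, mode in layout:
    hdfs("-mkdir", "-p", path)
    hdfs("-chown", "-R", f"etl:{group}", path)
    hdfs("-chmod", "-R", mode, path)
```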
10. Whoo, shiny! Also known as, "today is Thursday, let's move to Spark." Yes, Hadoop is a growing ecosystem, and you want to stay ahead of the curve. I feel you, man, but let's remember that freedom is just another word for nothing left to lose. Once you have real data and real users, you don't have the same amount of freedom as when you had no real responsibility. Now you must have a plan.
Fortunately, you have the tools to manage that evolution and move forward responsibly. Maybe you don't get to deploy this week's cool thing while it is fresh, but you don't have to run Hadoop 1.1 anymore, either. As with any technology, or anything in life, find that moderate path that keeps you from being the last gazelle in the herd or the first lemming off the cliff.
This is the current top 10 I'm seeing in the field. How's your big data project going? What anti-patterns or patterns have you found?