Andrew C. Oliver
Contributing Writer

The 10 worst big data practices

Analysis
Jul 17, 2014 | 8 mins

It's a new world full of shiny toys, but some have sharp edges. Don't hurt yourself or others. Learn to play nice with them

Yes, you can haz big data. However, you can haz it the right way or the wrong way. Here are the top 10 worst practices to avoid.

1. Choosing MongoDB as your big data platform. Why am I picking on MongoDB? I'm not, but for whatever reason, the NoSQL database most abused at this point is MongoDB. While MongoDB has an aggregation framework that tastes like MapReduce and even a (very poorly documented) Hadoop connector, its sweet spot is as an operational database, not an analytical system.

When your sentence begins, "We will use Mongo to analyze …," stop right there and think about what you're doing. Sometimes you really mean "collect for later analysis," which might be OK, depending on what you're doing. However, if you really mean you're going to use MongoDB as some kind of sick data-warehousing technology, your project may be doomed at the start.
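To make the line between "operational" and "analytical" concrete, here's a minimal PyMongo sketch; the collection and field names are invented. A rollup like this sits squarely in Mongo's sweet spot. The mistake is extrapolating from it to "Mongo is our warehouse."

```python
# A hedged sketch, assuming a hypothetical "shop" database with an
# "orders" collection. Mongo's aggregation pipeline handles operational
# rollups like this fine; it is not an analytical platform.
from pymongo import MongoClient

orders = MongoClient("mongodb://localhost:27017").shop.orders

top_customers = orders.aggregate([
    {"$match": {"status": "complete"}},          # operational filter
    {"$group": {"_id": "$customer_id",           # rollup per customer
                "total": {"$sum": "$amount"},
                "order_count": {"$sum": 1}}},
    {"$sort": {"total": -1}},
    {"$limit": 10},
])
for row in top_customers:
    print(row)
```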

2. Using RDBMS schema as files. Yeah, you dumped each table from your RDBMS into a file. You plan to store that on HDFS. You plan to use Hive on it.

First off, you know Hive is slower than your RDBMS for anything normal, right? It's going to MapReduce even a simple select. Look at the "optimized" route for "table" joins. Next, let's look at row sizes. Whaddaya know: you have flat files measured in single-digit kilobytes. Hadoop does best on large sets of relatively flat data. I'm sure you can create an extract that's more denormalized.
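Here's one hedged way to build that extract in Python; the schema, query, and connection are placeholders for whatever your RDBMS actually holds. The joins run where joins are cheap (the RDBMS), and HDFS gets one wide file instead of a pile of kilobyte-sized table dumps.

```python
# A minimal sketch: denormalize in the RDBMS, land one flat file for HDFS.
# Table and column names are hypothetical; conn is any DB-API 2.0 connection
# (e.g., sqlite3.connect(...) or psycopg2.connect(...)).
import csv

DENORMALIZED_EXTRACT = """
SELECT o.order_id, o.order_date, c.customer_name, c.region,
       p.product_name, li.quantity, li.unit_price
FROM orders o
JOIN customers c   ON c.customer_id = o.customer_id
JOIN line_items li ON li.order_id   = o.order_id
JOIN products p    ON p.product_id  = li.product_id
"""

def export_flat(conn, out_path="orders_flat.csv"):
    cur = conn.cursor()
    cur.execute(DENORMALIZED_EXTRACT)
    with open(out_path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow([col[0] for col in cur.description])  # header row
        w.writerows(cur)            # one wide, flat row per line item
    # then: hdfs dfs -put orders_flat.csv /data/sales/
```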

3. Creating data ponds. On your way to creating a data lake, you took a turn off a different overpass and created a series of data ponds. Conway's law has struck again: you've let each business group create not only their own analysis of the data but their own mini-repositories. That doesn't sound bad at first, but with different extracts and different ways of slicing and dicing the data, you end up with different views of the data. I don't mean flat versus cube; I mean different answers to some of the same questions. Schema-on-read doesn't mean "don't plan at all"; it means "don't plan for every question you might ask."

Nonetheless, you should plan for the big picture. If you sell widgets, there is a good chance someone's going to want to see how many, to whom, and how often you sold widgets. Go ahead and get that into common formats, and do a little up-front design to make sure you don't end up with data ponds and puddles owned by each individual business group.
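What "a little up-front design" can look like, in one hedged sketch: agree on a single wide record for widget sales and have every group write that format. The fields below are invented, and fastavro is just one convenient way to pin a schema down.

```python
# A sketch of a shared, agreed-upon format, assuming a hypothetical
# WidgetSale record. Every business group writes this; nobody grows a pond.
from fastavro import parse_schema, writer

WIDGET_SALE = parse_schema({
    "type": "record",
    "name": "WidgetSale",
    "fields": [
        {"name": "sale_id",     "type": "string"},
        {"name": "customer_id", "type": "string"},
        {"name": "widget_sku",  "type": "string"},
        {"name": "quantity",    "type": "int"},
        {"name": "sold_at",     "type": "long"},   # epoch millis
    ],
})

def write_batch(records, path="widget_sales.avro"):
    # records: iterable of dicts matching the schema above
    with open(path, "wb") as out:
        writer(out, WIDGET_SALE, records)
```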

4. Failing to develop plausible use cases. The idea of the data lake is being sold by vendors as a substitute for real use cases. (It's also a way to escape the constraints of departmental funding.) The data-lake approach can be valid, but you should have actual use cases in mind. It isn't hard to come up with them in most midsize to large enterprises. Start by reviewing when someone last said, "No, we can't, because the database can't handle it." Then move on to the "duh" ideas. For instance, "business development" isn't supposed to be just a titular promotion for your top salesperson; it's supposed to mean something.

What about, say, using Mahout to find customer orders that are common outliers? In most companies, most customer orders resemble each other. But what about the orders that happen often enough yet don't match the common ones? These may be too small for salespeople to care about, but they may indicate a future line of business for your company (that is, actual business development). If you can't drum up at least a couple of good real-world uses for Hadoop, maybe you don't need it after all.
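Mahout is the Hadoop-native way to run this at scale; purely to show the shape of the idea, here's a small NumPy sketch. The features, cutoff, and the k-means step that produces the centroids are all assumptions.

```python
# Hypothetical sketch: flag orders that sit far from every "common" cluster
# yet recur often enough to matter. In practice, Mahout's clustering would
# produce the centroids over the full data set; NumPy here is illustrative.
import numpy as np

def recurring_outliers(orders, centroids, dist_cutoff, min_count):
    """orders: (n, d) order-feature matrix; centroids: (k, d) from k-means."""
    # distance from each order to its nearest common cluster
    dists = np.linalg.norm(orders[:, None, :] - centroids[None, :, :], axis=2)
    nearest = dists.min(axis=1)
    outliers = orders[nearest > dist_cutoff]
    # keep only the outlying patterns that repeat: those are the interesting ones
    uniq, counts = np.unique(np.round(outliers, 1), axis=0, return_counts=True)
    return uniq[counts >= min_count]
```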

5. Thinking Hive is the be-all, end-all. You know SQL. You like SQL. You've been doing SQL. I get it, man, but maybe you can grow, too? Maybe you should reach deep down a decade or three and remember the young kid who learned SQL and saw the worlds it opened up for him. Now imagine him learning another thing at the same time.

You can be that kid again. You can learn Pig, at least. It won't hurt … much. Think of it as PL/SQL on steroids with maybe a touch of acid. You can do this! I believe in you! To do a larger bit of analytics, you may need a bigger tool set that includes Hive, Pig, MapReduce, Oozie, and more. Never say, "Hive can't do it, so we can't do it." The whole point of big data is to expand beyond what you could do with one technology.
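For a taste of that bigger tool set, here's the kind of GROUP BY you'd write in HiveQL, expressed instead as a hedged Hadoop Streaming job in Python. The tab-separated input format and field positions are assumptions; Pig Latin would say it in about three lines, which is rather the point.

```python
#!/usr/bin/env python
# Sketch of a streaming job: sum order amounts per customer.
# Assumes tab-separated "customer<TAB>amount" input lines.
import sys
from itertools import groupby

def mapper():
    for line in sys.stdin:
        customer, amount = line.rstrip("\n").split("\t")[:2]
        print(f"{customer}\t{amount}")        # emit key<TAB>value

def reducer():
    # Hadoop sorts map output by key, so groupby sees each key contiguously
    rows = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for customer, group in groupby(rows, key=lambda r: r[0]):
        print(f"{customer}\t{sum(float(amt) for _, amt in group)}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

Run it with your distribution's hadoop-streaming JAR, passing this script as both the mapper ("job.py map") and the reducer ("job.py reduce"); exact paths and options vary by distribution.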

6. Treating HBase like an RDBMS. You went nosing around Hadoop and realized indeed there was a database. Maybe you found Cassandra, but most likely you found HBase. Phew, a database: now I don't have to try so hard! Trust me, HDFS-plus-Hive will drain less glucose from your head muscle (IANAD).

The only real commonality between HBase and your RDBMS is that both have something resembling a table. You can do things with HBase that would make your RDBMS's head spin, but the reverse is also true. HBase is good for what HBase is good for, and it is terrible at nearly everything else. If you try to represent your whole RDBMS schema as-is in HBase, you will experience a searing hot migraine that will make your head explode.
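To make "good for what HBase is good for" concrete, here's a minimal happybase sketch; the table, column family, and key layout are invented. The composite row key is designed around exactly one read pattern, which is how HBase wants to be used and precisely what a ported relational schema doesn't give you.

```python
# A hedged sketch, assuming a hypothetical "user_events" table with an "e"
# column family, reached through an HBase Thrift gateway.
import happybase

conn = happybase.Connection("hbase-thrift-host")
events = conn.table("user_events")

# Composite key: user id + reversed timestamp, so newest events sort first.
# (Real keys would be zero-padded to a fixed width to keep sort order sane.)
ts = 1405555555000
row_key = b"user42:" + str(2**63 - 1 - ts).encode()
events.put(row_key, {b"e:page": b"/checkout", b"e:status": b"200"})

# The one query this key design serves: recent events for a single user.
for key, data in events.scan(row_prefix=b"user42:", limit=10):
    print(key, data)
```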

7. Installing 100 nodes by hand. Oh my gosh, really? You are going to hand-install Hadoop and all its moving parts on 100 nodes by Tuesday? Nothing beats those hand-rolled bits and bytes, huh? That is all fun and good until someone loses a node and you're hand-rolling those, too. At the very least, use Puppet; actually, use Ambari (or your distribution's equivalent) first.
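For flavor, here's a hedged sketch of Ambari's blueprint API driven from Python: register the cluster layout once, then stamp it onto hosts. The stack version, component lists, host names, and credentials are all placeholders, and details vary by Ambari release.

```python
# Assumption-laden sketch: define the cluster as data, not as 100 SSH sessions.
import requests

AMBARI = "http://ambari-server:8080/api/v1"
AUTH = ("admin", "admin")                    # placeholder credentials
HDRS = {"X-Requested-By": "ambari"}          # header Ambari requires on POSTs

blueprint = {
    "Blueprints": {"stack_name": "HDP", "stack_version": "2.1"},
    "host_groups": [
        {"name": "masters", "cardinality": "1",
         "components": [{"name": "NAMENODE"}, {"name": "RESOURCEMANAGER"}]},
        {"name": "workers", "cardinality": "100",
         "components": [{"name": "DATANODE"}, {"name": "NODEMANAGER"}]},
    ],
}
requests.post(f"{AMBARI}/blueprints/prod", json=blueprint,
              auth=AUTH, headers=HDRS)

cluster = {
    "blueprint": "prod",
    "host_groups": [
        {"name": "masters", "hosts": [{"fqdn": "m1.example.com"}]},
        {"name": "workers",
         "hosts": [{"fqdn": f"w{i}.example.com"} for i in range(100)]},
    ],
}
requests.post(f"{AMBARI}/clusters/prod", json=cluster,
              auth=AUTH, headers=HDRS)
```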

8. RAID/LVM/SAN/VMing your data nodes. Hadoop stripes blocks of data across multiple nodes, and RAID stripes it across multiple disks. Put them together, what do you have? A roaring, low-performing, latent mess. This isn't even turducken; it's like roasting a turkey inside of a turkey. Likewise, LVM is great for internal file systems, but you're not really going to randomly decide all hundred of your data nodes need to be larger, instead of, like, adding a few more data nodes.

And your SAN, your holy SAN: loved by many, I/O bound, and latent to all. You're using HDFS for a higher burst rate, so now you're going to stick everything back in the box? The idea is to scale horizontally; how are you going to do that across the same network pipe to the same box o' disks?

Hey, EMC will sell you more SAN, but maybe you need to think outside the box. VMs are great. However, if you want high-end performance, I/O is king. Fine, you can virtualize the name node and much of the rest of Hadoop, but nothing beats bare metal for data nodes. You can achieve much of the same advantage as virtualization with devops tools. Even most of the cloud vendors are offering metal options.

9. Treating HDFS as just a file system. If you dump stuff onto HDFS, you haven't necessarily accomplished anything. The tooling around it is important, of course. Now you can Hive, Pig, and MapReduce it, but you have to think a bit about what, why, and where you're dumping things onto HDFS. You need to think about how you're going to secure all of this and for whom.
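One small sketch of what that thinking can look like before the dumping starts, using the stock hdfs dfs CLI from Python. The layout, groups, and modes below are invented; the point is that landing zones, curated zones, and who can read what get decided up front.

```python
# Hypothetical layout: raw landing zone, curated zone, analyst scratch space.
import subprocess

def hdfs(*args):
    subprocess.run(["hdfs", "dfs", *args], check=True)

for path, group, mode in [
    ("/data/raw/sales",     "ingest",   "750"),  # only the pipeline writes here
    ("/data/curated/sales", "analysts", "750"),  # cleaned, wide extracts
    ("/user/analytics/tmp", "analysts", "770"),  # scratch, safe to purge
]:
    hdfs("-mkdir", "-p", path)
    hdfs("-chgrp", "-R", group, path)
    hdfs("-chmod", "-R", mode, path)
```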

10. Whoo, shiny! Also known as, "today is Thursday, let's move to Spark." Yes, Hadoop is a growing ecosystem, and you want to stay ahead of the curve. I feel you, man, but let's remember that freedom is just another word for nothing left to lose. Once you have real data and real users, you don't have the same amount of freedom as when you had no real responsibility. Now you must have a plan.

Fortunately, you have the tools to manage that evolution and move forward responsibly. Maybe you don't get to deploy this week's cool thing while it is fresh, but you don't have to run Hadoop 1.1 anymore, either. As with any technology, or anything in life, find that moderate path that prevents you from being the last gazelle in the herd or the first lemming off the cliff.

This is the current top 10 I'm seeing in the field. How's your big data project going? What anti-patterns or patterns have you found?
