Spark has dethroned MapReduce and changed big data forever, but that rapid ascent has been accompanied by persistent frustrations
Apache Spark is the word. OK, technically that's two, but it's clear that in the last year the big data processing platform has come into its own, with heavyweights like Cloudera and IBM throwing their weight and resources behind the project as we gradually say farewell to MapReduce.
We've all seen the Spark demonstrations where people write word count applications in fewer than 10 lines of code. But if you've actually dived into Spark with abandon, you might have discovered that once you start working on something larger than toy problems, some of the sheen comes off.
Yes, Spark is amazing, but it's not quite as simple as writing a few lines of Scala and walking away. Here are five of the biggest bugbears when using Spark in production:
1. Memory issues
No, I'm not talking about the perennial issue of Spark running out of heap space in the middle of processing a large amount of data. (Project Tungsten, one of Databricks' main areas of focus in Spark 1.5 and the upcoming 1.6, does a lot here to finally relieve us from the scourge of garbage collection.) I'm talking about the myriad other memory issues you'll come across when working at scale.
It might simply be the whiplash you get when switching from using Spark in Standalone cluster mode for months to YARN or Mesos, and discovering that all the defaults change. For example, instead of grabbing all available memory and cores automatically as Standalone does, the other deployment options give you terrifyingly tiny defaults for your executors and driver. It's easy to fix, but you'll forget at least once when spinning up your job, I'll bet.
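If it helps jog the memory, here's the sort of thing you end up spelling out explicitly on YARN or Mesos. This is only a sketch: the application name and every number below are placeholders, and the right sizes depend entirely on your cluster and workload.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative sizing only; tune every value to your own cluster.
val conf = new SparkConf()
  .setAppName("my-production-job")
  .set("spark.executor.memory", "8g")     // the non-Standalone defaults give you a tiny executor heap
  .set("spark.executor.cores", "4")
  .set("spark.executor.instances", "20")  // YARN-specific; other cluster managers ignore it
  .set("spark.driver.memory", "4g")       // in client mode this really needs to go on spark-submit instead

val sc = new SparkContext(conf)
```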
When you move beyond demos and into large data sets, you'll end up blowing up your Spark job because the reduceByKey operation you do on the 1.8TB set exceeds the default in spark.driver.maxResultSize (1GB, if you were wondering). Or maybe you're running enough parallel tasks that you run into the 128MB limit in spark.akka.frameSize.
These are fixable by altering the configuration, and Spark does a lot better these days about pointing them out in the logs, but it means the "smartphone of data" (as Denny Lee of Databricks described Spark earlier this week) requires lots of trial and error, which is problematic for potentially long-running batch jobs. Spark also demands arcane knowledge of configuration options. That's great for consultants, not so much for everybody else.
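For what it's worth, both of those limits are plain configuration properties, so the fix itself is only a couple of lines, something like the sketch below. The values are examples, not recommendations; picking the right ones is exactly the trial and error I'm grumbling about.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("big-reduce-job")              // placeholder name
  .set("spark.driver.maxResultSize", "4g")   // default is 1g
  .set("spark.akka.frameSize", "256")        // in MB; default is 128
val sc = new SparkContext(conf)
```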
2. The small files problem … again
If you've done any work with Hadoop, you've probably heard people complaining about the small-files problem, which refers to the way HDFS prefers to devour a limited number of large files rather than a large number of small files. If you use Spark with HDFS, you'll run into this issue. But there's another modern pattern where this is lurking, and you might not realize it until it hits you:
"Yeah, so we store all the data gzipped in S3."
This is a great pattern! Except when it's lots of small gzipped files. In that case, not only does Spark have to pull those files over the network, it also has to uncompress them. Because a gzipped file can be uncompressed only if you have the entire file on one core, your executors are going to spend a lot of time simply burning their cores unzipping files in sequence.
To make matters worse, each file then becomes one partition in the resulting RDD, meaning you can easily end up with an RDD with more than a million tiny partitions. (RDD stands for "resilient distributed dataset," the basic abstraction in Spark.) In order to not destroy your processing efficiency, you'll need to repartition that RDD into something more manageable, which will require lots of expensive shuffling over the network.
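In code, that looks something like the sketch below (assuming a SparkContext called sc is already in scope); the bucket, path, and target partition count are all made up for illustration.

```scala
// Each gzipped file becomes exactly one partition, however small it is.
val raw = sc.textFile("s3n://some-bucket/events/*.gz")

println(raw.partitions.length)  // easily hundreds of thousands of tiny partitions

// Shuffle the data down to something the cluster can schedule sensibly;
// a small multiple of the total executor cores is a common starting point.
val events = raw.repartition(2000)
```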
There's not a lot Spark can do here. The best fix is to get the data compressed in a different, splittable format (for example, LZO) and/or to investigate if you can increase the size and reduce the number of files in S3 somehow.
3. Spark Streaming
Ah, Spark Streaming, the infamous extension to the Spark API. It's the whale that turns many a developer into Ahab, forever doomed to wander the corridors muttering "if only I can work out the optimal blockInterval, then my pipeline will stay up!" to themselves with a faded glint in their eye.
Now, it's incredibly easy to stand up a streaming solution with Spark. We've all seen the demos. However, getting a resilient pipeline that can operate at scale 24/7 can be a very different matter, often leading you down into some very deep debugging wells. Again, to Spark's credit, each release is making this easier, with more information made available at the Spark UI level, direct receivers, ways of dealing with back-pressure, and so on. But it's still not quite as simple as all those conference presentations would make it look.
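For reference, the demo version really is this small, which is part of the trap. The sketch below is minimal on purpose: the host, port, batch interval, and tuning values are placeholders rather than recommendations, and the back-pressure setting assumes Spark 1.5 or later.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("demo-stream")
  .set("spark.streaming.blockInterval", "200ms")        // governs how many tasks each batch turns into
  .set("spark.streaming.backpressure.enabled", "true")  // back-pressure support arrived in 1.5

val ssc = new StreamingContext(conf, Seconds(2))        // the batch interval
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```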
If you're looking for some help debugging your Spark Streaming pipeline, or deciding when you should consider switching to Apache Storm instead, check out these two talks I recently gave: An Introduction to Apache Spark and Spark & Storm: When & Where?
4. Python
Before all you Python fans get out your pitchforks: I like Python! I don't mean to start a programming language war. Honest! But unless there's a pressing need to use Python, I normally recommend that people write their Spark applications in Scala.
There are two main reasons for this. First, if you follow Spark development, you'll soon see that every release brings something new to the Scala/Java side of things and updates the Python APIs to include something that wasn't exposed previously (this is true to an even greater extent with the SparkR bindings). You will always be at least a step or two behind what is possible on the platform. Second, if you're using a pure RDD approach in writing your application, Python is almost always going to be slower than a Java or Scala equivalent. Use Scala! Embrace the type-safety!
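If you want a concrete (if toy) illustration of what the type-safety buys you, here's a sketch, again assuming a SparkContext called sc; the case class and field names are invented for the example.

```scala
case class Purchase(userId: String, amount: Double)

val purchases = sc.parallelize(Seq(
  Purchase("alice", 12.50),
  Purchase("bob", 7.99),
  Purchase("alice", 3.25)
))

// RDD[(String, Double)] all the way through, checked by the compiler.
val totalPerUser = purchases
  .map(p => (p.userId, p.amount))
  .reduceByKey(_ + _)

// totalPerUser.map(t => t.amount)  // won't compile: a (String, Double) tuple has no 'amount' field
```

The commented-out line is the kind of slip that a pure-Python RDD job would only surface at runtime, potentially hours in.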
But if you need things in numpy or scikit-learn that simply aren't in Spark, then yes, Python definitely becomes a viable option again, as long as you don't mind being a little behind the Spark API curve. Hey, back off with that pitchfork.
5. Random crazy errors
What sort of crazy errors? For instance, on a recent engagement, I had a Spark job that had been working fine for over a week. Then, out of nowhere, it stopped. The executor logs were full of entries that pointed to compression/decompression errors during the shuffle stages.
There's an open ticket in Spark's Jira that blames this on the Snappy compression scheme used during the shuffles. Oh, and the ticket points out it's intermittent.
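Switching codec is at least only a configuration change, and this is roughly the sort of thing I was cycling through. The application name is a placeholder, and the codec values shown are the standard options rather than a recommendation.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-codec-experiment")     // placeholder name
  .set("spark.io.compression.codec", "lz4")   // or "lzf"; "snappy" is the default being blamed here
  // .set("spark.shuffle.compress", "false")  // the nuclear option: no shuffle compression at all
val sc = new SparkContext(conf)
```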
I flipped to a different codec and all was fine, until the next morning, whereupon I got similar errors. I then spent that day flipping between different shuffle codecs and even turning off compression entirely, but to no avail. Eventually, I tracked the issue down to the interaction of Spark's network transport system (Netty) with the Xen hypervisor and the version of the Linux kernel we were using on our AWS instance (say that three times fast). The fix turned out to be setting a flag on the Xen network drivers, and everything magically worked like nothing had ever been wrong. It was a very frustrating experience, but at least it had a happy ending.
The moral of these tales? Although Spark makes it easy to write and run complicated data processing tasks at a very large scale, you still need experience and knowledge of everything from the implementation language down to the kernel when you start operating at scale and things go awry. I know, because a significant part of my business is devoted to helping people out of these kinds of jams.


