Spark is the hottest project in big data -- but Databricks, the company behind it, needs to ensure its implementation has a plausible path to maturity
Spark is on the ascent in the big data world, and rightfully so. It’s faster than MapReduce by far, and with its SQL interface, it’s faster than Hive. Though operationally different from both, Spark can replace either in many instances.
The company behind Spark, Databricks, hopes to carve out a niche for itself in the big data world. Yet all of the major Hadoop vendors have announced support for Spark as well. At the recent Spark Summit East, I asked Databricks’ head of customer engagement, Arsalan Tavakoli, how the company plans to compete:
It is really two different segments. I think the Hadoop ecosystem is alive and kicking. Hortonworks, MapR, Cloudera are all very focused in the on-premise world. We don’t have a distribution of Spark in the on-premise world. Actually, all of those guys leverage Databricks for their L2, L3 support for Spark. When they go to a customer and sell Spark support, they rely on our expertise because we have the core braintrust around that.
This is a rosy, if not well-rehearsed, answer to the question, but the truth is more complicated. Paco Nathan, Databricks’ director of community engagement, made several unfavorable references to Hadoop during a Databricks Cloud training session at Spark Summit East. He said he saw several companies “jumping over Hadoop” and “skipping the big Yarn deploy” to go straight to Spark. He went further, saying that Hadoop would be over in a few years.
What does Databricks sell exactly?
According to Tavakoli, “When we built the company, we said two things: Our focus is, one, entirely on the cloud, and two, it’s about something broader than ‘here is an open source product and we’re going to wrap some professional services around it.’”
Translation: The company has a cloud-based, Spark-based platform that uses the concept of a “notebook” in which you write both markup and code in what amounts to a Web page, then “execute” the notebook across the cluster. It looks like Interwoven TeamSite (an old, fat CMS) ate IPython Notebook but forgot about security.
[Figure, credit Databricks: The Databricks Cloud uses the metaphor of a notebook, which amounts to a Web page containing markup and code that you “execute” across a cluster.]
You can embed HTML, SQL, Python, and Scala in a notebook, then store the notebook in a folder. You can’t, however, secure a folder or notebook, which was demonstrated comically during introductory training at the conference. Someone didn’t pay close attention to the instructions; rather than copy the course material to their own folder, they edited the instructor’s copy and introduced garbage, so that only we “advanced students” could complete the lesson.
Your notebooks stay in the cloud and are executed across it. According to Tavakoli, unlike a typical multitenant SaaS architecture, Databricks deploys as a fully managed service inside a virtual private cloud. The product currently runs on Amazon, but it will be available on other clouds.
The product is far from mature. During the training, I watched the product throw a stack trace. It also had a really annoying habit of saying your page was executing, only to hang and fail to return the results, so you had to refresh. Admittedly, this might have been due to the crappy hotel Wi-Fi, but if so, the page should notice a bum connection, which didn’t always seem to be the case. Folder permissions, version control, and other “I’m not working with one other person” features are missing today, and they are going to be essential if the Databricks Cloud is to reach the company’s sales targets.
Looking ahead
The company sees “solutions” as the future. Everyone is supposed to say that even if they’re a platform company. According to Tavakoli:
You don’t want to just say, hey it’s great, I got a big data platform and deployed a BI tool and ETL. I deployed these things that I [ascribe] real business value to. That’s something that I feel really hindered the Hadoop ecosystem and big data so far. Our goal is to get more and more to those solutions, but do it in a way that is more productized and automated rather than you brought an army of 1,000 consultants to build you a custom solution so you could only do one or two.
This is a long way from the product Databricks has today. The Databricks Cloud is really a platform for great mathematicians who can do crappy coding or people who have more love for Python than sense. It is far from the Tableau of data science.
By Tavakoli’s math, with the company’s 3,500-person waiting list and his estimate of maybe 1,000 to 1,400 paid Hadoop installations worldwide, the future is bright. But a waiting list and dollars aren’t the same. Moreover, as a strategy, Databricks is counting on two things for now: that it hired the brains behind Spark, who are all tied together by academic relationships at MIT and Berkeley, and that everyone plays nice.
The first is indeed a challenge. The second inevitably falls apart as soon as Hortonworks or Cloudera loses a big deal and calculates that coming up with its own “notebook” and building its own Spark team is a better solution than relying on Databricks. Meanwhile, Google has Dataflow (which competes with part of the Databricks product) and Google Docs. If Databricks gets traction, why not put the two together and compete directly?
The real question is the viability of the “solutions” vision, where a marketing manager can use machine learning against a big data cluster without becoming a mathematician. If that is the dream, is the most appropriate commercial entry into the market really a tool that lets mainly Python developers embed code into an HTML page and execute it across the cluster?
I think it is clear that Spark will do well. It’s also possible that Databricks Cloud will grab a decent niche market, but I’ll be watching closely for a pivot in this company’s future.


