Mahout is a vibrant machine learning project that is now riding Spark instead of MapReduce for the algorithmically inclined
My tough life required me to fly to Miami and attend ApacheCon. I happened across a talk by Trevor Grant, an open source technical evangelist for the financial services sector, on Mahout. I thought, "Wait, isn't Mahout dead?" Apparently not. In fact, Mahout is very much alive, nothing like what you once knew of it, and now running on GPUs.
Mahout was the original machine learning framework for Hadoop. When MapReduce was the thing, Mahout was the vaunted elephant rider. But then, as Grant recalls, "Mahout 0.09 released and all the Hadoop vendors froze at 0.09+. It was 0.09 with some bug patches. No one ever bumped up to 0.10."
Nonetheless, the Mahout project is still active. "A lot of the projects have people paid to work on them, but Mahout doesn't. We're like a bunch of gypsies that wander around in companies like the MapRs of the world," Grant says. "All the Mahout and former Mahout people are in very, very high places in Fortune 500 companies or CTOs of startups, but we don't have a company of our own. Lucidworks is the closest thing. I didn't realize but there are a lot of Mahout committers and PMCs [project management committees] kind of lurking about at Lucidworks." (Full disclosure: I didn't realize this either, even though I work for Lucidworks. –AO.)
The advantages of the Mahout you don't know
Under the guidance of those "gypsies," Mahout developed some unique advantages. First, it was made engine-neutral. Although Spark is the recommended engine, Mahout supports other engines and bindings to your own favored engine.
"GPU integration is the other big huzzah, the big sexy thing that we've got going on right now," Grant says. "You can accelerate Spark, Flink, or any JVM-distributed engine; you get GPU acceleration for free. This is a big win."
Unlike other tools, "Mahout is primarily about writing your own algorithms quickly and efficiently and mathematically expressively so you can read and other people can see what you've done, and the code makes sense," Grant says. If MLlib has exactly what you're looking for, great. But if it doesn't, you'll find it difficult to extend. On the other hand, while prototyping your algorithms in Python or R, you may find that Python isn't so great in production. Mahout gives you Scala that you write with paradigms familiar from R or Python.
Grant says Mahout's "quintessential use case is that you read an academic journal article on Monday morning. You spend Monday afternoon grokking it and how it works. On Tuesday, you open up Mahout and start implementing the algorithm. By Tuesday afternoon, the algorithm is working, and you're testing and making sure it works the way you think it is going to. On Wednesday morning, you're writing docs and unit tests. On Wednesday afternoon, you have an algorithm in production."
Another advantage of Mahout is its integration with Zeppelin, which also lets you use R and Python visualization tools like ggplot2 or pyplot rather than rolling your own visualization. If you're playing with your data and algorithms, having visualization tools available rather than starting from scratch in Scala is important.
An example of what Mahout is really good at
If you're starting out and looking to learn, Mahout has a few interesting "hello world"-style tutorials. Grant says that "the 'hello world' of Mahout is ordinary least-squares regression. It's an algorithm, but it is still a fairly simple one. It's easy and well documented. In maybe six to 12 lines of code you can implement ordinary least-squares regression in Mahout."
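To see why that fits in so few lines, it helps to know what those lines compute: ordinary least squares is just the normal equations, beta = (XᵀX)⁻¹Xᵀy. The sketch below works that out in plain, dependency-free Python as an illustration of the math; it is not Mahout's API (Mahout's Samsara DSL would express the same products as distributed matrix operations, e.g. `X.t %*% X`), and the toy data set is invented for this example.

```python
# Plain-Python sketch of ordinary least-squares regression via the
# normal equations: beta = (X^T X)^{-1} X^T y. Illustrative only; the
# data and helper functions here are hypothetical, not Mahout code.

def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def solve(a, b):
    """Gauss-Jordan elimination for the small square system A * beta = b."""
    n = len(a)
    aug = [row[:] + brow[:] for row, brow in zip(a, b)]
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(aug[r][i]))
        aug[i], aug[pivot] = aug[pivot], aug[i]
        for r in range(n):
            if r != i:
                f = aug[r][i] / aug[i][i]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[i])]
    return [[aug[i][n + j] / aug[i][i] for j in range(len(b[0]))] for i in range(n)]

# Toy data: y = 2*x1 + 3*x2 exactly, with a leading bias column of ones.
X = [[1.0, 1.0, 2.0], [1.0, 2.0, 1.0], [1.0, 3.0, 4.0], [1.0, 4.0, 3.0]]
y = [[8.0], [7.0], [18.0], [17.0]]

XtX = matmul(transpose(X), X)   # Samsara would write X.t %*% X
Xty = matmul(transpose(X), y)   # ...and X.t %*% y
beta = solve(XtX, Xty)          # recovers [bias ~0, 2.0, 3.0]
```

In Mahout the X and y above would be columns of a distributed row matrix, and the transpose-times products run on the engine (Spark, Flink) with optional GPU acceleration, but the algebra is the same.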
But once you've gotten that far, Mahout has "another really good one that you've probably seen elsewhere: an alternating least-squares (ALS)-based recommender tutorial. The problem with ALS is that it is single-modal. It's [based on] ratings, and you can make an adjustment on similar ratings from another person with a similar rating," Grant says. "That's great, except in the real world you have a lot more information, like user profile data, age, gender, and viewing habits. You're throwing a lot of that out when you're single-modal.
"ALS is just a matrix factorization, and we definitely have something to do that matrix factorization. But we also have correlated co-occurrence algorithms that are multimodal. So, for example, they all need to have the same user space, but let's say your primary action is buying a product, and we also have information about page views. You viewed a bunch of products and added products to cart. There are a bunch of things that are product-focused, but we also have your gender, location, total lifetime buy, and favorite color or whatever. [From all that, Mahout] will generate recommendations about correlated co-occurrences, and you're capturing all of that much richer set of information."
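The multimodal idea Grant describes can be sketched with plain counting: besides "people who bought X also bought Y," a cross-occurrence recommender also counts "people who viewed X went on to buy Y," so a secondary action feeds the same primary recommendation. The sketch below is a minimal illustration under that assumption; the users, items, and `cross_occurrence` helper are hypothetical, and Mahout's real correlated co-occurrence implementation additionally filters these raw counts with a log-likelihood-ratio test on distributed matrices.

```python
# Hypothetical sketch of cross-occurrence counting, the core idea behind
# a multimodal (CCO-style) recommender. Raw counts only; Mahout's actual
# implementation scores pairs with a log-likelihood-ratio test.
from collections import defaultdict
from itertools import product

buys = {                       # user -> items bought (primary action)
    "alice": {"book", "lamp"},
    "bob":   {"book", "mug"},
    "carol": {"lamp"},
}
views = {                      # user -> items viewed (secondary action)
    "alice": {"mug", "desk"},
    "bob":   {"desk"},
    "carol": {"mug"},
}

def cross_occurrence(primary, secondary):
    """counts[(viewed_item, bought_item)] = number of users with both events.

    Both action histories must share the same user space, as Grant notes.
    """
    counts = defaultdict(int)
    for user, bought in primary.items():
        for v, b in product(secondary.get(user, ()), bought):
            counts[(v, b)] += 1
    return dict(counts)

cooc = cross_occurrence(buys, views)
# e.g. cooc[("desk", "book")] == 2: both users who viewed the desk bought the book,
# so viewing a desk becomes evidence for recommending the book.
```

Profile attributes like gender or location can be folded in the same way, by treating each attribute value as one more "item" in a secondary action matrix.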
If you might port engines one day (Spark isn't forever) or need to write your own or tweak your algorithms and want GPU acceleration, and you want to do this in a maintainable way at scale, maybe Mahout is your easy rider. However, you'll need to download your own copy rather than use the rusty one in your favorite Hadoop distribution. (The Mahout people would like you to forget they ever knew what MapReduce is.)
Mahout isn't dead; it is a vibrant project that is now riding Spark instead of MapReduce. As a developer, I'll probably wait around until someone else writes all the algorithms. But if I were more "mathy," I'd be taking a hard look at Mahout right now.