A brief introduction to Spark MLlib's APIs for basic statistics, classification, clustering, and collaborative filtering, and what they can do for you
You're not a data scientist. According to the tech and business press, machine learning will stop global warming, except that's apparently fake news created by China. Maybe machine learning can find fake news (a classification problem)? In fact, maybe it can.
But what can machine learning do for you? And how will you find out? There's a good place to start close to home, if you're already using Apache Spark for batch and stream processing. Along with Spark SQL and Spark Streaming, which you're probably already using, Spark provides MLlib, which is, among other things, a library of machine learning and statistical algorithms in API form.
Here is a brief guide to four of the most essential MLlib APIs, what they do, and how you might use them.
Basic statistics
Mainly you'll use these APIs for A-B testing or A-B-C testing. Frequently in business we assume that if two averages are the same, then the two things are roughly equivalent. That isn't necessarily true. Consider a car manufacturer that replaces the seat in a car and surveys customers on how comfortable it is. At one end, shorter customers may say the seat is much more comfortable. At the other end, taller customers will say it is so uncomfortable that they wouldn't buy the car, and the people in the middle balance out the difference. On average the new seat might be slightly more comfortable, but if no one over 6 feet tall buys the car anymore, we've failed somehow. Spark's hypothesis testing lets you run a Pearson chi-squared or a Kolmogorov-Smirnov test to see how well something "fits" or whether the distribution of values is "normal." This can be used most anywhere you have two series of data. That "fit" might be "did you like it," or whether the new algorithm provides "better" results than the old one. You're just in time to enroll in a Basic Statistics course on Coursera.
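To make the seat example concrete, here is a minimal pure-Python sketch of the Pearson chi-squared statistic, the same quantity MLlib's chi-squared test computes under the hood. The survey counts below are invented for illustration:

```python
# Pearson chi-squared statistic: how far observed counts stray from expected.
# The seat-survey numbers are made up for illustration.

def chi_squared(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Comfort ratings, from "very uncomfortable" to "very comfortable"
old_seat = [10, 20, 40, 20, 10]   # expected distribution (old seat)
new_seat = [25, 10, 30, 10, 25]   # observed distribution (new seat)

print(chi_squared(new_seat, old_seat))  # -> 57.5
```

Both distributions have the same mean rating, but the large statistic shows they are shaped very differently, which is exactly the tall-customer problem an average would hide.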
Classification
What are you? If you take a set of attributes, you can get the computer to sort "things" into their right category. The trick is coming up with attributes that match the "class," and there is no single right answer there. There are a lot of wrong answers. Think of someone looking through a stack of forms and sorting them into categories; that's classification. You've run into this with spam filters, which use a list of words spam usually contains. You may also be able to diagnose patients or determine which customers are likely to cancel their broadcast cable subscription (people who don't watch live sports). Essentially, classification "learns" to label things based on the labels applied to past data and can apply those labels in the future. In Coursera's Machine Learning Specialization there is a course specifically on this that started on July 10, but I'm sure you can still get in.
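As a toy illustration of the spam-filter idea (not MLlib's actual API, which provides this as the NaiveBayes classifier), here is a word-count classifier in plain Python; the messages and labels are invented:

```python
# A tiny naive Bayes text classifier: learn word counts per label from
# past data, then label new text. Messages below are made up.

from collections import Counter
import math

def train(labeled_docs):
    """Count words per class and documents per class."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for label, text in labeled_docs:
        counts[label].update(text.lower().split())
        totals[label] += 1
    return counts, totals

def classify(counts, totals, text):
    """Pick the label with the highest log posterior (Laplace smoothing)."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    best, best_score = None, float("-inf")
    for label in counts:
        score = math.log(totals[label] / sum(totals.values()))  # log prior
        denom = sum(counts[label].values()) + len(vocab)
        for w in text.lower().split():
            score += math.log((counts[label][w] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

docs = [("spam", "free money win prize"),
        ("spam", "win free offer now"),
        ("ham", "meeting notes attached"),
        ("ham", "lunch at noon tomorrow")]
counts, totals = train(docs)
print(classify(counts, totals, "win free prize"))  # -> spam
```

The point is the shape of the problem: labeled past data in, labels for future data out.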
Clustering
If k-means clustering is the only thing out of someone's mouth after you ask them about machine learning, you know that they just read the crib sheet and don't know anything about it. If you take a set of attributes, you may find "groups" of points that seem to be pulled together by gravity. Those are clusters. You can "see" these clusters, but there may be clusters that are close together. There may be one big one with a small one off to the side. There may be smaller clusters inside the big cluster. Because of these and other complexities, there are a lot of different "clustering" algorithms. Though different from classification, clustering is often used to sort people into groups. The big difference between "clustering" and "classification" is that with clustering we don't know the labels (or groups) up front; with classification we do. Customer segmentation is a very common use. There are different flavors of that, such as sorting customers into credit or retention risk groups, or into buying groups (fresh produce or prepared foods), but it is also used for things like fraud detection. Here's a course on Coursera with a lecture series specifically on clustering, and yes, they cover k-means for that next interview, but I find it slightly creepy when half the professor floats over the board (you'll see what I mean).
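Since everyone will ask about k-means anyway, here is a bare-bones one-dimensional sketch of it in plain Python (MLlib exposes a full implementation as KMeans); the data points and starting centers are made up:

```python
# Minimal 1-D k-means (Lloyd's algorithm): assign each point to its
# nearest center, then move each center to the mean of its points.

def kmeans(points, centers, iterations=10):
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

# Two obvious "gravity wells": values near 1 and values near 10
data = [0.9, 1.0, 1.1, 9.8, 10.0, 10.2]
print(kmeans(data, centers=[0.0, 5.0]))  # converges to roughly [1.0, 10.0]
```

Notice that we never told it any labels; the "groups" fall out of the data, which is exactly the difference from classification.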
Collaborative filtering
Man, collaborative filtering is a popularity contest. The company I work for uses it to improve search results. I even gave a talk on this. If enough people click on the second cat picture, it must be better than the first cat picture. In a social or e-commerce setting, if you use the likes and dislikes of various users, you can figure out which is the "best" result for most users, or even for specific sets of people. This can be done on multiple properties for recommender systems. You see this on Google Maps or Yelp when you search for restaurants (you can then filter by service, food, decor, good for kids, romantic, nice view, cost). There is a lecture on collaborative filtering in the Stanford Machine Learning course, which started on July 10 (but you can still get in).
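Here is a toy user-based sketch of the idea in plain Python. This is a deliberate simplification: MLlib's collaborative filtering API is actually built on ALS matrix factorization, and the users and clicks below are invented:

```python
# Toy collaborative filtering: recommend the unseen item with the most
# support from users who agree with you. Users and ratings are made up.

def recommend(ratings, user):
    """Weight other users' ratings by how much they overlap with `user`."""
    scores = {}
    seen = ratings[user]
    for other, theirs in ratings.items():
        if other == user:
            continue
        # crude similarity: count of items both users rated identically
        overlap = sum(1 for item in seen if theirs.get(item) == seen[item])
        for item, rating in theirs.items():
            if item not in seen:
                scores[item] = scores.get(item, 0) + overlap * rating
    return max(scores, key=scores.get) if scores else None

ratings = {
    "alice": {"cat_pic_1": 1, "cat_pic_2": 1},
    "bob":   {"cat_pic_1": 1, "cat_pic_2": 1, "dog_pic": 1},
    "carol": {"cat_pic_1": 0, "lizard_pic": 1},
}
print(recommend(ratings, "alice"))  # -> dog_pic
```

Because bob likes everything alice likes, his dog picture outranks carol's lizard picture: a popularity contest, weighted by taste.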
This is not all you can do (by far), but these are some of the common uses along with the algorithms to accomplish them. Within each of these broad categories are often several alternative algorithms or derivatives of algorithms. Which to pick? Well, that's a combination of mathematical background, experimentation, and knowing the data. Remember, just because you get the algorithm to run doesn't mean the result isn't nonsense.
If you're new to all of this, then the Machine Learning Foundations course on Coursera is a good place to start, despite the creepy floating half-professor.


