A brief introduction to Spark MLlib's APIs for basic statistics, classification, clustering, and collaborative filtering, and what they can do for you
You're not a data scientist. According to the tech and business press, machine learning will stop global warming, except that's apparently fake news created by China. Maybe machine learning can find fake news (a classification problem)? In fact, maybe it can.
But what can machine learning do for you? And how will you find out? There's a good place to start close to home, if you're already using Apache Spark for batch and stream processing. Along with Spark SQL and Spark Streaming, which you're probably already using, Spark provides MLlib, which is, among other things, a library of machine learning and statistical algorithms in API form.
Here is a brief guide to four of the most essential MLlib APIs, what they do, and how you might use them.
Basic statistics
Mainly you'll use these APIs for A-B testing or A-B-C testing. Frequently in business we assume that if two averages are the same, then the two things are roughly equivalent. That isn't necessarily true. Consider a car manufacturer that replaces the seat in a car and surveys customers on how comfortable it is. At one end, shorter customers may say the seat is much more comfortable. At the other end, taller customers will say it is so uncomfortable that they wouldn't buy the car, and the people in the middle balance out the difference. On average the new seat might be slightly more comfortable, but if no one over 6 feet tall buys the car anymore, we've failed somehow. Spark's hypothesis testing lets you run a Pearson chi-squared or a Kolmogorov-Smirnov test to see how well something "fits" or whether the distribution of values is "normal." This can be used most anywhere you have two series of data. That "fit" might be "did you like it," or whether the new algorithm provides "better" results than the old one. You're just in time to enroll in a Basic Statistics course on Coursera.
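To make the seat example concrete, here is a minimal pure-Python sketch of the Pearson chi-squared statistic, the same quantity MLlib's chi-squared test computes under the hood. The survey counts below are invented for illustration:

```python
# Pearson chi-squared statistic: how far observed counts stray from expected.
# The seat-survey numbers are made up for illustration.

def chi_squared(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Comfort ratings, from "very uncomfortable" to "very comfortable"
old_seat = [10, 20, 40, 20, 10]   # expected distribution (old seat)
new_seat = [25, 10, 30, 10, 25]   # observed distribution (new seat)

print(chi_squared(new_seat, old_seat))  # -> 57.5
```

Both distributions have the same mean rating, but the large statistic shows they are shaped very differently, which is exactly the tall-customer problem an average would hide.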
Classification
What are you? If you take a set of attributes, you can get the computer to sort "things" into their right category. The trick is coming up with attributes that match the "class," and there is no single right answer there. There are a lot of wrong answers. Think of someone looking through a stack of forms and sorting them into categories; that's classification. You've run into this with spam filters, which use a list of words spam usually contains. You may also be able to diagnose patients or determine which customers are likely to cancel their broadcast cable subscription (people who don't watch live sports). Essentially, classification "learns" to label things based on the labels applied to past data and can apply those labels in the future. In Coursera's Machine Learning Specialization there is a course specifically on this that started on July 10, but I'm sure you can still get in.
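As a toy illustration of the spam-filter idea (not MLlib's actual API, which provides this as the NaiveBayes classifier), here is a word-count classifier in plain Python; the messages and labels are invented:

```python
# A tiny naive Bayes text classifier: learn word counts per label from
# past data, then label new text. Messages below are made up.

from collections import Counter
import math

def train(labeled_docs):
    """Count words per class and documents per class."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for label, text in labeled_docs:
        counts[label].update(text.lower().split())
        totals[label] += 1
    return counts, totals

def classify(counts, totals, text):
    """Pick the label with the highest log posterior (Laplace smoothing)."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    best, best_score = None, float("-inf")
    for label in counts:
        score = math.log(totals[label] / sum(totals.values()))  # log prior
        denom = sum(counts[label].values()) + len(vocab)
        for w in text.lower().split():
            score += math.log((counts[label][w] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

docs = [("spam", "free money win prize"),
        ("spam", "win free offer now"),
        ("ham", "meeting notes attached"),
        ("ham", "lunch at noon tomorrow")]
counts, totals = train(docs)
print(classify(counts, totals, "win free prize"))  # -> spam
```

The point is the shape of the problem: labeled past data in, labels for future data out.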
Clustering
If k-means clustering is the only thing out of someone's mouth after you ask them about machine learning, you know that they just read the crib sheet and don't know anything about it. If you take a set of attributes, you may find "groups" of points that seem to be pulled together by gravity. Those are clusters. You can "see" these clusters, but there may be clusters that are close together. There may be one big one with a small one off to the side. There may be smaller clusters inside the big cluster. Because of these and other complexities, there are a lot of different "clustering" algorithms. Though different from classification, clustering is often used to sort people into groups. The big difference between "clustering" and "classification" is that with clustering we don't know the labels (or groups) up front; with classification we do. Customer segmentation is a very common use. There are different flavors of that, such as sorting customers into credit or retention risk groups, or into buying groups (fresh produce or prepared foods), but it is also used for things like fraud detection. Here's a course on Coursera with a lecture series specifically on clustering, and yes, they cover k-means for that next interview, but I find it slightly creepy when half the professor floats over the board (you'll see what I mean).
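Since everyone will ask about k-means anyway, here is a bare-bones one-dimensional sketch of it in plain Python (MLlib exposes a full implementation as KMeans); the data points and starting centers are made up:

```python
# Minimal 1-D k-means (Lloyd's algorithm): assign each point to its
# nearest center, then move each center to the mean of its points.

def kmeans(points, centers, iterations=10):
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

# Two obvious "gravity wells": values near 1 and values near 10
data = [0.9, 1.0, 1.1, 9.8, 10.0, 10.2]
print(kmeans(data, centers=[0.0, 5.0]))  # converges to roughly [1.0, 10.0]
```

Notice that we never told it any labels; the "groups" fall out of the data, which is exactly the difference from classification.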
Collaborative filtering
Man, collaborative filtering is a popularity contest. The company I work for uses it to improve search results. I even gave a talk on this. If enough people click on the second cat picture, it must be better than the first cat picture. In a social or e-commerce setting, if you use the likes and dislikes of various users, you can figure out which is the "best" result for most users, or even for specific sets of people. This can be done on multiple properties for recommender systems. You see this on Google Maps or Yelp when you search for restaurants (you can then filter by service, food, decor, good for kids, romantic, nice view, cost). There is a lecture on collaborative filtering in the Stanford Machine Learning course, which started on July 10 (but you can still get in).
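Here is a toy user-based sketch of the idea in plain Python. This is a deliberate simplification: MLlib's collaborative filtering API is actually built on ALS matrix factorization, and the users and clicks below are invented:

```python
# Toy collaborative filtering: recommend the unseen item with the most
# support from users who agree with you. Users and ratings are made up.

def recommend(ratings, user):
    """Weight other users' ratings by how much they overlap with `user`."""
    scores = {}
    seen = ratings[user]
    for other, theirs in ratings.items():
        if other == user:
            continue
        # crude similarity: count of items both users rated identically
        overlap = sum(1 for item in seen if theirs.get(item) == seen[item])
        for item, rating in theirs.items():
            if item not in seen:
                scores[item] = scores.get(item, 0) + overlap * rating
    return max(scores, key=scores.get) if scores else None

ratings = {
    "alice": {"cat_pic_1": 1, "cat_pic_2": 1},
    "bob":   {"cat_pic_1": 1, "cat_pic_2": 1, "dog_pic": 1},
    "carol": {"cat_pic_1": 0, "lizard_pic": 1},
}
print(recommend(ratings, "alice"))  # -> dog_pic
```

Because bob likes everything alice likes, his dog picture outranks carol's lizard picture: a popularity contest, weighted by taste.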
This is not all you can do (by far), but these are some of the common uses along with the algorithms to accomplish them. Within each of these broad categories are often several alternative algorithms or derivatives of algorithms. Which to pick? Well, that's a combination of mathematical background, experimentation, and knowing the data. Remember, just because you get the algorithm to run doesn't mean the result isn't nonsense.
If you're new to all of this, then the Machine Learning Foundations course on Coursera is a good place to start, despite the creepy floating half-professor.


