Machine Learning in Spark with MLlib: Beyond Word Count

Posted by Clarity Insights

Apache Spark™ brings great improvements in speed over traditional Hadoop MapReduce (check out the Terasort challenge), and one area in which it really shines is Machine Learning, since iterative ML algorithms benefit enormously from Spark's in-memory processing. While many introductions to Apache Spark demonstrate the canonical word count example, let's dive into something a little more in-depth.

Natural Language Processing: Enter the 20 Newsgroups Dataset
The field of Machine Learning has many great datasets that have been used to benchmark new algorithms. Here we reach for the 20 Newsgroups dataset, which contains roughly 20,000 newsgroup documents spread across 20 different newsgroups. The goal: train a model to predict the newsgroup a post belongs to, based only on the words in its text. Similar Natural Language Processing techniques are used to solve a variety of complex business challenges.

The steps in this process, as in many Machine Learning problems, are Data Preprocessing > Feature Engineering > Model Training > Prediction. Spark makes this a breeze with a clean API that scales linearly across thousands of servers.

Data Preprocessing:
For this NLP task we preprocess the data using the Stanford CoreNLP library. Side note: because Stanford CoreNLP is written in Java, we can call it from Scala with ease. We remove stop words and perform lemmatization, which reduces inflected forms to a common base word (for example, "running" and "ran" both become "run").
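To make the idea concrete, here is a toy stand-in for that preprocessing step in plain Scala: lowercase, tokenize on non-letters, drop stop words, and map a few inflected forms to their lemmas. The stop-word list and lemma table below are tiny, hand-rolled examples of ours; Stanford CoreNLP derives lemmas properly from part-of-speech tags.

```scala
object Preprocess {
  // A handful of common English stop words (illustrative, not exhaustive).
  val stopWords = Set("the", "a", "an", "is", "are", "to", "of", "and", "in")

  // Hypothetical hand-rolled lemma table; the real library computes these.
  val lemmas = Map("cars" -> "car", "running" -> "run", "ran" -> "run")

  // Lowercase and split on anything that is not a letter.
  def tokens(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSeq

  // Drop stop words, then replace each word with its lemma if we know one.
  def clean(text: String): Seq[String] =
    tokens(text).filterNot(stopWords).map(w => lemmas.getOrElse(w, w))
}
```

For example, `Preprocess.clean("The cars are running in the rain")` yields `Seq("car", "run", "rain")`: stop words gone, inflected forms collapsed.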

Feature Engineering – Wranglin’ Data:
The dataset for our model needs to be a Document Term Matrix. We could use what is known as the "Bag of Words" model, which simply uses the counts of each word in the documents, but we will use a statistical measure known as TF-IDF (term frequency–inverse document frequency) to try to boost accuracy. This technique gives a number representing how important a word is to a document, offset by how often the word appears across the entire corpus.
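The formula itself fits in a few lines of plain Scala. The smoothed IDF below mirrors the variant MLlib uses, log((N + 1) / (df + 1)); the object and function names are ours:

```scala
object TfIdf {
  // Term frequency: raw count of term t in document d.
  def tf(term: String, doc: Seq[String]): Double =
    doc.count(_ == term).toDouble

  // Inverse document frequency, smoothed the way MLlib smooths it:
  // log((corpus size + 1) / (documents containing the term + 1)).
  def idf(term: String, corpus: Seq[Seq[String]]): Double = {
    val df = corpus.count(_.contains(term))
    math.log((corpus.size + 1.0) / (df + 1.0))
  }

  // A word scores high when frequent in this document but rare overall.
  def tfidf(term: String, doc: Seq[String], corpus: Seq[Seq[String]]): Double =
    tf(term, doc) * idf(term, corpus)
}
```

A word like "the" that appears in nearly every document gets an IDF near zero, so its weight collapses no matter how often it occurs in any single post.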

Model Training – Teaching Spark the “ABCs”:
Spark makes training our model straightforward. First we add the standard Spark boilerplate, then we split the data into a training set and a held-out test set; the test set lets us measure how well the model generalizes beyond the data it was trained on.
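That boilerplate plus the split might look like the following sketch (local mode, with tiny stand-in vectors where the real pipeline would supply each document's TF-IDF vector and newsgroup index; names like `points` are ours):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Local-mode boilerplate; on a real cluster the master comes from spark-submit.
val sparkConf = new SparkConf().setAppName("NewsgroupClassifier").setMaster("local[*]")
val sc = new SparkContext(sparkConf)

// Stand-in data: each LabeledPoint pairs a feature vector with a class label.
val points = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 1.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 0.0, 1.0))
))

// Hold back roughly 30% of the data to evaluate the trained model later.
val Array(training, test) = points.randomSplit(Array(0.7, 0.3), seed = 42L)
```

`randomSplit` assigns each record to one side or the other, so the two RDDs together always cover the full dataset.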

The next step is to choose one of MLlib's many classification algorithms; in this case we benchmark the Naïve Bayes model, and run it:
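In MLlib itself this is essentially one call, `NaiveBayes.train(training)`. To show what that call computes under the hood, here is a toy multinomial Naïve Bayes in plain Scala (an illustration with Laplace smoothing, not MLlib's implementation; all names are ours):

```scala
object ToyNaiveBayes {
  // (class log-priors, per-class word log-likelihoods, vocabulary)
  type Model = (Map[String, Double], Map[String, Map[String, Double]], Set[String])

  def train(docs: Seq[(String, Seq[String])], alpha: Double = 1.0): Model = {
    val vocab = docs.flatMap(_._2).toSet
    val byClass = docs.groupBy(_._1)
    // Class prior: fraction of documents carrying each label.
    val logPrior = byClass.map { case (c, d) => c -> math.log(d.size.toDouble / docs.size) }
    // Word likelihoods with Laplace (add-alpha) smoothing so unseen words
    // in a class never get probability zero.
    val logLik = byClass.map { case (c, d) =>
      val words  = d.flatMap(_._2)
      val counts = words.groupBy(identity).map { case (w, ws) => w -> ws.size.toDouble }
      val total  = words.size.toDouble
      c -> vocab.map { w =>
        w -> math.log((counts.getOrElse(w, 0.0) + alpha) / (total + alpha * vocab.size))
      }.toMap
    }
    (logPrior, logLik, vocab)
  }

  // Pick the class maximizing log P(class) + sum of log P(word | class).
  def predict(model: Model, doc: Seq[String]): String = {
    val (logPrior, logLik, vocab) = model
    logPrior.keys.maxBy { c =>
      logPrior(c) + doc.filter(vocab).map(logLik(c)).sum
    }
  }
}
```

Training on a few labeled token lists and predicting a fresh document exercises exactly the arithmetic MLlib performs at scale over the TF-IDF matrix.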

Spark has built-in metrics to calculate the confusion matrix and various evaluation metrics for your model. How well did Spark learn?
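MLlib's `MulticlassMetrics` builds these from an RDD of (prediction, label) pairs; the underlying bookkeeping is simple enough to sketch in plain Scala (names are ours):

```scala
object EvalSketch {
  // Confusion matrix: rows are actual classes, columns are predicted classes.
  def confusionMatrix(pairs: Seq[(String, String)], classes: Seq[String]): Array[Array[Int]] = {
    val idx = classes.zipWithIndex.toMap
    val m = Array.ofDim[Int](classes.size, classes.size)
    for ((actual, predicted) <- pairs) m(idx(actual))(idx(predicted)) += 1
    m
  }

  // Accuracy: fraction of pairs where the prediction matched the label.
  def accuracy(pairs: Seq[(String, String)]): Double =
    pairs.count { case (a, p) => a == p }.toDouble / pairs.size
}
```

Correct predictions land on the matrix diagonal, which is why a heavily diagonal confusion matrix is the picture of a well-trained classifier.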

The strong diagonal in the confusion matrix, where predicted class matches actual class, shows our model predicting with a high degree of accuracy.

The Big Data Experts
Who uses Spark? Some of the most cutting edge companies in the world. Clarity’s Spark experts build solutions to tackle the largest data challenges, across key industries. Contact us to start a conversation.

Ryan Wheeler, Senior Consultant, Clarity Solution Group

Connect with Ryan Wheeler on LinkedIn






Written by Clarity Insights

Clarity Insights is a strategic partner to the nation's leading data-driven brands.

Topics: Machine Learning, Big Data
