By Clarity Insights
Clarity Insights is a strategic partner to the nation's leading data-driven brands.

Apache Spark™ brings great improvements in speed over traditional Hadoop MapReduce (check out the TeraSort benchmark results), and one area in which it really shines is Machine Learning, where iterative algorithms benefit from Spark's in-memory processing rather than MapReduce's repeated disk I/O. While many introductions to Apache Spark give a demonstration using the canonical word count example, let's dive into something a little more in-depth.

Natural Language Processing: Enter the 20 Newsgroups Dataset
The field of Machine Learning has many great datasets that have been used to benchmark new algorithms. Here we reach for the 20 Newsgroups dataset, which contains roughly 20,000 newsgroup documents spread across 20 different newsgroups. The goal: train a model that predicts the category of a conversation based only on the words in the text. Similar Natural Language Processing techniques are used to solve a variety of complex business challenges.

The steps in this process, as in many Machine Learning problems, are Data Processing > Feature Engineering > Model Training > Prediction. Spark makes this a breeze with a clean API that scales linearly across thousands of servers.

Data Preprocessing:
For this NLP task we preprocess the data using the Stanford NLP library. Side note: because the Stanford NLP library is written in Java, we can use it from Scala with ease. We remove stop words and perform lemmatization, which reduces related word forms to a common base (e.g., "running" and "ran" both map to "run").
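The preprocessing step can be sketched in plain Scala. This is an illustrative stand-in, not the post's actual code: the stop-word list is a toy subset, and `lemma` is a placeholder where Stanford CoreNLP's lemma annotations would plug in.

```scala
// Hypothetical sketch of the preprocessing step: tokenize, drop stop
// words, then lemmatize. In the real pipeline, Stanford CoreNLP
// supplies the lemmas; here lemma() is an identity placeholder.
object Preprocess {
  // A tiny stop-word list for illustration; a real run uses a fuller set.
  val stopWords: Set[String] =
    Set("the", "a", "an", "and", "or", "of", "to", "in", "is")

  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

  // Placeholder: with Stanford CoreNLP this would return the token's lemma.
  def lemma(token: String): String = token

  def preprocess(text: String): Seq[String] =
    tokenize(text).filterNot(stopWords.contains).map(lemma)
}
```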

Feature Engineering – Wranglin’ Data:
The dataset for our model needs to be a document-term matrix. We could use what is known as the "Bag of Words" model, which simply uses the counts of each word in the documents, but we will use a statistical measure known as TF-IDF (term frequency–inverse document frequency) to try to boost accuracy. This technique gives a number representing how important a word is to a document: the score rises with the word's frequency in that document and falls when the word is common across the entire corpus.
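To make the formula concrete, here is a minimal TF-IDF computation in plain Scala (on a real corpus this would be MLlib's feature transformers; the smoothing choice below is one common variant, not necessarily what the post used):

```scala
// Toy TF-IDF: tf is a term's share of the document, idf is a smoothed
// log of how rare the term is across the corpus, and the score is
// their product. Terms appearing in every document score zero.
object TfIdf {
  def tf(doc: Seq[String]): Map[String, Double] =
    doc.groupBy(identity).map { case (t, ts) => t -> ts.size.toDouble / doc.size }

  // Smoothed inverse document frequency: log((N + 1) / (df + 1)).
  def idf(corpus: Seq[Seq[String]]): Map[String, Double] = {
    val n = corpus.size.toDouble
    corpus.flatMap(_.distinct).groupBy(identity).map {
      case (t, occ) => t -> math.log((n + 1) / (occ.size + 1))
    }
  }

  def tfidf(doc: Seq[String], corpus: Seq[Seq[String]]): Map[String, Double] = {
    val idfs = idf(corpus)
    // Unseen terms (df = 0) get the maximum smoothed idf.
    tf(doc).map { case (t, f) =>
      t -> f * idfs.getOrElse(t, math.log(corpus.size + 1.0))
    }
  }
}
```

For example, a word appearing in every document gets idf = log(1) = 0, so it contributes nothing, no matter how often it repeats.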

Model Training – Teaching Spark the “ABCs”:
Spark makes training our model straightforward. First we add in the Spark boilerplate, then we split the data into a training set and a held-out test set; the test set lets us measure how well our model has learned from the data it has never seen.
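The split step can be sketched as follows. With Spark it is a single call (e.g. `data.randomSplit(Array(0.8, 0.2))`); below is a plain-Scala analogue with an illustrative 80/20 fraction and seed:

```scala
import scala.util.Random

// Plain-Scala analogue of Spark's randomSplit: shuffle with a fixed
// seed for reproducibility, then cut at the training fraction.
def trainTestSplit[A](data: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
  val shuffled = new Random(seed).shuffle(data)
  val cut = (data.size * trainFraction).toInt
  (shuffled.take(cut), shuffled.drop(cut))
}
```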

The next step is to choose one of MLlib's many classification algorithms. In this case we benchmark the Naïve Bayes model and run it:
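In the post, MLlib's `NaiveBayes` does this at scale; to show what the model actually computes, here is a toy multinomial Naïve Bayes in plain Scala (illustrative only, with hypothetical labels):

```scala
// Toy multinomial Naive Bayes. Training estimates a log prior per
// class and a per-word log likelihood with Laplace (+1) smoothing, so
// unseen words never zero a class out. Prediction picks the class
// with the highest total log probability.
object ToyNaiveBayes {
  def train(docs: Seq[(String, Seq[String])]): Seq[String] => String = {
    val vocab = docs.flatMap(_._2).toSet
    val n = docs.size.toDouble
    val model = docs.groupBy(_._1).map { case (label, ds) =>
      val words = ds.flatMap(_._2)
      val counts = words.groupBy(identity).map { case (w, ws) => w -> ws.size }
      val denom = (words.size + vocab.size).toDouble
      val logLik = vocab.map(w => w -> math.log((counts.getOrElse(w, 0) + 1) / denom)).toMap
      label -> (math.log(ds.size / n), logLik)
    }
    (doc: Seq[String]) =>
      model.maxBy { case (_, (logPrior, logLik)) =>
        logPrior + doc.flatMap(logLik.get).sum
      }._1
  }
}
```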

Spark has built-in metrics to calculate the confusion matrix and various evaluation metrics for your model. How well did Spark learn?
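The evaluation step can be sketched like this: build a confusion matrix and accuracy from (predicted, actual) label pairs. Spark's `MulticlassMetrics` computes the same quantities from an RDD of pairs; this plain-Scala version is just for illustration.

```scala
// Confusion matrix as a map from (predicted, actual) to a count,
// plus overall accuracy (fraction of pairs on the diagonal).
object Evaluate {
  def confusionMatrix(pairs: Seq[(String, String)]): Map[(String, String), Int] =
    pairs.groupBy(identity).map { case (cell, ps) => cell -> ps.size }

  def accuracy(pairs: Seq[(String, String)]): Double =
    pairs.count { case (pred, actual) => pred == actual }.toDouble / pairs.size
}
```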

The strong diagonal of the confusion matrix shows our model predicting the actual class with a high degree of accuracy.

The Big Data Experts
Who uses Spark? Some of the most cutting-edge companies in the world. Clarity's Spark experts build solutions to tackle the largest data challenges across key industries. Contact us to start a conversation.

Ryan Wheeler, Senior Consultant, Clarity Solution Group

Connect with Ryan Wheeler on LinkedIn





