Clarity Insights Blog

Highlights from the 2017 Spark Summit East

Posted by Tripp Smith | Feb 20, 2017 10:00:21 AM

boston image.jpg
*An Image of the weather conditions in Boston during Spark Summit East 2017

This February’s Spark Summit East 2017 in Boston highlighted exciting new features in core Apache® Spark™, evolving trends in the big data market, and innovative use cases for one of the most popular open source projects in big data analytics. Despite difficult weather conditions and crowds celebrating Super Bowl Champion New England Patriots, more than 1,500 attendees participated in the event.

Databricks CTO Matei Zaharia kicked off the first day keynotes with his talk, “What to Expect for Big Data and Apache Spark in 2017”, an overview of how the Apache Spark platform, user base, and deployment models have evolved over the past year.[1]

Project Tungsten Performance Gains

Changing dynamics in performance tradeoffs between network, disk, and compute have challenged prior conventions for big data processing and analytics. While network and storage performance efficiency have grown by 10x or more over the past decade, CPU efficiency has remained relatively flat. In 2016, Apache Spark Project Tungsten has focused on increasing CPU efficiency for the Spark platform by optimizing jobs for CPU, cache, and memory.

One of the bottlenecks Tungsten tackles is inefficient JVM garbage collection. Tungsten takes advantage of off-heap memory management to leverage Spark’s underlying knowledge about the life cycle of memory blocks to manage data explicitly by converting operations to operate directly against binary data. This brings Spark’s execution engine closer to bare metal performance by leveraging sun.misc.Unsafe, allowing Spark to operate directly on data structures in memory more like C.[2]

Another bottleneck Tungsten tackles is CPU time spent waiting for data to be fetched from main memory. Cache-aware computation makes more effective use of L1/ L2/L3 CPU caches. These are the high performance memory caches that the CPU uses to buffer data without having to fetch from slower main memory. Tungsten optimizes operations like aggregations, sorting, and joins by colocating keys in the binary data format in memory to reduce cache misses and thrashing between RAM and cache.

Finally, Tungsten enables dynamic optimized JVM bytecode generation for operations on dataframes. This avoids the overhead of interpreting instructions row by row and instead optimizes instructions in high performance native code before processing the entire dataframe. This has also enabled other speedups within the shuffle process that passes data across the network through stages in a processing pipeline.

#Streaming is the Word of the Day

A major theme of the summit was increased focus on streaming and low-latency, fast first deployment models. In an onsite interview session, analysts from Wikibon™ predicted that the market for streaming analytics will grow at a 42% CAGR between 2014 and 2026, outpacing the growth for overall big data market by a factor of 3x.[3] Methodology for this estimate notwithstanding, the prevalence of streaming use cases in both the keynotes and the sessions was clearly apparent.

Structured Streaming is a new Apache Spark feature that enables near seamless portability of transformation logic from batch data to application in continuously updating streaming contexts. Michael Armbrust’s keynote “Insights Without Tradeoffs: Using Structured Streaming in Apache Spark” illustrated the new capabilities in a Databricks notebook, integrating historical data with newly arriving streaming data and visualizing the continuously updating results in a live demo, around 15 steps. This highlights the efficiency breakthroughs of #FastFirst processing, eliminating the need to manage two code bases, one batch and one low latency, to deliver timely and accurate results.[4]

Other streaming breakthroughs in 2016 include better streaming integrations with other systems, e.g., JDBC source and sink, Kafa integration, and stateful processing mapWithState. The new mapWithState operator enables efficient processing of continuous updates from a stream, primarily by avoiding unnecessary processing for keys where no new data has arrived. This speeds up stateful processing in Spark with 6X lower latency and maintaining state for 10X more keys than prior implementations. This enables 8X lower processing times allowing lower end-to-end latencies.[5]


Spark Summit East 2017 provided a look back at some major enhancements in the core Apache Spark platform, demonstrated eye popping new features in Databricks' platform, and showcased a promising year to come in 2017.

Related Links

[1] Spark Summit East 2017: Another Record-Setting Spark Summit - https://databricks.com/blog/2017/02/09/spark-summit-east-2017-another-record-setting-spark-summit.html

[2] Project Tungsten: Bringing Apache Spark Closer to Bare Metal - https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html

[3] Wikibon Big Data Market Update Pt. 1 – Spark Summit East 2017 – #sparksummit – #theCUBE - http://www.codingtweaks.com/wikibon-big-data-market-update-pt-1-spark-summit-east-2017-sparksummit-thecube/

[4] Real-time Streaming ETL with Structured Streaming in Apache Spark 2.1 - https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html

[5] Faster Stateful Stream Processing in Apache Spark Streaming - https://databricks.com/blog/2016/02/01/faster-stateful-stream-processing-in-apache-spark-streaming.html

Read the full Blog from Tripp's Smith on LinkedInlinkedin-icon.png

Topics: Big Data, Analytics, Apache Spark

Written by Tripp Smith