Once upon a time, if you wanted a multipurpose platform for analytics, SQL databases were the way to go. This was doubly true for big data, where volumes had swelled beyond what a single machine could analyze. Big data was simply too big to pass across the network, so data proximity was a fundamental assumption of massively parallel processing (MPP) architecture. Distributing data at random, without regard for the most common access paths, was unthinkable.
As a result, big data architecture solutions converged on the MPP SQL database as the multipurpose cure-all for the majority of analytic applications, combining the storage capabilities of the database with the analytic capabilities of SQL. In practice, this approach has its efficiencies, but it also creates complications. The database-hosted enterprise data warehouse (EDW) becomes a single point of truth for enterprise data, and a point of proliferation to downstream analytics applications where SQL is no longer the ideal toolset.
Building Enterprise Data Platforms on Cloud Infrastructure
The emergence of new cloud architectures and infrastructures for big data offers a response that challenges conventional thinking about big data and collocated storage and compute.
What’s at Stake?
The wave of innovation in big data applications over the past decade has transformed analytics through an enormous variety of analytic access paths that don't require SQL or conventional relational structures. Apache Spark™, currently the most active open source big data framework, supports memory-oriented data frame APIs, the GraphX graph processing framework, MLlib machine learning APIs, and streaming APIs, as well as SQL. Each Apache Spark™ component can use storage resources independently of compute resources, and Spark supports a wide variety of storage platforms, not just Apache™ Hadoop®-native HDFS™.
For most users, SQL is still the access method of choice for big data analytics. It enables them to transform and analyze data declaratively, decoupling business logic from the programmatic implementation used for execution. The success of engines like Apache Hive™, Apache Spark™ SQL, and others has proven the value of SQL as a language for interacting with data without depending on collocated storage, and in so doing has proved that decoupling storage from compute does not compromise the ability to support enterprise analytic SQL workloads.
But proving that decoupling does not affect performance is only half the story; it also needs to bring measurable improvements in system economy and efficiency. The myriad forms of modern data analytics workloads (latency-constrained backend preparation, ad hoc user exploration, computationally intensive machine learning, real-time endpoint interactions) all imply, in a converged architecture, a dedicated, continuously running, fixed-capacity infrastructure that is inefficient to maintain.
Inefficiencies can also arise from the maintenance of the enterprise single point of truth within a single general purpose relational database with converged compute and storage. This encourages architectural antipatterns:
- The constant presence of over- or under-provisioning relative to typical and peak workloads
- Increased cost of storage on analytic MPP platforms when data volumes grow disproportionately to analytic workloads (our research showed a variance in some cases of as much as 300:1 by comparison with independent storage)
- Proliferation to downstream applications that would be better suited to specific analytic applications or storage structures
- Constraints on innovation or user interaction with a single access path
- Artificial infrastructure constraints on data access and lifecycle management, potentially compromising data lineage
The alternative is a flexible infrastructure that is elastic to the demands of these storage and computational requirements as they shift over time.
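The provisioning antipattern above can be made concrete with a little arithmetic. The following is a minimal sketch, not a sizing model: the function names, the 200 TB of data, the 10 TB per node, and the hourly demand profile are all hypothetical values chosen for illustration. It compares the node-hours consumed by a converged cluster, which must be sized for the larger of its storage footprint and its peak compute demand, with the node-hours consumed when compute is provisioned hour by hour against independent storage.

```python
import math


def converged_node_hours(storage_tb, hourly_compute_nodes, tb_per_node):
    """Node-hours for a converged cluster that runs at a fixed size.

    The cluster must hold all the data AND cover the peak compute demand,
    so its size is the larger of the two requirements, for every hour.
    """
    storage_nodes = math.ceil(storage_tb / tb_per_node)
    cluster_nodes = max(storage_nodes, max(hourly_compute_nodes))
    return cluster_nodes * len(hourly_compute_nodes)


def decoupled_node_hours(hourly_compute_nodes):
    """Node-hours when compute is provisioned per hour; storage is separate."""
    return sum(hourly_compute_nodes)


if __name__ == "__main__":
    # Hypothetical day: 20 quiet hours needing 2 nodes, 4 peak hours needing 10.
    demand = [2] * 20 + [10] * 4
    fixed = converged_node_hours(storage_tb=200, hourly_compute_nodes=demand,
                                 tb_per_node=10)
    elastic = decoupled_node_hours(demand)
    print(f"converged: {fixed} node-hours, decoupled: {elastic} node-hours")
```

In this illustrative profile the storage requirement, not the workload, dictates the size of the converged cluster, which is the situation described in the second bullet above. The sketch ignores the cost of the independent storage tier itself, so it shows the shape of the trade-off and not a real saving.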
Evolving Economics Challenge Conventional Architecture—The Wider Industry Context
Moore's Law, the longstanding observation that transistor density, and with it the compute capacity available on a CPU, doubles roughly every 18 to 24 months, is encountering physical limitations that will restrict the ability to scale up compute, creating an even higher demand for cluster computing.
Simultaneously, Gilder's Law, which states that network bandwidth roughly doubles every six months, continues to hold true. As we begin to see gigabit Ethernet replaced by 10-40 GbE infrastructure and enhanced network architectures, the network is no longer the critical bottleneck to big data analytics. The consequent faster transfers from storage to compute tiers reduce the need for collocated storage and compute, making decoupling a natural next step.
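A rough calculation shows why link speed matters here. This is a minimal sketch under stated assumptions: the function name is hypothetical, the 1 TB data volume is an arbitrary example, and the figures are ideal line-rate numbers that ignore protocol overhead, contention, and storage read speed.

```python
def transfer_seconds(data_bytes, link_gbps):
    """Ideal time to move data_bytes over a link of link_gbps gigabits/second."""
    bits = data_bytes * 8
    return bits / (link_gbps * 1e9)


if __name__ == "__main__":
    one_terabyte = 10**12  # decimal terabyte, chosen only for illustration
    for gbps in (1, 10, 40):
        minutes = transfer_seconds(one_terabyte, gbps) / 60
        print(f"{gbps:>2} GbE: {minutes:6.1f} minutes per TB")
```

At line rate, moving a terabyte takes over two hours on gigabit Ethernet but only a few minutes on 40 GbE, which is the difference between having to keep data next to compute and being able to fetch it on demand.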
Benchmarks are demonstrating that separate compute and storage can outperform dedicated MPP relational database platforms, even for conventional SQL workloads. Cloudera™ recently reported that, by leveraging S3 storage, Impala™ can outperform the cloud-native MPP database Redshift™ on a variety of performance, availability, and scalability metrics. These trends are unlikely to reverse.
Conventional MPP analytic database vendors have been quick to recognize and adapt to this trend. Best-in-class, petabyte-scale MPP vendors like Vertica™ have announced roadmaps embracing the separation of compute and storage, decoupling integration, persistence, and analytics from one another in end-to-end pipelines. Cloud PaaS vendors like Microsoft are integrating decoupled concepts into their offerings, with suites of components like Azure® Data Lake Store, Azure® SQL Data Warehouse, and PolyBase that federate across storage and compute for a variety of workloads and latency requirements. Others will follow as the trend gathers pace.
Putting it All Together
Managing the enterprise single point of truth within a distributed file system with immutable storage requires different integration patterns from managing data in a typical MPP database. Building repeatable patterns for data management processes optimized for file oriented storage simplifies the transition from ACID relational databases. Storage management techniques such as partitioning, compression, and columnar storage should be considered to optimize performance for both integration and access.
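The file-oriented pattern described above can be sketched with nothing but the Python standard library. This is an illustration under stated assumptions, not a production layout: the function names, the `key=value` directory convention, and the sample rows are hypothetical choices, and gzip-compressed CSV stands in for a real storage format. It shows two of the three techniques, partitioning and compression, and it writes a new immutable file on every load instead of updating existing files in place. Columnar storage is not shown; in practice that would normally mean a columnar file format such as Parquet or ORC.

```python
import csv
import gzip
import tempfile
import uuid
from collections import defaultdict
from pathlib import Path


def write_partitioned(rows, root, partition_key):
    """Write rows as new gzip CSV files under <root>/<key>=<value>/ directories.

    Existing files are never modified: each call adds uniquely named files,
    which mirrors append-only loading into immutable storage.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_key]].append(row)

    written = []
    for value, group in groups.items():
        part_dir = Path(root) / f"{partition_key}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        path = part_dir / f"part-{uuid.uuid4().hex}.csv.gz"
        # The partition value lives in the directory name, not in the file.
        fields = [name for name in group[0] if name != partition_key]
        with gzip.open(path, "wt", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=fields, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(group)
        written.append(path)
    return written


def read_partition(root, partition_key, value):
    """Read only the files of one partition, skipping all other directories."""
    part_dir = Path(root) / f"{partition_key}={value}"
    rows = []
    for path in sorted(part_dir.glob("*.csv.gz")):
        with gzip.open(path, "rt", newline="") as fh:
            for row in csv.DictReader(fh):
                row[partition_key] = value  # restore the partition column
                rows.append(row)
    return rows


if __name__ == "__main__":
    sample = [
        {"day": "2024-01-01", "customer": "a", "amount": "5"},
        {"day": "2024-01-02", "customer": "b", "amount": "7"},
    ]
    with tempfile.TemporaryDirectory() as root:
        write_partitioned(sample, root, "day")
        print(read_partition(root, "day", "2024-01-02"))
```

Because a reader that filters on the partition key opens only one directory, the layout optimizes access as well as integration, which is the point of choosing the partition key from the most common access path.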
The key to selecting complementary components to support analytic applications is to consider the last mile to the endpoint where analytics will be consumed. For example, if consumers will interface through a BI tool, the last step in an integration workload may be to populate a cube or data frame in memory directly instead of copying data to a relational database as an intermediate step. With SQL separated from storage, ephemeral SQL compute engines or solutions that enable access to external storage will eliminate both the need for additional steps to transfer data from the storage platform and any redundant persistent copies of the data.
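The idea of an ephemeral SQL engine over external storage can be illustrated in miniature. In the sketch below, Python's built-in `sqlite3` module with an in-memory database is only a stand-in for a real distributed engine, and the function name, table name, and sample file are hypothetical. The engine is created when a query arrives, reads the data from a file it does not own, returns the result, and then disappears without leaving a persistent copy.

```python
import csv
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path


def query_external_csv(csv_path, sql, table="events"):
    """Run one SQL query over a CSV file using a throwaway in-memory engine."""
    with open(csv_path, newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader)
        rows = list(reader)

    # The database exists only for the duration of this call.
    with closing(sqlite3.connect(":memory:")) as conn:
        columns = ", ".join(f'"{name}"' for name in header)
        conn.execute(f'CREATE TABLE "{table}" ({columns})')
        marks = ", ".join("?" for _ in header)
        conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', rows)
        return conn.execute(sql).fetchall()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        path = Path(root) / "events.csv"
        path.write_text("customer,amount\na,5\nb,7\na,1\n")
        totals = query_external_csv(
            path,
            "SELECT customer, SUM(CAST(amount AS INTEGER)) "
            "FROM events GROUP BY customer ORDER BY customer",
        )
        print(totals)
```

The file remains the only persistent copy of the data; the SQL layer is compute that can be started, sized, and discarded per workload.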
When To Consider Decoupling?
A few key dimensions inform how aggressively you should pursue a separate compute and storage architecture:
- Analytic maturity: Is your organization struggling to deliver fundamental data governance, data quality, and descriptive reporting, or are you currently supporting a variety of analytic workloads, applications, interfaces, and endpoints? Decoupling maintains consistency while providing data for a variety of applications without proliferating data or business logic.
- Dynamism of requirements: Are your business requirements stable, or are they evolving with consistent requests for new enrichments and projections into new business views? Decoupling accelerates cycle time by enabling syndication and federation with sandbox environments that may have variable requirements over a fixed duration.
- Workload variety: Do you have stable, balanced workloads, or do you have predictable or unpredictable spikes? Decoupling enables elastic resource provisioning for high-priority workloads as needed, deploying multiple analytic engines (SQL, ML, graph, and so on) ephemerally to access the centrally managed data.
- Rapid data growth: Is your data growing in line with compute requirements, or do your storage needs for operational, archival, and analytic data outpace compute requirements? Decoupling offers independent scalability and access to underlying storage without adding costly compute resources.
- Security and governance: Are your workload requirements compatible with your security requirements within the context of an SQL database, or do you need to enforce strong security across a breadth of use cases? Are you forced to make data governance or lifecycle decisions based on cost constraints vs. business value? Decoupling enables a single ecosystem and consistent approach to security, providing transparency to data usage and lifecycle within a platform that is performant and cost effective for raw, warm, governed, and archival data according to lifecycle requirements.