Data Lake vs Data Warehouse? In our experience, people often confuse the two approaches to data management. They either conflate them or misunderstand how they are alike, yet different. Rather than deal in the abstract, we thought it would make sense to have the data lakes vs. data warehouses discussion using the example of data analytics in insurance claims management. It’s a category we know well.
In theory, insurance claims management is relatively predictable. Experienced claims managers can usually make an accurate assessment of the potential loss from a claim early in the process. They can apply various “rules of thumb” to the facts the claim presents. On occasion, however, there are surprises, which no one likes and which can result in huge, unforeseen payouts.
Surprise claims are difficult to predict when viewed from only one perspective. That’s why they’re surprises. Now, though, advances in data analytics are giving claims managers more insights: better rules of thumb, so to speak. Having worked extensively in the insurance industry, we see improvements in claims management through the use of both data lakes and data warehouses.
The search for a better approach to analysis is not new. The insurance industry has been plumbing claims data for decades looking for patterns and insights that will reduce risk and provide protection from unpredictability. What’s different now are the potential scale of data available for analytics and the approaches taken by insurance data architects to make that data available.
A Hypothetical Insurance Claim
Data is forcing a change in the old rules of thumb. Consider a hypothetical claim that involves, for the sake of argument, an actual thumb. Imagine that an auto policy holder breaks his thumb in a minor car accident. What are the expected and potential losses to the carrier from that claim? Well... it really depends. Most of the time, this would be a relatively small claim. There might be some costs associated with auto repair, medical treatment and physical therapy.
In certain cases, however, a simple broken thumb in a car accident can cost an auto insurance carrier significantly more than expected, and those cases may be difficult to identify without additional information. Why? There may be a virtually unlimited number of reasons, some of which can be gleaned through analysis of the data. By analyzing hundreds of separate variables from many years of accident data, a data analytics engine has the potential to identify hidden patterns that predict unexpected claim outcomes.
Factors such as distance from the driver’s home, time of day, weather, surrounding neighborhoods, traffic and violation patterns where the crash occurred, details of the other car in the accident and many other hidden data points may influence the loss amount. The amount of data and the number of variables are beyond the limit of analysis for even the most experienced claims analyst. Insurance claims managers now have access to tools that use advanced analytical models to estimate the probability of a high-loss claim, so that managers can act accordingly. To make this work, though, the tools first need to ingest, store and manage the data, and then analyze it. This is where the data lake and data warehouse architectures come into play.
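To make the idea of a high-loss probability a little more concrete, here is a minimal sketch of how such a score might be computed. It assumes a simple logistic model; the factor names, weights, bias and threshold are all hypothetical and invented for illustration. A real analytics engine would learn its parameters from many years of claims data and would use far more variables.

```python
import math

# Hypothetical weights, invented for illustration only. A real model
# would learn them from historical claims data.
WEIGHTS = {
    "miles_from_home": 0.02,
    "night_time": 0.8,
    "bad_weather": 0.5,
    "prior_violations_nearby": 0.3,
}
BIAS = -3.0  # hypothetical baseline: most claims are not high loss


def high_loss_probability(claim: dict) -> float:
    """Combine the claim's factors into a probability between 0 and 1."""
    score = BIAS + sum(
        weight * float(claim.get(name, 0)) for name, weight in WEIGHTS.items()
    )
    return 1.0 / (1.0 + math.exp(-score))  # logistic function


def flag_for_review(claim: dict, threshold: float = 0.5) -> bool:
    """Tell the claims manager whether this claim deserves a closer look."""
    return high_loss_probability(claim) >= threshold


# Two made-up broken-thumb claims with different circumstances.
routine_claim = {"miles_from_home": 2}
unusual_claim = {
    "miles_from_home": 60,
    "night_time": 1,
    "bad_weather": 1,
    "prior_violations_nearby": 4,
}
```

With these made-up numbers, the unusual claim scores higher than the routine one and is the only one flagged for review. The point is not the specific formula but that the score depends on many factors at once, which is what makes a large, varied data store necessary.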
Data Lake vs Data Warehouse in the Broken Thumb Case
To make informed decisions and take appropriate action, the insurance company needs data, and a lot of it, to make its claim predictions. That includes data related to the accident itself as well as ancillary data, such as the factors listed above. This data may come in many forms. For capturing such extensive data volume and variety, a data lake provides an ideal platform. As its name suggests, a data lake is a massive pool of original data, usually built using platforms like Hadoop. It’s not highly structured or transformed, and that’s sort of the point.
Data lakes generally store data in its native structure. The insurance company can grab any and all data related to car accidents and quickly populate the data lake. Data loads into the lake from its original source systems, which may include public accident data from the government, weather reports, automotive telemetry data, driver demographic studies, geographic data and vehicle safety research as well as claims data from the insurance company itself. The data lake turns away nothing. All data is retained. With this more flexible access, the insurance company can search the data lake for even the most unlikely data points that may explain unprecedented claim losses.
Without the need to perform extensive transformations, data can be loaded faster than is possible with more structured approaches like the data warehouse. The data lake applies “schema-on-read,” which lets the schema be developed and adapted on a case-by-case basis, as required at the time of usage. The data lake stores data at the leaf level, where it remains in either a transformed or untransformed state. The analyst or data scientist can create schema data sets in the data lake for a given use case, determine their effective value, and then retain them for further use or discard them as appropriate.
In contrast, a data warehouse is generally architected around one or more centralized repositories of integrated data stored in a relational manner. The data warehouse focuses on the enterprise or business view of data to better understand events and actions in a business context. Data may be sourced from a variety of locations, but it generally comes from operational systems and is enhanced with external data as required. The warehouse’s planned and structured architecture follows “schema-on-write,” which contrasts with the data lake: the structure is designed up front, and data is often cleansed and transformed to meet the specific requirements of the data warehouse before it is stored. As a result, loading the data warehouse takes more time than loading the data lake, both for designing the structure and transformations and for running the transformations during the load processes.
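Schema-on-read can be sketched in a few lines. The example below is an illustration only, not a description of any particular data lake product: the raw records, field names and the read_with_schema helper are all made up. The raw lines stay exactly as the sources delivered them, and each use case applies its own schema when it reads.

```python
import json

# Raw records sit in the lake exactly as the source systems produced them.
# These made-up lines deliberately have different shapes and types.
RAW_LAKE = [
    '{"claim_id": "C-1", "injury": "thumb", "paid": "1200.50", "weather": "rain"}',
    '{"claim_id": "C-2", "injury": "thumb", "paid": 48000}',
    '{"station": "KXYZ", "temp_f": 41}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: pick the fields and cast the values."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        if not all(field in record for field in schema):
            continue  # not relevant to this use case; it stays in the lake untouched
        rows.append({field: cast(record[field]) for field, cast in schema.items()})
    return rows


# Two use cases, two schemas, one unchanged set of raw data.
claim_rows = read_with_schema(RAW_LAKE, {"claim_id": str, "paid": float})
weather_rows = read_with_schema(RAW_LAKE, {"station": str, "temp_f": int})
```

Nothing was rejected or reshaped at load time. If the claims schema turns out not to be useful, the analyst simply discards it and defines another one over the same raw lines.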
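For comparison, here is a minimal sketch of the schema-on-write idea, using Python’s built-in sqlite3 module as a stand-in for a relational warehouse. The claim_fact table, its columns and the load_claim helper are hypothetical. The structure is defined before any data arrives, and each record is cleansed and checked on the way in.

```python
import sqlite3

# The structure is planned first: table, column types and constraints.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE claim_fact (
        claim_id TEXT PRIMARY KEY,
        injury   TEXT NOT NULL,
        paid_usd REAL NOT NULL CHECK (paid_usd >= 0)
    )
    """
)


def load_claim(conn, raw: dict) -> bool:
    """Cleanse and transform a source record, then insert it.

    Returns False when the record cannot meet the warehouse's requirements.
    """
    try:
        row = (
            str(raw["claim_id"]).strip(),
            str(raw["injury"]).strip().lower(),
            float(raw["paid"]),
        )
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO claim_fact VALUES (?, ?, ?)", row)
        return True
    except (KeyError, TypeError, ValueError, sqlite3.IntegrityError):
        return False


# Made-up source records: one clean, one missing a field, one invalid.
loaded = load_claim(conn, {"claim_id": " C-1 ", "injury": "Thumb", "paid": "1200.50"})
missing = load_claim(conn, {"claim_id": "C-2", "injury": "thumb"})
negative = load_claim(conn, {"claim_id": "C-3", "injury": "thumb", "paid": -5})
```

Only the first record is stored, in its cleansed form. The extra design and validation work is what makes loading slower, and it is also what lets business users trust every row they query.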
Data Lakes and Data Warehouses Work Together
While newer in terminology, the data lake is not a replacement for the data warehouse. The two are complementary. The data lake provides a foundation for users and analysts to review, study, model and test theories and make use of a wide variety of data. This allows for analyzing and validating business rules before investing time to put them into production in the data warehouse. In our insurance claim example, both are needed in order for the analysts to make sense of all the accident data and arrive at a practical solution for claims managers. The data lake provides the platform for broad analysis and testing and then sends that output to the data warehouse, which tells the agent which claims to focus resources on.
The data lake and the data warehouse each provide unique opportunities, but for different sets of users. The data lake is relatively unstructured and usually requires deep technical and SQL skills to navigate successfully, so it is typically used by technical or advanced analysts. The data warehouse, on the other hand, is more suitable for general business users and analysts who are focused on reporting and historical analysis.
Data lakes and data warehouses are both elements of an enterprise information strategy. The insurance claims example shows how the two technologies working together can lead to better business processes and outcomes. We have worked with many companies on the design and implementation of both data warehouses and data lakes.
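The hand-off from lake to warehouse might look something like the sketch below. Everything in it is an assumption made for illustration: the sample records, the 25,000 cut-off for a high-loss claim, the night_rule candidate, the 0.8 acceptance level and the validated_rule table. An in-memory SQLite database again stands in for the warehouse.

```python
import json
import sqlite3

# Made-up historical records as they might sit in the lake.
LAKE = [
    '{"claim_id": "C-1", "night": true, "paid": 52000}',
    '{"claim_id": "C-2", "night": false, "paid": 900}',
    '{"claim_id": "C-3", "night": true, "paid": 61000}',
    '{"claim_id": "C-4", "night": false, "paid": 1500}',
]
HIGH_LOSS = 25000  # hypothetical cut-off for a "high loss" claim


def rule_hit_rate(lines, rule):
    """Share of claims matching the rule that turned out to be high loss."""
    matched = [record for record in map(json.loads, lines) if rule(record)]
    if not matched:
        return 0.0
    return sum(record["paid"] >= HIGH_LOSS for record in matched) / len(matched)


def night_rule(record):
    """Candidate business rule: the accident happened at night."""
    return record.get("night", False)


# Step 1: test the candidate rule against raw data in the lake.
hit_rate = rule_hit_rate(LAKE, night_rule)

# Step 2: publish only a rule that proved useful to the warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE validated_rule (name TEXT PRIMARY KEY, hit_rate REAL NOT NULL)"
)
if hit_rate >= 0.8:  # hypothetical acceptance level
    with warehouse:
        warehouse.execute(
            "INSERT INTO validated_rule VALUES (?, ?)", ("night_accident", hit_rate)
        )
```

Rules that fail the test never reach the warehouse, so no design or load effort is spent on them. Rules that pass become structured, reportable data for the people who manage claims.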