Operational Data Lakes…a Contradiction in Terms?
The Data Lake was introduced as an answer to the problem of important data being locked up in the silos of production applications and departmental databases. The assumption was that loading all of that data into a single repository would reveal important insights about how a company operates. The trend was amplified by Hadoop, which made it possible to store large amounts of data on affordable hardware.
Hadoop Data Lakes welcome data of all types, and an abundance of analysis tools exists to retrieve value from the assembled data. Companies use them for exploratory analytics, looking for new insights that can be applied to improve processes. When that works well, they can also use the Data Lake as a staging area for the Data Warehouse, transforming and aggregating the data to prepare it for loading.
But in pursuit of cheap scale-out infrastructure, the “schema-on-read” approach sacrifices the ability to run operational workloads on the Data Lake. What if you could use the same scale-out infrastructure to capture both structured and unstructured data, and access it through a full implementation of SQL, with millisecond response times on complex queries across petabytes of data? That is what Operational Data Lakes do.
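To make that concrete, here is the kind of operational query an Operational Data Lake is expected to answer with low latency. The schema and values below are hypothetical, used only for illustration:

```sql
-- Illustrative operational lookup: the same tables that serve
-- petabyte-scale analytics must also answer point queries like
-- this in milliseconds. All names here are hypothetical.
SELECT o.order_id, o.order_date, o.status, o.total_amount
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE c.customer_id = 10042
  AND o.status = 'OPEN'
ORDER BY o.order_date DESC;
```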
Powering Operational Data Lakes with Splice Machine
Splice Machine is a relational DBMS that leverages HDFS, HBase and Spark to deliver the economics and horizontal scaling of a Hadoop Data Lake, while offering full ANSI SQL, ACID transactions, and real-time analytics to power even the most demanding operational applications.
As a result, Splice Machine can continuously and concurrently ingest large amounts of data from source systems while supporting transactional applications, such as customer service operations and operational reporting, alongside real-time analytical workloads that surface trends requiring immediate action.
For a detailed description of the Splice Machine architecture, see “How It Works”.
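As a rough sketch of that mixed workload, the statements below pair a transactional write with a real-time aggregate on the same table. The schema is invented for illustration and is not taken from the Splice Machine documentation:

```sql
-- Transactional side: record a customer-service interaction.
-- ACID guarantees keep this safe under concurrent ingest and
-- analytics. Table and column names are illustrative.
INSERT INTO support_tickets (ticket_id, customer_id, opened_at, status)
VALUES (98231, 10042, CURRENT_TIMESTAMP, 'OPEN');

-- Analytical side: the same engine scans the same table to
-- surface trends that may require immediate action.
SELECT status, COUNT(*) AS ticket_count
FROM support_tickets
GROUP BY status
ORDER BY ticket_count DESC;
```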
Cetera Financial Group, for example, is building a single source of truth for 10,000 distributed users. Splice Machine replaced a traditional RDBMS and consolidated multiple disparate legacy databases into a single Enterprise Data Hub that serves a range of applications and use cases.
Sample Operational Data Lake Use Cases
Replace Operational Data Stores
An operational data lake offers the following additional benefits over a traditional operational data store (ODS); a minimal refresh sketch follows the list:
- Based on modern scale-out technology. Compared to older RDBMSs such as Oracle, operational data lakes can be 5-10x faster at 75% lower cost
- Handles semi-structured and unstructured data. As part of a larger Hadoop-based data lake, structured, semi-structured, and unstructured data can be analyzed together
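A minimal sketch of an ODS-style refresh, assuming a staging table fed by the source system; every name here is hypothetical:

```sql
-- Replace the current image of changed accounts with the latest
-- feed from the source system. With autocommit off, the DELETE
-- and INSERT commit atomically, so readers never see a partially
-- applied refresh. Table names are hypothetical.
DELETE FROM account_current
WHERE account_id IN (SELECT account_id FROM account_feed);

INSERT INTO account_current (account_id, balance, updated_at)
SELECT account_id, balance, updated_at
FROM account_feed;

COMMIT;
```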
Offloading Reporting and Analytics Tasks from SQL Databases
As the amount of data in traditional databases grows, reporting and analytical workloads slow down and increasingly interfere with transactional duties.
- Splice Machine lets you run reports and analytics faster and more cheaply, without degrading the performance of the source systems (see the sketch after this list)
- Offloading reporting and analytics can delay, or even avoid, investments in expensive scale-up expansions from traditional database vendors
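One plausible offload flow: bulk-load a periodic extract from the source database, then point reports at Splice Machine. IMPORT_DATA is a documented Splice Machine system procedure, but the argument values and paths below are illustrative; verify the exact signature against the current documentation.

```sql
-- Hedged sketch: bulk-load a nightly extract into Splice Machine,
-- then run the heavy report here instead of on the production
-- OLTP database. Argument values are illustrative only.
CALL SYSCS_UTIL.IMPORT_DATA(
    'SALES', 'ORDERS', null,        -- target schema, table, all columns
    '/data/extracts/orders.csv',    -- hypothetical HDFS path
    ',', '"', null, null, null,     -- delimiters and date/time formats
    0, '/data/bad', true, null);    -- bad-record handling

-- The offloaded report, now isolated from the source system.
SELECT region, SUM(total_amount) AS revenue
FROM sales.orders
GROUP BY region
ORDER BY revenue DESC;
```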
Complementing an Existing Hadoop Data Lake
For an existing Hadoop-based data lake, Splice Machine becomes a powerful and flexible repository for structured data:
- Store structured data directly in its native relational form, with no unnecessary transformations or extractions to flat files
- Support operational applications and reports that run directly against the Data Lake
- Enable ad-hoc analytics on both structured and unstructured data through Hadoop tools such as MapReduce or Hive (see the sketch below)
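For instance, files already sitting in the lake can be surfaced relationally. The sketch below assumes Splice Machine's external-table support for Parquet; the names, columns, and HDFS path are invented, so confirm the syntax against the current documentation.

```sql
-- Hedged sketch: expose Parquet files that already live in the
-- Hadoop data lake as a relational table, then join them with
-- tables Splice Machine manages natively. All names are invented.
CREATE EXTERNAL TABLE clickstream_raw (
    user_id    BIGINT,
    url        VARCHAR(2048),
    event_time TIMESTAMP
)
STORED AS PARQUET
LOCATION '/datalake/clickstream/';

-- Ad-hoc analysis joining lake-resident events with managed
-- customer data, all through standard SQL.
SELECT c.segment, COUNT(*) AS page_views
FROM clickstream_raw e
JOIN customers c ON c.customer_id = e.user_id
GROUP BY c.segment;
```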