Operational Data Lakes…a Contradiction in Terms?

The Data Lake was introduced as an answer to the problem of important data being locked up in silos of production applications and departmental databases. The assumption was that consolidating all of that data in a single repository would reveal important insights about the operation of a company. The trend was amplified by Hadoop, which made it possible to store large amounts of data on affordable hardware.

Hadoop Data Lakes accommodate data of all types, and there is an abundance of analysis tools for retrieving value from the assembled data. Companies run exploratory analytics against them to uncover new insights that can be applied to improve processes. Where that proves effective, they can also use the Data Lake as a staging area for the Data Warehouse, transforming and aggregating the data before it is loaded.

But in pursuit of cheap scale-out infrastructure, the “schema-on-read” approach sacrifices the ability to run operational workloads on the Data Lake. What if you could use the same scale-out infrastructure to capture both structured and unstructured data, and access the information through a full implementation of SQL, with millisecond response times on complex queries across petabytes of data? That is what Operational Data Lakes do.

Powering Operational Data Lakes with Splice Machine

Splice Machine is a relational DBMS that leverages HDFS, HBase, and Spark to deliver the economics and horizontal scaling of a Hadoop Data Lake, while offering full ANSI SQL, ACID transactions, and real-time analytics to power even the most demanding operational applications.

The result is that Splice Machine can continuously and concurrently ingest large amounts of data from source systems, while supporting transactional applications such as customer service operations and operational reporting, as well as real-time analytical workloads to discover trends that require immediate action.
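As a concrete illustration, here is a minimal JDBC sketch of both sides of that workload: a short ACID transaction that records a customer-service event, followed by a real-time aggregate query over the same table using plain ANSI SQL. The connection URL, credentials, and the CUSTOMER_EVENTS table are hypothetical; verify the actual driver class, host, port, and database name against your own deployment.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Timestamp;
    import java.time.Instant;
    import java.time.temporal.ChronoUnit;

    public class OperationalWorkloadSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical connection details; Splice Machine ships a
            // Derby-derived JDBC driver, which must be on the classpath.
            String url = "jdbc:splice://localhost:1527/splicedb";
            try (Connection conn = DriverManager.getConnection(url, "app_user", "app_pass")) {
                conn.setAutoCommit(false); // explicit ACID transaction

                // Transactional write: log a customer-service interaction.
                try (PreparedStatement insert = conn.prepareStatement(
                        "INSERT INTO CUSTOMER_EVENTS (CUSTOMER_ID, EVENT_TYPE, EVENT_TS) "
                        + "VALUES (?, ?, CURRENT_TIMESTAMP)")) {
                    insert.setLong(1, 42L);
                    insert.setString(2, "SUPPORT_CALL");
                    insert.executeUpdate();
                }
                conn.commit();

                // Real-time analytics over the same table: events per type
                // in the last 24 hours.
                try (PreparedStatement report = conn.prepareStatement(
                        "SELECT EVENT_TYPE, COUNT(*) AS EVENTS FROM CUSTOMER_EVENTS "
                        + "WHERE EVENT_TS > ? GROUP BY EVENT_TYPE")) {
                    report.setTimestamp(1,
                            Timestamp.from(Instant.now().minus(1, ChronoUnit.DAYS)));
                    try (ResultSet rs = report.executeQuery()) {
                        while (rs.next()) {
                            System.out.printf("%s: %d%n",
                                    rs.getString("EVENT_TYPE"), rs.getLong("EVENTS"));
                        }
                    }
                }
            }
        }
    }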

For a detailed description of the Splice Machine architecture, see “How It Works”.

"As an expanding company, our wealth management technology platform experienced rapid data growth, and we needed additional tools to quickly access our growth in analytic data to guide strategic decisions and optimize our business processes. Moving to an Enterprise Data Hub powered by Splice Machine resulted in significant performance improvements." Mohan Gurupackiam CTO, Cetera Financial Group

Cetera Financial Group is building a single source of truth for 10,000 distributed users. Splice Machine has replaced a traditional RDBMS and consolidated multiple disparate legacy databases into a single Enterprise Data Hub that serves a range of applications and use cases.

Sample Operational Data Lake Use Cases

Replace Operational Data Stores

An operational data lake offers the following benefits over a traditional operational data store (ODS):

  • Built on modern scale-out technology. Compared to legacy RDBMSs such as Oracle, operational data lakes can be 5-10x faster at 75% less cost
  • Support for semi-structured and unstructured data. As part of a larger Hadoop-based data lake, an operational data lake lets you analyze structured, semi-structured, and unstructured data together

Offloading Reporting and Analytics Tasks from SQL Databases

As the amount of data in a traditional database grows, its performance on reporting and analytical workloads suffers, and those workloads in turn degrade its transactional duties.

  • Splice Machine lets you run reports and analytics faster and more cheaply, without impacting the performance of the source systems (see the sketch after this list)
  • Offloading reporting and analytics can delay, or even avoid, investments in expensive scale-up expansions from traditional database vendors
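To make the offload concrete, the sketch below runs the same ANSI SQL report against whichever JDBC endpoint it is given, so repointing it from the production database to Splice Machine is a one-line change. Both connection URLs, the credentials, and the SALES table are hypothetical examples.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class SalesReport {
        // Hypothetical endpoints: the source OLTP database and the
        // operational data lake that takes over the reporting load.
        static final String OLTP_URL   = "jdbc:oracle:thin:@oltp-host:1521/PROD";
        static final String SPLICE_URL = "jdbc:splice://splice-host:1527/splicedb";

        // The report itself is plain ANSI SQL, so it runs unchanged on
        // either endpoint.
        public static void runReport(String jdbcUrl, String user, String password)
                throws Exception {
            try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT REGION, SUM(AMOUNT) AS TOTAL_SALES "
                     + "FROM SALES GROUP BY REGION ORDER BY TOTAL_SALES DESC")) {
                while (rs.next()) {
                    System.out.printf("%-12s %,12.2f%n",
                            rs.getString("REGION"), rs.getDouble("TOTAL_SALES"));
                }
            }
        }

        public static void main(String[] args) throws Exception {
            // After offloading, the report points at the operational data
            // lake instead of the production database.
            runReport(SPLICE_URL, "report_user", "report_password");
        }
    }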

Complementing an Existing Hadoop Data Lake

For an existing Hadoop-based data lake, Splice Machine becomes a powerful and flexible repository for structured data:

  • Splice Machine lets you store structured data directly, in the same relational form as its source systems, with no unnecessary transformations or extractions to flat files
  • Support operational applications and reports that run directly against the Data Lake
  • Enable ad-hoc analytics through Hadoop tools such as MapReduce or Hive on both structured and unstructured data (a sketch follows this list)
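As an illustration of that last point, here is a minimal Spark sketch (Spark being part of the same Hadoop ecosystem, and one of the engines Splice Machine itself leverages) that joins structured data read from Splice Machine through Spark's generic JDBC source with semi-structured JSON clickstream files stored elsewhere in the lake. The connection URL, credentials, table name, and HDFS path are all hypothetical, and a deployment may prefer a native Splice Machine Spark connector over the generic JDBC source used here.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class AdHocAnalytics {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("AdHocAnalytics")
                    .getOrCreate();

            // Structured data: read a hypothetical CUSTOMERS table from
            // Splice Machine via Spark's generic JDBC data source.
            Dataset<Row> customers = spark.read()
                    .format("jdbc")
                    .option("url", "jdbc:splice://splice-host:1527/splicedb")
                    .option("dbtable", "CUSTOMERS")
                    .option("user", "analyst")
                    .option("password", "secret")
                    .load();

            // Semi-structured data: JSON clickstream files sitting
            // elsewhere in the HDFS data lake (hypothetical path).
            Dataset<Row> clicks = spark.read().json("hdfs:///lake/raw/clickstream/");

            // Join the two and count clicks per customer segment.
            customers.join(clicks,
                        customers.col("CUSTOMER_ID").equalTo(clicks.col("customer_id")))
                    .groupBy(customers.col("SEGMENT"))
                    .count()
                    .show();

            spark.stop();
        }
    }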