Stuck in the Middle: The Future of Data Integration is No ETL
July 13, 2015
The ETL (extract, transform and load) process was born out of necessity, but it is now a relic of the relational database era. Scale-out platforms like Hadoop and Spark provide the means to move beyond ETL, with lower-cost data storage and processing power. Let’s look at how we got here and how we can get the EL out of the ETL process treadmill.
The ETL Process: Your IT Department’s Hidden Burden
As enterprise systems proliferated, ETL enabled IT departments to extract data from relational databases powering mission-critical business applications, transform it into the right aggregations, and load it into data warehouses (DWs) or operational data stores (ODSs) purpose-built for analytics. With every additional source being extracted, this data integration workload grew larger, leaving many companies with timeworn ETL processes that cannot keep up with their 30-40% yearly data growth.
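To make the three steps concrete, here is a minimal sketch of an ETL job. It uses Python's built-in sqlite3 module as a stand-in for both the operational database and the warehouse; the orders and revenue_by_region tables, the sample rows, and the function names are all hypothetical and chosen only for illustration.

```python
import sqlite3


def extract(source):
    """Extract: pull raw order rows out of the operational database."""
    return source.execute("SELECT region, amount FROM orders").fetchall()


def transform(rows):
    """Transform: aggregate raw rows into total revenue per region."""
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0) + amount
    return sorted(totals.items())


def load(warehouse, totals):
    """Load: write the aggregates into the analytics store in one transaction."""
    with warehouse:
        warehouse.execute(
            "CREATE TABLE IF NOT EXISTS revenue_by_region "
            "(region TEXT PRIMARY KEY, total INTEGER)"
        )
        warehouse.executemany(
            "INSERT OR REPLACE INTO revenue_by_region VALUES (?, ?)", totals
        )


def run_etl():
    # Hypothetical operational source with a few sample orders.
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
    source.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("east", 100), ("west", 250), ("east", 50)],
    )
    source.commit()

    # Separate store standing in for the DW or ODS.
    warehouse = sqlite3.connect(":memory:")
    load(warehouse, transform(extract(source)))
    return warehouse.execute(
        "SELECT region, total FROM revenue_by_region ORDER BY region"
    ).fetchall()
```

Every new source adds another extract query and another set of transformations to maintain, which is where the workload described above comes from.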
Keeping the ETL process running has become a daily grind that can include:
Ongoing database tuning to address performance issues
Constant updating of ETL scripts to handle changing sources and reports
Correcting errors and performance issues that can lead to delayed reports as ETL windows are missed
The most direct solution is to scale up existing systems to support larger data volumes, but that is costly, so most IT departments simply have to press on and keep ETL running as best they can. IT departments are responsible for the ETL process, but those on the business side of the organization feel the pain too. For those looking to make decisions based on the data within an ODS or DW, ETL has become a bottleneck that keeps them from accessing timely data.
ETL on Hadoop: Moving Data Integration in the Right Direction
Hadoop’s cost-effective scalability has made it a potential fix for ETL processes that are struggling with data growth. It has the support of major players in the space, such as Cloudera and Dell. By dropping Hadoop into the process to handle data transformation and then using Hive, Impala or other Hadoop-based analytics tools, companies can transform their cost structures and shift away from expensive DW and ODS options.
However, ETL on Hadoop can be brittle because of its batch processing nature. If there are any errors to fix or records to update, the entire ETL job has to be restarted, wasting valuable hours. By trading away the transactional integrity and ACID compliance of relational database systems, ETL on Hadoop can ultimately decrease performance, offsetting the gains in intelligence that come from tapping into more data.
Fixing ETL on Hadoop with a Hadoop RDBMS
Splice Machine solves these problems with a Hadoop RDBMS – a highly concurrent, read-write repository in the form of a fully transactional SQL database that runs on Hadoop. Splice Machine provides the best of both worlds: the scale-out architecture and cost savings of Hadoop, with the full transactional capabilities provided by an RDBMS.
Organizations can update and delete data on the fly reliably, even at the record level. Because Splice Machine is a read-write system with full support for transactions, a job can be restarted from the last fully executed transaction if a failure occurs. Unlike vanilla Hadoop ETL solutions, Splice Machine enables companies to solve issues like duplicate data while the job is running, eliminating time-wasting restarts.
In addition to clearing up ETL process issues, the Splice Machine Hadoop RDBMS can run real-time reporting workloads. This is a valuable capability for organizations as they begin to use Hadoop for more than just ETL. Instead of having data in a file system, the data is in a full relational database that supports other non-ETL processes, like analysis and powering real-time applications.
The Future: No ETL
ETL exists because traditional systems could not handle both OLTP and OLAP in one system and provide good performance for both. ETL on Hadoop opens the door to a new way of thinking about ETL because it changes the cost structure around harnessing Big Data. ETL became critical because it enabled tiered storage where data from operational applications could be managed and manipulated in a data warehouse, instead of burdening the front-end system. A typical Hadoop system can be a substitute for a data warehouse, but not for operational applications.
Now, advances in in-memory technologies like Spark may make ETL obsolete by running OLTP and OLAP applications from the same platform. This eliminates the need to extract or load data, as all applications could pull from one instance of HDFS, with the data simply transformed to fit the format of the target database. This could take today’s ETL processes from hours or days to mere seconds, enabling applications and analysts to benefit from near-real-time data that is delivered seamlessly throughout the day. We look forward to watching more advances in the field.
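As a rough sketch of the data flow this implies, the example below keeps operational writes and analytical reads in one store and expresses the transformation as a view evaluated at query time. It uses Python's built-in sqlite3 module purely as a stand-in; it says nothing about how Spark or HDFS would implement this at scale, and the table, view, and function names are hypothetical.

```python
import sqlite3


def build_store():
    store = sqlite3.connect(":memory:")
    store.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
    # The "transform" is a view computed when queried, so there is
    # no separate extract or load step to schedule.
    store.execute(
        "CREATE VIEW revenue_by_region AS "
        "SELECT region, SUM(amount) AS total FROM orders GROUP BY region"
    )
    return store


def record_order(store, region, amount):
    """Operational (OLTP-style) write."""
    with store:
        store.execute("INSERT INTO orders VALUES (?, ?)", (region, amount))


def report(store):
    """Analytical (OLAP-style) read against the same store."""
    return store.execute(
        "SELECT region, total FROM revenue_by_region ORDER BY region"
    ).fetchall()


def demo_no_etl():
    store = build_store()
    record_order(store, "east", 100)
    before = report(store)
    record_order(store, "west", 250)
    record_order(store, "east", 50)
    return before, report(store)
```

New orders show up in the report as soon as they are committed, with no batch window in between.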