SQL-on-Hadoop Considerations

May 6, 2016
SQL-on-Hadoop Considerations

It’s no surprise that many companies are looking into SQL-on-Hadoop options. Standard SQL enables companies to tap into existing SQL-trained resources while being able to benefit with the scale-out cost-effectiveness of Hadoop.

With so many SQL-on-Hadoop options available, selecting the right one can be a daunting task for any CTO or IT manager. Here are some questions to ask to help ensure that you get a SQL-on-Hadoop solution that meets your specific needs:

  1. Can it support real-time apps?
    This includes real-time operational analytics and traditional operational applications such as web, mobile, and social media applications as well as enterprise software. Real-time applications require queries to respond in milliseconds to seconds versus minutes to hours.
  1. Is the SQL-on-Hadoop really real-time?
    This means real-time queries on real-time data. Some solutions claim to be real-time because they can do real-time ad-hoc queries, but it is not real-time if the data is from yesterday’s ETL (Extract, Transform, Load).
  1. Can it perform reliable, real-time updates?
    Database transactions are required to perform real-time updates without data loss or corruption. This is even important in analytics, as data and secondary indices need to be updated together to ensure consistency.
  1. Is it true SQL?
    In other words, it should support standard ANSI SQL. Many solutions provide a limited, unusual variant that requires retraining and partially rewriting applications.
  1. Can it support concurrent user updates consistently without locking up the database?
    As mentioned above, transactions are critical to supporting concurrent users with reliable, real-time updates. Transactions need to span multiple rows and tables. They also should be lockless for high throughput and to avoid troublesome deadlock and starvation conditions that freeze up applications and render them useless.
  1. Can you customize with UDFs?
    User Defined Functions (UDFs) enable you to add custom functionality to the database. You should be able to write the UDFs in a high-level programming language and parallelize the processing of the UDFs across nodes.
  1. Does it support triggers?
    Triggers watch for certain events (e.g. record insert or deletion) to audit changes or to enforce business rules. This is often critical functionality for many applications.
  1. Does the solution support column-based privileges?
    Can you specify what columns are accessible to user, groups, and roles?
  1. Can you connect to existing tools through ODBC/JDBC connectors?
    ODBC/JDBC connectors allow you to use existing SQL tools such as DbVisualizer and BI tools such as Tableau.
  1. Is there a RESTful API?
    RESTful APIs enable developers to quickly build web-based applications.
  1. Can it efficiently handle sparse data?
    In many large data sets, each attribute or column may be sparsely populated. In traditional databases, an empty value must still be stored as a null, which consumes storage. Modern databases should not require nulls for empty values. For instance if a table has 500 columns and a row only needs 2 columns, then the database should only store 2 values and not 498 nulls.
  1. Can you add a column without table scans?
    Data requirements change frequently and often require schema changes. However, you don’t want to disrupt the business to make those changes. Adding a column should not require full table scans.
  1. Is the data compressed?
    Hadoop reduces storage costs, but it’s still not free due to replication needed for availability. Any solution should compress the data by at least 5x, preferably 10x.
  1. Can it perform fast lookups on small subset of the data?
    Most analysts want to perform interactive, ad-hoc queries on a small subset of data in Hadoop. They don’t want to do batch analysis that takes minutes, if not hours, to complete.
  1. Does it support secondary indexes?
    Often data is organized along one dimension for fast updating (such as a customer number) but later must be looked up by other dimensions (such as zip code). Secondary indexes enable databases to lookup data across many dimensions efficiently.
  1. Can it deliver fast performance on massive joins?
    Analysis on massive data sets often requires at least billion row by billion row joins. No one wants to wait hours for those joins to finish or fail due to memory limitations on nodes.
  1. Does it provide multiple join strategies?
    Joins combine data from multiple tables. With a distributed infrastructure like Hadoop that handles very large data sets, multiple join strategies such as nested loop, sort-merge, and broadcast joins are needed to ensure fast join performance.
  1. Is there a cost-based optimizer?
    Performance on large data sets greatly depends on choosing the right execution strategy. Simple rules-based optimizers are not enough. Cost-based optimizers that account for the actual cost to execute a query are critical to optimal query performance.
  1. Can data import be parallelized?
    Many applications require batch input of data sources which can create a bottleneck. Can the platform use all the available computing resources to import data?
  1. Does the execution of queries exploit data locality?
    In a Hadoop-based system, data is distributed across many machines. A high-performance solution must move computation to where the data is stored locally versus always experiencing the network latency to move data across nodes.
  1. Is In-Memory Technology available?
    If response time is critical for your use cases, in-memory database capabilities give the ability to analyze large amounts of data quickly.
  1. Can it support simultaneous workloads for OLAP and OLTP?
    Many times a database says it can handle OLAP and OLTP, but really if one is being run through the system, it may slow the other down. For instance, if you are running large OLAP queries on a large dataset and at the same time need to do a quick update to the database. A simultaneous workload allows both to happen without interrupting the other or causing delays.

Choosing a SQL-on-Hadoop solution is a critical decision that will have a long-term impact on your application infrastructure. While you may not need a solution that meets all of these criteria currently, it’s important to consider how your requirements may change over time.

A Real-Time SQL-on-Hadoop Database

The Splice Machine RDBMS is the first RDBMS powered by Hadoop and Spark. Leveraging in-memory technology from Spark and scale-out capabilities from Hadoop, Splice Machine can replace Oracle® and MySQL™ databases, while increasing performance by 10-20x at one-fourth the cost. With an innovative, hybrid architecture and advanced resource isolation, the Splice Machine RDBMS provides exceptional performance for simultaneous OLAP and OLTP workloads, enabling companies to unlock the insights in their Big Data to make decisions in the moment. 

Learn More

Download our white paper, Ask the Right Questions When Choosing a SQL-on-Hadoop Solution, to get more information about SQL-on-Hadoop options: