The Perils of Modernizing SQL Apps on NoSQL
November 12, 2019
Over the past 40+ years, traditional relational databases like DB2, Oracle, SQL Server, Postgres, MySQL and many others have proved to be the bedrock on which hundreds of billions of dollars of applications have been built. Applications that interact with back-end databases through SQL are ubiquitous across all industries. If all of the applications that rely on SQL disappeared overnight our world would be thrown into absolute chaos. Amongst an incalculable number of effects: you wouldn’t be able to access or spend money in your bank, your healthcare records would be inaccessible, goods couldn’t be routed or shipped, etc. In many ways, modern life would be unrecognizable without SQL-based applications.
Rise of NoSQL Databases
In the early 2000s, the amount of data being generated and stored, the velocity it was arriving to the database, and the variety of data types began to explode. Initially, this was driven by clickstream data being generated by web applications, but today many other data sources, like IoT, generate similar challenges. Traditional databases, where SQL was the primary interface, struggled in this new environment. The vast majority had no practical way to horizontally scale and their rigid typing made them ill-equipped for storing completely unstructured data.
In response to these challenges new approaches were built, primarily based on mid-2000s research at several internet companies (e.g.,, Google, Amazon, and Yahoo) who were being overwhelmed by the data they were collecting.
One notable category that relies heavily on this research are NoSQL databases such as HBase, Cassandra, Dynamo, and MongoDB. These systems were built to deal with massive amounts of data through horizontally scaling architectures that could accommodate semi-structured or unstructured data.
Was Throwing Away SQL a Mistake?
Why was SQL initially abandoned by NoSQL projects and vendors? The primary arguments were 1) that an upfront schema makes your business less agile in dealing with semi-structured and unstructured data and 2) SQL was slow and its expressiveness was unnecessary.
First, it’s definitely true that good schema design is difficult, and requires you to understand your data upfront. When data types are changing rapidly, not being locked into a schema seems imminently reasonable and attractive.
The problem is, that all analysis requires that you understand the structure of your data. Structure, in the form of a defined schema, must be applied to do anything useful. Applying structure only when the data is read, a “schema on read” approach, does reduce up-front data engineering work, but it pushes the problem to the actual data users. Typically, this is not a good tradeoff. There are typically many more data users than data engineers, and pushing the responsibility to data users can make a task that would be done once up-front, to a task that needs to be done repeatedly downstream. Additionally, data users often don’t have much context outside of their immediate responsibilities, and might misidentify or misuse data. There’s also the risk that different data users might not interpret data in the same way, creating problems in understanding and reconciling redundant and inconsistent usages.
Second, it turned out that SQL expressiveness was useful. There are a multitude of reasons for this that we’ll touch on in the next section, but the clearest proof point is that many NoSQL projects and vendors implemented “SQL-like” query languages. Cassandra’s CQL is one example. In some ways, these “SQL-like” query languages are helpful. They effectively map NoSQL specific syntax to more widely understood SQL syntax, reduce the “new language” burden on developers, and potentially open some 3rd party integration.
In other ways having a query language that looks like SQL, but is actually not, is counterproductive. It risks lulling developers into a false sense of confidence. They may overestimate the actual level of SQL support present in these languages, and discover vital gaps only after a project is well underway. Potentially worse, “SQL like” functionality that technically works on a proof-of-concept may not work on the full production system because of differences in functional properties that percolate up from the underlying NoSQL database. Sprinkling some “SQL-like” syntactic sugar on top of a NoSQL database won’t make something like table joins “work” in a production environment if the underlying NoSQL database hasn’t been designed or modified with that purpose in mind.
A lack of SQL in NoSQL systems is in and of itself no inherent advantage. All things being equal, there is no virtue in abandoning a well understood, extremely widely deployed data query language for a NoSQL specific language. On the other hand, abandoning SQL completely or using a “SQL like” language has some clear disadvantages, especially in the context of application modernization.
Abandoning SQL is Risky
Over the last 40+ years, companies have made massive investments in mission-critical applications that embody their core competencies. Data is core to virtually all applications, and legacy databases were the primary choice for data storage and retrieval.
A pressing need to modernize these applications often appears when the data that the application stores and operates over has grown beyond what the underlying legacy database can handle, to make the application smarter by leveraging new sources of data that have been made available since the application’s original deployment, and to incorporate modern techniques for utilizing data like artificial intelligence and machine learning.
For example, a demand planning application for a retailer might begin incorporating data about the current weather or local events that might influence local traffic to more efficiently allocate inventory. Or consider an insurance claims system that incorporates vehicle sensor data and camera imagery to more intelligently process claims.
NoSQL databases are often considered for these application modernization projects. In some ways, this makes sense. They do provide the ability to scale-out to support new data sources. However, when replacing a legacy SQL database with NoSQL, you could be significantly lengthening the project and increasing risk.
First, a lack of full SQL support necessitates a large amount of application rewrite. These applications speak SQL. Even when there is signifcant functionality present in a NoSQL database, using the functionality requires the application to “speak” a custom NoSQL or non-compliant “SQL-like” dialect. Additionally, finding developers with the required expertise is significantly more difficult than the huge numbers of developers already fluent in SQL. Tools to assist in converting existing SQL to NoSQL dialects are incomplete or non-existent.
Next, the underlying data model often needs to be completely rethought. As an example, the lack of performant table joins (or in some cases, the ability to join data at all) can cause an explosion of data denormalization that causes data duplication that wastes hardware resources and necessitates extensive new code to ensure that the duplicated data remains consistent across all of its definitions.
Additionally, it is often difficult to achieve similar performance to legacy databases. NoSQL systems typically excel at short-running operational queries that only access a single row or a handful of rows. Performance on analytical queries that scan large amounts of historical data is often poor, and not sufficient to meet the application’s needs. Often, NoSQL databases are paired with another system more appropriate for long-running analytical queries through an implementation of the Lambda architecture. These architectures are complex, difficult to implement, introduce many potential avenues for data inconsistency, and imply a large ongoing maintenance cost.
Also, many database and SQL features that the application relied upon in the legacy database are often not available and need to be built from scratch by your developers or eliminated entirely. Replicating some features present in legacy SQL databases in NoSQL systems, like a consistency model that supports full ACID support, requires such specialized developer skills and such a huge amount of effort that any project that relies on reimplementation is going to spend a huge chunk of time effectively building a one-off database instead of actually modernizing the application.
A much less risky approach for application modernization is to replace your legacy database with a database that scales-out, allows data type flexibility, and provides in-database machine learning and artificial intelligence without abandoning SQL. Does such a database exist?
Can you get the advantages of NoSQL with SQL?
There are a number of scale-out, distributed SQL systems in the marketplace today like Snowflake, Cockroach, and Google Cloud Spanner as well as the company I co-founded Splice Machine. With all of these, I believe it is possible to get the benefits of NoSQL databases without accepting the problems discussed above. To go deeper in one of the solutions, Splice Machine uses a NoSQL system (HBase) as its underlying storage engine but it layers on extensive capabilities specifically designed to modernize OLTP SQL applications It provides:
- An exhaustive set of supported ANSI SQL syntax
- Row storage with a primary key for quick lookups
- Efficient column storage in tables that can be stored as cheap object storage like S3 or ADLS
- Full ACID compliance and multi-row transaction support
- Support for constraints, referential integrity, primary keys, triggers, and secondary indexes
- In-memory, distributed analytical processing
- A cost based optimizer that devises efficient query plans and can choose to execute these plans directly against HBase (typically for small, “operational” queries) or via Spark executors (typically for larger, “analytical” queries)
- Support for custom stored procedures and functions
- The ability to deploy on-premises or on AWS and Microsoft Azure with Kubernetes
- The ability to store semi-structured or unstructured data directly in database tables or through integrated Spark functionality
- Integrated data science capabilities with in-database analytics, integrated Jupyter notebooks, model workflow management, and cloud deployment capabilities
A scale-out SQL RDBMS like this can facilitate legacy modernization.
First, like NoSQL databases, it can horizontally scale to extremely large data sizes and query volumes, far beyond what’s possible on legacy databases. It also allows extensive flexibility for storing and manipulating data.
Second, it makes migration from the supporting legacy database simple. SQL written against the legacy database needs very little or no rework because of Splice Machine’s exhaustive SQL support. The underlying legacy data model often needs no modification and certainly does not require the wholesale rethinking that a migration to a NoSQL database requires. Splice Machine can handle all of the legacy database workload, whether operational or analytical in nature, without the need to stitch together multiple specialized systems.
Finally, it provides capabilities that allow a migrated application to be “modern”. It can run on-premises or on multiple cloud platforms and abstract away the underlying infrastructure through a modern “containerized” approach. It allows data science to be done directly in-database, so that new data sources and techniques that make applications smarter and more valuable can be applied without the hassle of extracting, transforming, and loading data between multiple systems.
Custom applications built on legacy databases often deliver competitive advantages. That is why companies keep them custom versus licensing packaged applications. Modernizing these applications is often desirable, either because the underlying database is under stress and can’t scale, or because there’s a desire to leverage modern deployment or AI techniques. Replacing the legacy database that underlies these applications with a NoSQL database that requires wholesale rework of the existing data model and business logic is risky and significantly increases the scope and schedule of a modernization project. Scale-out SQL databases like Splice Machine provide scalability and flexibility while providing a smooth migration path from a legacy database system.
To learn more about how to make your custom-built applications agile and infused with intelligence download, our white paper on modernizing legacy applications.