Apache HBase: Why We Use It and Believe In It
September 17, 2015
What is HBase?
Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google’s Bigtable, as described in the paper “Bigtable: A Distributed Storage System for Structured Data.”*
HBase supports random, real-time read/write access with a goal of hosting very large tables atop clusters of commodity hardware. HBase features include:
- Consistent reads and writes
- Automatic and configurable sharding of tables
- Automatic failover support
How HBase Works:
HBase uses ZooKeeper for coordination of “truth” across the cluster. As region servers come online, they register themselves with ZooKeeper as members of the cluster. Region servers have shards of data (partitions of a database table) called “regions”.
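As a rough illustration of how a table is partitioned into regions, the sketch below maps a row key to the region whose key range contains it. HBase regions cover contiguous, sorted ranges of row keys; the split points and server names here are hypothetical, and this is a toy lookup in plain Python rather than the HBase client API.

```python
import bisect

# Hypothetical split points: region i covers [START_KEYS[i], START_KEYS[i + 1]).
# The empty string marks the start of the first region.
REGION_START_KEYS = ["", "g", "n", "t"]
# Hypothetical region server hosting each region (one server may host several).
REGION_SERVERS = ["rs1", "rs2", "rs3", "rs1"]


def locate_region(row_key):
    """Return (region_index, server) for the region whose range holds row_key."""
    # bisect_right finds the first start key greater than row_key;
    # the region just before that position is the one that owns the key.
    index = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return index, REGION_SERVERS[index]
```

Because row keys are kept in sorted order, a lookup such as `locate_region("apple")` only needs a binary search over the region start keys.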
When a change is made to a row, the change is recorded in a persistent write-ahead log (WAL) file and in Memstore, the sorted in-memory cache for HBase. Once Memstore fills, its changes are “flushed” to HFiles in HDFS. The WAL ensures that HBase does not lose the change if Memstore loses its data before it is written to an HFile.
During a read, HBase first checks whether the data exists in Memstore, which can provide the fastest response through direct memory access. If the data is not in Memstore, HBase retrieves it from the HFiles.
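The write and read paths described above can be sketched with a small in-memory model. This is a simplified illustration, not HBase code: the class `ToyRegion`, its method names, and the `flush_threshold` parameter are all hypothetical, Python lists and dicts stand in for the WAL file and HFiles, and real HBase tracks WAL and flush state in far more detail.

```python
class ToyRegion:
    """Toy model of one region's write and read path (not the HBase API)."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # append-only list of edits; stands in for the WAL file
        self.memstore = {}   # in-memory cache of recent edits
        self.hfiles = []     # immutable sorted snapshots, newest last; stand in for HFiles
        self.flush_threshold = flush_threshold

    def put(self, row, value):
        self.wal.append((row, value))   # 1. record the change in the log first
        self.memstore[row] = value      # 2. then apply it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()                # 3. a full Memstore is flushed

    def flush(self):
        # Write the Memstore contents out as a sorted, immutable snapshot.
        self.hfiles.append(dict(sorted(self.memstore.items())))
        self.memstore = {}
        self.wal = []   # flushed edits no longer need to be replayed

    def get(self, row):
        if row in self.memstore:             # fastest path: data still in memory
            return self.memstore[row]
        for hfile in reversed(self.hfiles):  # otherwise search newest file first
            if row in hfile:
                return hfile[row]
        return None

    def crash_and_recover(self):
        # Simulate losing memory, then rebuild Memstore by replaying the log.
        self.memstore = {}
        for row, value in self.wal:
            self.memstore[row] = value
```

In this model, an edit that has not yet been flushed survives `crash_and_recover()` only because it was appended to the log before it was applied in memory, which is the role the WAL plays in the description above.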
HFiles are replicated by HDFS, typically to at least 3 nodes. HBase always writes to the local node first and then replicates to other nodes. In the event of a node failure, HBase will assign the regions to another node that has a local HFile copy replicated by HDFS.
Why do we use HBase?
Splice Machine has chosen to replace the storage engine in Apache Derby (our customized SQL database) with HBase to leverage its ability to scale out on commodity hardware. HBase coprocessors are used to embed Splice Machine in each distributed HBase region (i.e., data shard). This enables Splice Machine to achieve massive parallelization by pushing the computation down to each data shard.
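To make the idea of pushing computation down to each shard concrete, here is a minimal sketch of a parallel average. It is not Splice Machine or HBase coprocessor code: the function names and the `amount` column are hypothetical, and a thread pool stands in for work running next to each region. Each shard computes a small partial result locally, and only those partials are combined centrally.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum_and_count(shard_rows):
    """Work done next to the data: reduce one shard to a (sum, count) pair."""
    values = [row["amount"] for row in shard_rows]
    return sum(values), len(values)


def distributed_average(shards):
    """Run the partial aggregation on every shard in parallel, then combine."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = list(pool.map(partial_sum_and_count, shards))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None
```

The design point is that only one `(sum, count)` pair per shard crosses the network, rather than every row, so the work scales with the number of shards.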
Benefits of HBase within Splice Machine include:
- Strong consistency – writes and reads are always consistent, in contrast to eventually consistent databases like Cassandra
- Proven scalability to dozens of petabytes
- Scaling with commodity hardware
- Cost-effective from gigabytes to petabytes
- High availability through failover and replication
- Parallelized query execution across the cluster
Splice Machine does not modify HBase, so it may be used with any standard Hadoop distribution that has HBase. Supported Hadoop distributions include Cloudera, MapR and Hortonworks.
Splice Machine has an innovative integration with HBase, including:
- An asynchronous write pipeline, which supports non-blocking, parallel writes across the cluster.
- Synchronization-free internal scanners, in contrast to synchronized external scanners.
- A resource manager modeled on the Linux scheduler, with resource queues that handle DDL, DML, dictionary, and maintenance operations.
- Sparse data support, which stores sparse data efficiently by not storing nulls.
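The sparse data point can be illustrated with a small sketch. It assumes a cell-oriented layout in which each stored value is keyed by row and column, so a null column simply has no cell; the function names and the sample record are hypothetical and do not reflect Splice Machine's actual encoding.

```python
def to_sparse_cells(row_key, record):
    """Keep one cell per non-null column; null columns take no storage."""
    return {
        (row_key, column): value
        for column, value in record.items()
        if value is not None
    }


def read_column(cells, row_key, column):
    """A missing cell is read back as null (None)."""
    return cells.get((row_key, column))
```

For a record such as `{"name": "Ada", "phone": None, "email": None, "city": "London"}`, only the `name` and `city` cells are stored, yet reading `phone` still yields `None`.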
Splice Machine's schema advantage on HBase includes non-blocking schema changes: you can add columns in a DDL transaction without locking reads or writes while the columns are being added.
White Paper: Learn more about Splice Machine in our white paper.
Blog: Find out more about our use of Apache Derby.
* Source: Apache HBase http://hbase.apache.org