Chicago Area Hadoop User Group (CHUG): Using HBase Co-Processors to Build a Distributed, Transactional RDBMS
July 30, 2014: 5:30 PM – 7:30 PM
From the promotional materials:
John Leach, Co-Founder and CTO of Splice Machine, with 15+ years of software development and machine learning experience, will discuss how to use HBase co-processors to build an ANSI SQL-99 database with 1) parallelization of SQL execution plans, 2) ACID transactions with snapshot isolation, and 3) consistent secondary indexing.
Transactions are critical in traditional RDBMSs because they ensure reliable updates across multiple rows and tables. Most operational applications require transactions, but even analytics systems use transactions to reliably update secondary indexes after a record insert or update.
In the Hadoop ecosystem, HBase is a key-value store with real-time updates, but it does not have multi-row, multi-table transactions, secondary indexes, or a robust query language like SQL. Combining SQL with a full transactional model over HBase opens a whole new set of OLTP and OLAP use cases for Hadoop that were traditionally reserved for RDBMSs like MySQL or Oracle. However, a transactional HBase system has the advantage of scaling out with commodity servers, leading to a 5x-10x cost savings over traditional databases like MySQL or Oracle.
HBase co-processors, introduced in release 0.92, provide a flexible and high-performance framework to extend HBase. In this talk, we show how we used HBase co-processors to support a full ANSI SQL RDBMS without modifying the core HBase source. We will discuss how endpoint co-processors are used to serialize SQL execution plans over to regions so that computation is local to where the data is stored. Additionally, we will show how observer co-processors simultaneously support both transactions and secondary indexing.
The talk will also discuss how Splice Machine extended the work of Google Percolator, Yahoo Labs’ OMID, and the University of Waterloo on distributed snapshot isolation for transactions. Lastly, performance benchmarks will be provided, including full TPC-C and TPC-H results that show how Hadoop/HBase can be a replacement of traditional RDBMS solutions.
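To make the observer idea concrete, here is a minimal, hypothetical sketch of a RegionObserver co-processor that maintains a dense secondary index by writing an index row on every base-table write. This is not Splice Machine's implementation; the class, table, and column names are invented, and the prePut signature matches the HBase 0.96/0.98-era API that was current at the time of the talk.

```java
import java.io.IOException;

import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

public class SecondaryIndexObserver extends BaseRegionObserver {

    private static final byte[] FAMILY = Bytes.toBytes("d");
    private static final byte[] EMAIL  = Bytes.toBytes("email");

    @Override
    public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                       Put put, WALEdit edit, Durability durability) throws IOException {
        // Only index rows that actually carry the indexed column.
        if (!put.has(FAMILY, EMAIL)) {
            return;
        }
        byte[] indexedValue = CellUtil.cloneValue(put.get(FAMILY, EMAIL).get(0));

        // Index row key = indexed value + base-table row key: one index row per
        // base row, i.e. a "dense" index.
        Put indexPut = new Put(Bytes.add(indexedValue, put.getRow()));
        indexPut.add(FAMILY, Bytes.toBytes("src"), put.getRow());

        // Write the index entry through the co-processor environment.
        HTableInterface indexTable =
            ctx.getEnvironment().getTable(TableName.valueOf("users_by_email_idx"));
        try {
            indexTable.put(indexPut);
        } finally {
            indexTable.close();
        }
    }
}
```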
- Splice Machine is the only Hadoop RDBMS
- John Leach – Splice Machine Co-Founder and CTO
- why does Splice Machine exist?
- Leach was fortunate enough to try to aid the world's largest retailer in adopting Hadoop
- the team ran a POC which was initially very impressive
- they took massive POS files and loaded them into Hadoop very quickly
- together with the ability to query this data with Hive etc, they thought this was the way to go
- however, while running the POC they started thinking about what the actual project was going to look like beyond the POC
- they came to realize that there were thousands of data sources throughout the organization being updated throughout the day and night
- this data had duplication in it, so they would have to dedup the data
- there were data corruption issues as well, so they would have to put constraints on the data to make sure that they would not end up querying garbage data
- project stakeholders began quoting industry publications that talked about how Hadoop was going to change the world, and questioned why Leach was just calling Hadoop a file system
- the executives sat Leach down, and said that if using Hadoop was going to be the solution, they should be able to hook in applications they already own into it
- they basically said – we don't want a team of 100 people writing "some NoSQL serialization stuff with thousands of lines of Java code" – we already have a lot of code out there already written with standard ANSI SQL – "what's going on here?"
- this was about 4 years ago, and Leach has worked on a few other projects since
- one project was for a biotech firm where stakeholders wanted to update data as well as query data with a consistent view
- Leach responded by inquiring as to how they wanted to interact with the data, and they responded "with SQL" of course – and what are you going to use for SQL? – long story short, the project ended up stuck in research land
- another project at a consumer product firm about 2 1/2 years ago involved stakeholders who wanted to run a complete CRM repository on top of Hadoop – Leach wondered how that was possible
- Leach ended up partnering with some colleagues from a prior venture to start Splice Machine
- essentially, what they wanted to do was to build a relational database on top of Hadoop and HBase, to enable the same type of standard workloads that people are used to running on Oracle, PostgreSQL, and MySQL – using the ANSI SQL:1999 standard
- why was this important? they wanted a transactional system, one that could work with any existing application – whether it be Hibernate, Unica, Cognos, etc – without any customizations or changing of SQL dialects
- so far, Splice Machine has raised $22m and has about 50 people in San Francisco
- data is doubling every 2 years, driven by web, social, mobile, and the internet of things
- perhaps more importantly, operational data is growing fast as well
- operational data stores are under a lot of pressure – not just the little data marts or data warehouse snapshots that analytics are run against, but operational data stores in general
- go to your Oracle folks and ask them how many RACs they have
- thousands? and next year there will be a lot more – this data is growing very quickly
- traditional RDBMSs are overwhelmed, and scale-up is becoming cost prohibitive
- "reports take forever"
- "data has to be thrown away"
- "Oracle is too darn expensive"
- "my database is hitting the wall"
- "users keep getting those spinning beach balls"
- scale-out is the future of databases
- scale-up = increase server size
- scale-out = more small servers
- dramatic improvement in price/performance
- Splice Machine is the only Hadoop RDBMS
- replace your old RDBMS with a scale-out SQL database
- affordable scale-out, ACID transactions, no application rewrites
- case study – Harte-Hanks, a digital marketing services company
- real-time campaign management
- complex OLTP and OLAP environment, traditionally a non-Hadoop environment – nobody would say, "Let's put this on Hadoop"
- even though they needed scale and cost reduction, most would say you can't do that
- in addition to Cognos and Unica, they were using Ab Initio for ETL
- Oracle RAC too expensive to scale
- queries too slow – even up to 30 minutes
- getting worse with 30%-50% data growth
- looked 9 months for cost effective solution
- they wanted something affordable that could handle the types of workloads they experienced
- in the case of Harte-Hanks, they spent about 2 years optimizing their SQL
- campaign management is "a stinker" – it's really hard, because even for a medium-sized retailer you still have to send out say 3m-4m emails
- to accurately score that is a decent OLAP workload
- you have to look at all the customer transactions during the time periods under investigation to see what really happened
- it's a decent amount of activity that needs to be done all while serving as a list selection tool
- in response to a question from the audience, without getting too specific with regard to this particular client, Leach indicated that firms in this space typically struggle with 3TB to 12TB of data
- many would say this really isn't a lot of data – but there are a lot of companies in this space
- Leach mentioned Exadata, but didn't go into detail and just called it "a different cost structure"
- initial results: 10x-20x price/performance improvement with no application, BI, or ETL rewrites
- 1/4 the cost of Oracle environment with commodity scale out
- 3x-7x faster through parallelized queries
- in response to a question from the audience, Leach indicated that most of the database activity here was reads, and that HBase traditionally executes inserts quickly
- HBase uses an LSM tree, so as data comes in it goes into a concurrent skip list, a pretty efficient real-time structure, and while that happens a write-ahead log entry is written
- Splice Machine batches write-ahead log entries, and has a parallel write pipeline where reads are not blocked
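As a rough illustration of that write path (a toy model, not HBase internals): a write is logged first and then inserted into a sorted concurrent skip list, which is what keeps recent writes readable before anything is flushed to disk.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.CopyOnWriteArrayList;

public class ToyMemstore {

    // Unsigned lexicographic byte[] comparison, like HBase's row-key ordering.
    private static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    private final Map<byte[], byte[]> memstore =
            new ConcurrentSkipListMap<>(ToyMemstore::compareBytes);
    private final List<byte[]> writeAheadLog = new CopyOnWriteArrayList<>();

    public void put(byte[] key, byte[] value) {
        writeAheadLog.add(value);   // 1. durability first (stand-in for a WAL append)
        memstore.put(key, value);   // 2. then the sorted in-memory structure
    }

    public byte[] get(byte[] key) {
        return memstore.get(key);   // recent writes are readable immediately
    }
}
```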
- Splice Machine focuses on price/performance because in reality databases can run faster and faster, they just cost an arm and a leg
- Leach noted that they guarantee no data loss, which is difficult to guarantee with Hadoop
- Splice Machine waits for files to sync on multiple machines, and unlike MapR, Splice Machine has dual name nodes
- Leach went over some reference architectures
- for operational applications, affordable scale-out is provided with a high concurrency of real-time reads and writes
- Splice Machine is really targeted at the Oracle market
- when the company was just getting started, an analyst asked them how many customers they thought Teradata has, and Leach responded with "millions" perhaps, who knows? – and the analyst said that Teradata actually had about 1500 at the time
- how about Oracle? – at the time, about 400k customers
- the analyst indicated that he thought the biggest challenge was that people looked at databases as OLTP tuned for transactions – or OLAP for analytics – applications actually sit in the middle and require a certain set of functions to run – and that is where Splice Machine should play, that is why Oracle is such a successful company
- the Splice Machine concept of the data lake is a bit different
- the general concept of the data lake is to just throw some files out there, and then eventually when someone wants to run some analytics they can do that
- the data just sits there – it's not cleansed or constrained, it's not structured
- there are some pros to having data in this state, applying the schema on read – you don't apply the schema until the data is read from the files
- Splice Machine applies the schema on write
- this means that primary key constraints, uniqueness constraints, and referential constraints can be enforced
- in addition, data can be typed, so if something is wrong with the data, it won't be loaded
- someone from the audience inquired about the types of connectors that Splice Machine offers, and Leach indicated that they haven't gone that route – the three options that are offered are JDBC, ODBC, and product-specific command-line tools that were created to load data from HDFS into Splice Machine
- the company offers a number of demos, including Node.js, Grails, PHP with ODBC, etc to demonstrate JDBC and ODBC connectivity
- another reference architecture that Leach discussed was the unified customer profile
- where high concurrency exists and you are streaming data
- everyone says, "I'm going to use Spark!" – but the key thing when using any streaming technology is that you need a sink, since a stream can only hold so much data at once and you need a place to put the data
- when Splice Machine picked their first customers for POCs, they obviously wanted to pick different databases to go up against
- sequences for example are really hard in a distributed system
- Leach indicated that he doesn't think Hive really belongs on this particular slide, and thinks it's really about relational database technology – involving ACID and updates to data
- in response to a question from the audience, Leach indicated that DB2 is a "weird one for us" because Splice Machine is actually built using Apache Derby, which follows DB2 syntax
- a big advantage for Splice Machine is that a lot of software packages use DB2 dialects, and being able to walk in with the knowledge that Apache Derby is under the covers is really powerful
- Hadoop is going to be the fabric for file system based storage, and while this may change over the years – it may not be called Hadoop, for example – there is still going to be a distributed file system, and on top of that file system will be a relational database
- when Splice Machine set out to create a product, they obviously didn't want to write a SQL parser, so they looked at existing databases and decided on Apache Derby
- Apache Derby is ANSI SQL:1999 compliant, Java-based, is actually used quite a bit on cell phones, and is extremely lightweight, which is very appealing
- Apache Derby is also ODBC/JDBC compliant, and has unit tests built over a period of 10 years, so the team didn't need to hire people to write these – the testing team would likely be bigger than the development team
- Leach initially looked at Apache Cassandra, Voldemort, etc. – every key-value store out there – ran performance tests, and followed Facebook's guidance in the sense that it's really nice to have a consistent database
- HBase/HDFS is attractive because of the scalability it provides, although Facebook did tweak it a bit to get it to scale so well to petabytes
- Apache Derby is modular, lightweight, unicode – having i18n was a big thing for the Splice Machine team
- Apache Derby provides native authentication as well as LDAP, and the Apache Derby team was clever enough to allow a custom authenticator to be plugged in
- Apache Derby provides authorization down to the column
- Apache Derby also has an ARIES-based concurrency model, which Leach indicated was "a little bit of a tough model"
- Apache Derby is a mature product and has been around for about 12 years
- advanced features that Apache Derby provides are Java stored procedures, triggers, two-phase commit (XA support), updatable SQL views, full transaction isolation support, encryption, and Java custom functions
- if you are using Java and want to execute a SQL statement, you create a PreparedStatement
- Splice Machine looks it up in the cache, parses it with the JavaCC parser, and binds it to the dictionary, which Leach emphasized is under snapshot isolation as well
- the plan is then optimized using a cost based optimizer, and generates code for the plan in bytecode, which is shipped to the nodes
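On the client side that whole lifecycle is hidden behind standard JDBC; the sketch below assumes a Derby-style client URL (host, port, database name, table, and credentials are placeholders, not documented Splice Machine defaults).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PreparedStatementExample {
    public static void main(String[] args) throws Exception {
        // Illustrative Derby-style network client URL.
        String url = "jdbc:splice://localhost:1527/splicedb;user=app;password=app";
        try (Connection conn = DriverManager.getConnection(url)) {
            String sql = "SELECT customer_id, total FROM orders WHERE order_date > ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setDate(1, java.sql.Date.valueOf("2014-01-01"));
                // Caching, parsing, binding, optimization, and bytecode generation
                // all happen inside the Derby-derived engine when this executes.
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getLong(1) + " " + rs.getBigDecimal(2));
                    }
                }
            }
        }
    }
}
```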
- Apache Derby is block file-based, similar to most singleton type databases, and the Splice Machine team replaced this storage with HBase tables
- indexes were B-Tree in Apache Derby, and are dense indexes in HBase, which is a fancy way of saying that every row in the base table has a corresponding row in the index, which has advantages and disadvantages
- the plan is to implement other index strategies in Splice Machine, but right now the focus is dense indexes
- concurrency is the area in which the Splice Machine team spent the most time – Apache Derby is a lock-based ARIES system, which basically means that when you update data you lock it, which doesn't scale – so it was replaced with an MVCC structure, which HBase provides
- for the project-restrict plan, Apache Derby predicates are on a centralized file scanner, and now the predicates are actually pushed into the HBase filters
- the amount of data going across the network needs to be restricted – the network is your enemy – any time you need to do anything over the network, costs are significant – the cost based optimizer penalizes for anything that needs to be performed over the network
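A hypothetical illustration of pushing a predicate into an HBase filter, using the plain HBase 0.96/0.98-era client API (table and column names invented): the restriction travels with the Scan, so the region servers drop non-matching rows before anything crosses the network.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class PredicatePushdownExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "orders");
        try {
            Scan scan = new Scan();
            // Equivalent of WHERE status = 'SHIPPED', evaluated region-side.
            SingleColumnValueFilter filter = new SingleColumnValueFilter(
                    Bytes.toBytes("d"), Bytes.toBytes("status"),
                    CompareOp.EQUAL, Bytes.toBytes("SHIPPED"));
            filter.setFilterIfMissing(true);
            scan.setFilter(filter);

            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            } finally {
                scanner.close();
            }
        } finally {
            table.close();
        }
    }
}
```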
- Apache Derby aggregations are serially computed, whereas with Splice Machine aggregations are pushed to shards and spliced back together (hence the name of the company)
- Apache Derby has 2 join plans – basically hash, and everything else is a nested loop, which doesn't scale well – so 5 join plans were implemented for Splice Machine: a classic broadcast join, where a small table is pushed to a big table; a sort-merge join, where keys are created on the fly to join 2 tables, requiring the tables to be reshuffled so the merge can be done; a merge join; a nested loop join; and a batch nested loop join (implemented in Oracle 12, and also recently implemented in MySQL), which is much faster than a plain nested loop join – a schematic broadcast join appears below
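The broadcast plan is the easiest to picture; this schematic (not Splice Machine's generated operator code) assumes the small table has already been shipped to every node as an in-memory hash map, so each big-table row joins locally with no per-row network traffic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class BroadcastJoinSketch {

    // bigTableRows: {customerId, orderId} pairs streamed from the local region;
    // smallTable: the broadcast side, a full copy held in memory on every node.
    static List<String> broadcastHashJoin(List<long[]> bigTableRows,
                                          Map<Long, String> smallTable) {
        List<String> joined = new ArrayList<>();
        for (long[] row : bigTableRows) {
            String customerName = smallTable.get(row[0]);
            if (customerName != null) {
                joined.add("order " + row[1] + " -> " + customerName);
            }
        }
        return joined;
    }
}
```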
- while Leach was providing an overview of some HBase features, he called out the fact that the team spent a lot of time on serialization
- all data is stored in byte sorted order and packed into one column, stored basically the same as a relational database
- why is this done? if you take say a 50-column table defined with varchar(20) fields that takes up 1TB in Oracle, it would take up 10TB in HBase compressed
- in Oracle you have a bitset indicating which fields are set, with the values packed together
- in HBase you have a key-value for each column, repeating the family name, column name, and timestamp – you now have an I/O structure that doesn't work
- customers have complained about HBase being slow – Leach inquires as to how they serialized the data, and they basically say that they took whatever was in their relational database and jammed it into each column in HBase
- it's slow because it takes a long time to scan 10TB of data – you can't fix I/O – once it's broken, it's broken
- 3x compression and 3x replication roughly cancel each other out, so Splice Machine stores that 1TB of data in about 1TB on disk
- this is one area of Hadoop architecture that isn't discussed very often – if you want low latency, I/O matters – you need to try to use as little data on disk as possible, because you have to read it
- Leach reiterated that both compression and serialization are important here
- storing the data in sortable bytes provides the ability to perform byte level comparisons
- if you are going to build a database on top of Hadoop, serialization is a really important piece – how big is the data on disk? how are predicates being applied? – byte level comparisons or conversion to objects, in which case you will have heap problems because you are creating a lot of objects
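A simplified sketch of the "sortable bytes" idea (not Splice Machine's actual encoder): each value is encoded so that unsigned byte comparison of the encoded form matches the natural ordering of the value, and all of a row's columns are packed into a single byte[] rather than one HBase key-value per column.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SortableBytesExample {

    // Flip the sign bit so lexicographic unsigned comparison of the 4 bytes
    // matches numeric ordering of the int.
    static byte[] encodeInt(int v) {
        int flipped = v ^ 0x80000000;
        return new byte[] {
            (byte) (flipped >>> 24), (byte) (flipped >>> 16),
            (byte) (flipped >>> 8),  (byte) flipped
        };
    }

    // Pack several encoded columns into one value, using 0x00 as a separator
    // (a real encoder would escape separators; this only shows the idea).
    static byte[] packRow(byte[]... encodedColumns) {
        int len = 0;
        for (byte[] col : encodedColumns) len += col.length + 1;
        byte[] packed = new byte[len];
        int pos = 0;
        for (byte[] col : encodedColumns) {
            System.arraycopy(col, 0, packed, pos, col.length);
            pos += col.length;
            packed[pos++] = 0x00; // field separator
        }
        return packed;
    }

    public static void main(String[] args) {
        byte[] row = packRow(encodeInt(42),
                             "chicago".getBytes(StandardCharsets.UTF_8));
        // Predicates can now be applied with byte-level comparisons, without
        // deserializing into Java objects.
        System.out.println(Arrays.toString(row));
    }
}
```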
- in response to a question from the audience, Leach explained that the way databases are moving forward, they are no longer OLTP or OLAP – they are hybrids – data comes in row-based and stored in some sort of columnar structure – the best paper on this subject is probably the one on C-Store by Stonebraker, which was commercialized by Vertica
- Leach explains that he cringes when people talk about row-based and column-based databases, because a coherent database should really be both
- it's not even columnar any more – it's whether you've already computed the data, already joined the data
- if you think about ETL that is done at night, followed by reads during the day, the data should really be converted to a columnar format if it's not being written to – it provides better compression, better I/O, faster speeds
- so what did the Splice Machine team do with HBase?
- the biggest was the asynchronous write pipeline
- writes to HBase are "terrible" – they block
- so if you read 1k records and want to send them across the wire, while you send them you can't read any more records
- this is just not workable – a database needs an asynchronous write pipeline, where you can read, say, 1m records as fast as possible while 10 concurrent writes are in flight, and the writes are flushable
- it is really important to understand some of these low level details when the slowness of a database starts being questioned
- another thing that the Splice Machine team did was to implement concurrent writes of data, indexes, and index constraints
- Leach also reiterated that writes are batched in chunks, an important feature – so if you're writing 100 records in HBase, make sure it does one WALEdit versus 100 fsyncs, because that would be a huge penalty
- someone from the audience inquired as to whether Splice Machine provides support for YARN resource management, and Leach indicated that this subject often comes up in sales deals
- yes, you can run HBase on YARN – but the Splice Machine resource manager is different, it is modeled after the Linux scheduling concept, a task stealing queue
- in YARN, you say you need this much memory, this much CPU – Splice Machine scheduling is much more about how many tasks can run in the DML queue, how many tasks can run in the DDL queue, and a queue always needs to be open for dictionary operations
- so if I'm running a 10b record x 10b record join, and Splice Machine is configured to use the complete capacity of the cluster, when you connect you should still be able to show tables, show indexes, show the dictionary
- Splice Machine supports Apache Mesos and YARN, but it has a governor for writes
- one scenario involved a consultant wanting to build a 10b record index, and the node just kept failing – sure a lot can be written, but HBase just fell over because it was too much – they tried to write something like 300k writes per second, and HBase couldn't handle that on each of the nodes
- so a governor needed to be written as part of the Splice Machine resource manager that indicates the most writes that a node can accept, and if it's more than what has been configured, tell it to back off and send a message back to the pipeline to provide a momentary delay and then try to resend
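The governor could look something like the sketch below (names and numbers are invented, not the actual Splice Machine component): each node tracks in-flight writes and, once a configured ceiling would be exceeded, rejects the batch with a retryable signal so the write pipeline backs off and resends.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class WriteGovernor {

    private final int maxInFlightWrites;
    private final AtomicInteger inFlight = new AtomicInteger();

    public WriteGovernor(int maxInFlightWrites) {
        this.maxInFlightWrites = maxInFlightWrites;
    }

    /** Returns true if the batch is admitted; false tells the pipeline to delay and resend. */
    public boolean tryAdmit(int batchSize) {
        while (true) {
            int current = inFlight.get();
            if (current + batchSize > maxInFlightWrites) {
                return false; // over the configured limit: back off
            }
            if (inFlight.compareAndSet(current, current + batchSize)) {
                return true;
            }
        }
    }

    public void release(int batchSize) {
        inFlight.addAndGet(-batchSize);
    }
}
```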
- one focus of the Splice Machine team has been stability – the only governor HBase provided was the IPC server threads – essentially a resource pool that, once filled, meant you could no longer do anything with HBase other than queue and wait
- for Splice Machine, a message instead gets sent back to the client when a limit is reached to indicate what should be done
- when Splice Machine comes up, 76 dictionary tables are created in HBase
- in response to a question from the audience, Leach reminded that in HBase you never update, you insert with a timestamp, and with snapshot isolation you can always see your data at a point in time – it's a temporal database – Oracle has this capability as well, called Flashback – everything is tombstone based, so if you drop a column you are really just adding tombstones
- once you have snapshot isolation, incremental backups are really easy because you can go back to a point in time to see what the data looked like
- someone from the audience commented that this is based on how many versions you tell HBase to save, and Leach replied that with Splice Machine all are saved – after discussing some related topics, he also mentioned a "manual vacuum command" that you can call that will cleanse a table back to the specified timestamp
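Underneath, this relies on HBase's timestamped cell versions; the illustrative scan below (table name and timestamp are placeholders) reads a table "as of" a point in time by bounding the time range.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class PointInTimeScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        long asOf = 1406764800000L; // e.g. 2014-07-31 00:00:00 UTC, in epoch millis

        HTable table = new HTable(conf, "orders");
        try {
            Scan scan = new Scan();
            scan.setTimeRange(0, asOf);   // only versions written before "asOf"
            scan.setMaxVersions(1);       // newest surviving version per cell
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            } finally {
                scanner.close();
            }
        } finally {
            table.close();
        }
    }
}
```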
- Leach indicated that he thinks this is a terrible slide
- as a general rule, the Splice Machine team doesn't share low-level details of the product
- but there are some great papers out there on which the product is based
- links to the research mentioned here are provided at the top of this blog post
- Leach also mentioned the paper "A Critique of Snapshot Isolation" by Daniel Gómez Ferro, who now works for Splice Machine
- Chen Zhang worked on a project called HBaseSI, and now works for Splice Machine as well
- Leach thought that this slide might come across as too salesy
- if Leach goes into a deal and he's competing against Impala, he simply realizes that it isn't Splice Machine's space
- the Splice Machine space is more of an Oracle, PostgreSQL, and DB2 space – where you're updating data – generally "big iron" companies
- Leach indicated that the way he looks at the database market is a bit different
- there are analytics-based approaches, and operational-based approaches, with the latter likely consisting of both OLTP and OLAP
- there is also the SQL-on-Hadoop space and the SQL-on-HBase space that includes products like Phoenix
- Leach suggests checking back with products in the SQL-on-Hadoop space every 3 to 6 months, due to the drastic changes these products are experiencing over time – "it's definitely an arms race"
- the higher-cost products are from established players – but the one nice thing about them is that they work
- we all laugh, but they're good products – Oracle works, for example – and it will scale up – but it's expensive
- Leach also mentioned Teradata, which is very expensive, but he recalls talking to a Teradata customer who really likes the product as well
- there are a lot of products out there that are not on Hadoop but work very well and are low risk
- the biggest challenge for all the lower-cost players at the top of the grid in this slide is to be able to demonstrate effectiveness and get to more sophisticated capabilities, whether on the analytics or transactional side
- in closing, Leach ran a very brief demo, during which time he posed the question – "what's so special about Splice Machine?"
- to help provide an answer, he ran some commands at the Splice Machine command line, which deleted records and then rolled back
- Leach explained that although running these commands looks very simple, what it shows is the ability to run an application against the database, thanks to the transactions that it provides
- in response to a question from the audience, Leach brought up Oracle Flashback again, which has 6 or 7 different features, one of which is to run a query from a specific point in time, and another which shows all data versions and changes – this functionality hasn't been a priority for Splice Machine as of yet
- Leach indicated that there is only one customer so far – a SaaS company involved in sales trending – which really wants this type of functionality, although he has seen use cases for it, such as in trading and insurance
- Leach would like to see a better way to do temporal data, but he hasn't seen it yet
- Splice Machine will have Flashback functionality by the end of the year (2014), although Leach indicated that it will be interesting to evaluate performance under different scenarios