Chicago Area Hadoop User Group (CHUG): Using HBase Co-Processors to Build a Distributed, Transactional RDBMS
July 30, 2014: 5:30 PM – 7:30 PM
From the promotional materials:
John Leach, Co-Founder and CTO of Splice Machine, with 15+ years of software development and machine learning experience, will discuss how to use HBase co-processors to build an ANSI SQL-99 database with 1) parallelization of SQL execution plans, 2) ACID transactions with snapshot isolation, and 3) consistent secondary indexing.
Transactions are critical in traditional RDBMSs because they ensure reliable updates across multiple rows and tables. Most operational applications require transactions, but even analytics systems use transactions to reliably update secondary indexes after a record insert or update.
In the Hadoop ecosystem, HBase is a key-value store with real-time updates, but it does not have multi-row, multi-table transactions, secondary indexes, or a robust query language like SQL. Combining SQL with a full transactional model over HBase opens a whole new set of OLTP and OLAP use cases for Hadoop that were traditionally reserved for RDBMSs like MySQL or Oracle. However, a transactional HBase system has the advantage of scaling out with commodity servers, leading to a 5x-10x cost savings over traditional databases like MySQL or Oracle.
HBase co-processors, introduced in release 0.92, provide a flexible and high-performance framework to extend HBase. In this talk, we show how we used HBase co-processors to support a full ANSI SQL RDBMS without modifying the core HBase source. We will discuss how endpoint co-processors are used to serialize SQL execution plans over to regions so that computation is local to where the data is stored. Additionally, we will show how observer co-processors simultaneously support both transactions and secondary indexing.
The talk will also discuss how Splice Machine extended the work of Google Percolator, Yahoo Labs’ OMID, and the University of Waterloo on distributed snapshot isolation for transactions. Lastly, performance benchmarks will be provided, including full TPC-C and TPC-H results that show how Hadoop/HBase can be a replacement of traditional RDBMS solutions.
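To make the observer idea concrete, here is a minimal, hypothetical sketch of a RegionObserver co-processor that maintains a dense secondary index by writing an index row on every base-table write. This is not Splice Machine's implementation; the class, table, and column names are invented, and the prePut signature matches the HBase 0.96/0.98-era API that was current at the time of the talk.

```java
import java.io.IOException;

import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

public class SecondaryIndexObserver extends BaseRegionObserver {

    private static final byte[] FAMILY = Bytes.toBytes("d");
    private static final byte[] EMAIL  = Bytes.toBytes("email");

    @Override
    public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                       Put put, WALEdit edit, Durability durability) throws IOException {
        // Only index rows that actually carry the indexed column.
        if (!put.has(FAMILY, EMAIL)) {
            return;
        }
        byte[] indexedValue = CellUtil.cloneValue(put.get(FAMILY, EMAIL).get(0));

        // Index row key = indexed value + base-table row key: one index row per
        // base row, i.e. a "dense" index.
        Put indexPut = new Put(Bytes.add(indexedValue, put.getRow()));
        indexPut.add(FAMILY, Bytes.toBytes("src"), put.getRow());

        // Write the index entry through the co-processor environment.
        HTableInterface indexTable =
            ctx.getEnvironment().getTable(TableName.valueOf("users_by_email_idx"));
        try {
            indexTable.put(indexPut);
        } finally {
            indexTable.close();
        }
    }
}
```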
- Splice Machine is the only Hadoop RDBMS
- John Leach – Splice Machine Co-Founder and CTO
- why does Splice Machine exist?
- Leach was fortunate enough to try to aid the world's largest retailer in adopting Hadoop
- the team ran a POC which was initially very impressive
- they took massive POS files and loaded them into Hadoop very quickly
- together with the ability to query this data with Hive etc, they thought this was the way to go
- however, while running the POC they started thinking about what the actual project was going to look like beyond the POC
- they came to realize that there were thousands of data sources throughout the organization being updated throughout the day and night
- this data had duplication in it, so they would have to dedup the data
- there were data corruption issues as well, so they would have to put constraints on the data to make sure that they would not end up querying garbage data
- project stakeholders began quoting industry publications that talked about how Hadoop was going to change the world, and questioned why Leach was just calling Hadoop a file system
- the executives sat Leach down, and said that if using Hadoop was going to be the solution, they should be able to hook in applications they already own into it
- they basically said – we don't want a team of 100 people writing "some NoSQL serialization stuff with thousands of lines of Java code" – we already have a lot of code out there already written with standard ANSI SQL – "what's going on here?"
- this was about 4 years ago, and Leach has worked on a few other projects since
- one project was for a biotech firm where stakeholders wanted to update data as well as query data with a consistent view
- Leach responded by inquiring as to how they wanted to interact with the data, and they responded "with SQL" of course – and what are you going to use for SQL? – long story short, the project ended up stuck in research land
- another project at a consumer product firm about 2 1/2 years ago involved stakeholders who wanted to run a complete CRM repository on top of Hadoop – Leach wondered how that was possible
- Leach ended up partnering with some colleagues from a prior venture to start Splice Machine
- essentially, what they wanted to do was to build a relational database on top of Hadoop and HBase, to enable the same type of standard workloads that people are used to running on Oracle, PostgreSQL, and MySQL – using the ANSI SQL:1999 standard
- why was this important? they wanted a transactional system, one that could work with any existing application – whether it be Hibernate, Unica, Cognos, etc – without any customizations or changing of SQL dialects
- so far, Splice Machine has raised $22m and has about 50 people in San Francisco
- data is doubling every 2 years, driven by web, social, mobile, and the internet of things
- perhaps more importantly, operational data is growing fast as well
- operational data stores are under a lot of pressure – not just the little data marts or data warehouse snapshots that analytics are run against, but operational data stores in general
- go to your Oracle folks and ask them how many RACs they have
- thousands? and next year there will be a lot more – this data is growing very quickly
- traditional RDBMSs are overwhelmed, and scale-up is becoming cost prohibitive
- "reports take forever"
- "data has to be thrown away"
- "Oracle is too darn expensive"
- "my database is hitting the wall"
- "users keep getting those spinning beach balls"
- scale-out is the future of databases
- scale-up = increase server size
- scale-out = more small servers
- dramatic improvement in price/performance
- Splice Machine is the only Hadoop RDBMS
- replace your old RDBMS with a scale-out SQL database
- affordable scale-out, ACID transactions, no application rewrites
- case study – Harte-Hanks, a digital marketing services company
- real-time campaign management
- complex OLTP and OLAP environment, traditionally a non-Hadoop environment – nobody would say, "Let's put this on Hadoop"
- even though they needed scale and cost reduction, most would say you can't do that
- in addition to Cognos and Unica, they were using Ab Initio for ETL
- Oracle RAC too expensive to scale
- queries too slow – even up to 30 minutes
- getting worse with 30%-50% data growth
- looked 9 months for cost effective solution
- they wanted something affordable that could handle the types of workloads they experienced
- in the case of Harte-Hanks, they spent about 2 years optimizing their SQL
- campaign management is "a stinker" – it's really hard, because even for a medium-sized retailer you still have to send out say 3m-4m emails
- to accurately score that is a decent OLAP workload
- you have to look at all the customer transactions during the time periods under investigation to see what really happened
- it's a decent amount of activity that needs to be done all while serving as a list selection tool
- in response to a question from the audience, without getting too specific with regard to this particular client, Leach indicated that firms in this space typically struggle with 3TB to 12TB of data
- many would say this really isn't a lot of data – but there are a lot of companies in this space
- Leach mentioned Exadata, but didn't go into detail and just called it "a different cost structure"
- initial results: 10x-20x price/performance improvement with no application, BI, or ETL rewrites
- 1/4 the cost of Oracle environment with commodity scale out
- 3x-7x faster through parallelized queries
- in response to a question from the audience, Leach indicated that most of the database activity here was reads, and that HBase traditionally executes inserts quickly
- HBase uses an LSM tree, so as data comes in it goes into a concurrent skip list, a pretty efficient real-time structure, and while that happens a write-ahead log entry is written
- Splice Machine batches write-ahead log entries, and has a parallel write pipeline where reads are not blocked
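As a rough illustration of that write path (a toy model, not HBase internals): a write is logged first and then inserted into a sorted concurrent skip list, which is what keeps recent writes readable before anything is flushed to disk.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.CopyOnWriteArrayList;

public class ToyMemstore {

    // Unsigned lexicographic byte[] comparison, like HBase's row-key ordering.
    private static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    private final Map<byte[], byte[]> memstore =
            new ConcurrentSkipListMap<>(ToyMemstore::compareBytes);
    private final List<byte[]> writeAheadLog = new CopyOnWriteArrayList<>();

    public void put(byte[] key, byte[] value) {
        writeAheadLog.add(value);   // 1. durability first (stand-in for a WAL append)
        memstore.put(key, value);   // 2. then the sorted in-memory structure
    }

    public byte[] get(byte[] key) {
        return memstore.get(key);   // recent writes are readable immediately
    }
}
```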
- Splice Machine focuses on price/performance because in reality databases can run faster and faster, they just cost an arm and a leg
- Leach noted that they guarantee no data loss, which is difficult to guarantee with Hadoop
- Splice Machine waits for files to sync on multiple machines, and unlike MapR, Splice Machine has dual name nodes
- Leach went over some reference architectures
- for operational applications, affordable scale-out is provided with a high concurrency of real-time reads and writes
- Splice Machine is really targeted at the Oracle market
- when the company was just getting started, an analyst asked them how many customers they thought Teradata has, and Leach responded with "millions" perhaps, who knows? – and the analyst said that Teradata actually had about 1500 at the time
- how about Oracle? – at the time, about 400k customers
- the analyst indicated that he thought the biggest challenge was that people looked at databases as OLTP tuned for transactions – or OLAP for analytics – applications actually sit in the middle and require a certain set of functions to run – and that is where Splice Machine should play, that is why Oracle is such a successful company
- the Splice Machine concept of the data lake is a bit different
- the general concept of the data lake is to just throw some files out there, and then eventually when someone wants to run some analytics they can do that
- the data just sits there – it's not cleansed or constrained, it's not structured
- there are some pros to having data in this state, applying the schema on read – you don't apply the schema until the data is read from the files
- Splice Machine applies the schema on write
- this means that primary key constraints, uniqueness constraints, and referential constraints can be enforced
- in addition, data can be typed, so if something is wrong with the data, it won't be loaded
- someone from the audience inquired about the types of connectors that Splice Machine offers, and Leach indicated that they haven't gone that route – the three options that are offered are JDBC, ODBC, and product-specific command-line tools that were created to load data from HDFS into Splice Machine
- the company offers a number of demos, including Node.js, Grails, PHP with ODBC, etc to demonstrate JDBC and ODBC connectivity
- another reference architecture that Leach discussed was the unified customer profile
- where high concurrency exists and you are streaming data
- everyone says, "I'm going to use Spark!" – but the key thing when using any streaming technology is that you need a sink, since a stream can only hold so much data at once and you need a place to put the data
- when Splice Machine picked their first customers for POCs, they obviously wanted to pick different databases to go up against
- sequences for example are really hard in a distributed system
- Leach indicated that he doesn't think Hive really belongs on this particular slide, and thinks it's really about relational database technology – involving ACID and updates to data
- in response to a question from the audience, Leach indicated that DB2 is a "weird one for us" because Splice Machine is actually built using Apache Derby, which follows DB2 syntax
- a big advantage for Splice Machine is that a lot of software packages use DB2 dialects, and being able to walk in with the knowledge that Apache Derby is under the covers is really powerful
- Hadoop is going to be the fabric for file system based storage, and while this may change over the years – it may not be called Hadoop, for example – there is still going to be a distributed file system, and on top of that file system will be a relational database
- when Splice Machine set out to create a product, they obviously didn't want to write a SQL parser, so they looked at existing databases and decided on Apache Derby
- Apache Derby is ANSI SQL:1999 compliant, Java-based, is actually used quite a bit on cell phones, and is extremely lightweight, which is very appealing
- Apache Derby is also ODBC/JDBC compliant, and has unit tests built over a period of 10 years, so the team didn't need to hire people to write these – the testing team would likely be bigger than the development team
- Leach initially looked at Apache Cassandra, Voldemort, etc. – every key-value store out there – ran performance tests, and followed Facebook's guidance in the sense that it's really nice to have a consistent database
- HBase/HDFS is attractive because of the scalability it provides, although Facebook did tweak it a bit to get it to scale so well to petabytes
- Apache Derby is modular, lightweight, unicode – having i18n was a big thing for the Splice Machine team
- Apache Derby provides native authentication as well as LDAP, and the Apache Derby team was clever enough to allow a custom authenticator to be plugged in
- Apache Derby provides authorization down to the column
- Apache Derby also has an ARIES-based concurrency model, which Leach indicated was "a little bit of a tough model"
- Apache Derby is a mature product and has been around for about 12 years
- advanced features that Apache Derby provides are Java stored procedures, triggers, two-phase commit (XA support), updatable SQL views, full transaction isolation support, encryption, and Java custom functions
- if you are using Java and want to execute a SQL statement, you create a PreparedStatement
- Splice Machine looks it up in the cache, parses it with the JavaCC parser, and binds it to the dictionary, which Leach emphasized is under snapshot isolation as well
- the plan is then optimized using a cost based optimizer, and generates code for the plan in bytecode, which is shipped to the nodes
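On the client side that whole lifecycle is hidden behind standard JDBC; the sketch below assumes a Derby-style client URL (host, port, database name, table, and credentials are placeholders, not documented Splice Machine defaults).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PreparedStatementExample {
    public static void main(String[] args) throws Exception {
        // Illustrative Derby-style network client URL.
        String url = "jdbc:splice://localhost:1527/splicedb;user=app;password=app";
        try (Connection conn = DriverManager.getConnection(url)) {
            String sql = "SELECT customer_id, total FROM orders WHERE order_date > ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setDate(1, java.sql.Date.valueOf("2014-01-01"));
                // Caching, parsing, binding, optimization, and bytecode generation
                // all happen inside the Derby-derived engine when this executes.
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getLong(1) + " " + rs.getBigDecimal(2));
                    }
                }
            }
        }
    }
}
```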
- Apache Derby is block file-based, similar to most singleton type databases, and the Splice Machine team replaced this storage with HBase tables
- indexes were B-Tree in Apache Derby, and are dense indexes in HBase, which is a fancy way of saying that every row in the base table has a corresponding row in the index, which has advantages and disadvantages
- the plan is to implement other index strategies in Splice Machine, but right now the focus is dense indexes
- concurrency is the area in which the Splice Machine team spent the most time – Apache Derby is a lock-based ARIES system, which basically means that when you update data you lock it, which doesn't scale – so it was replaced with an MVCC structure, which HBase provides
- for the project-restrict plan, Apache Derby predicates are on a centralized file scanner, and now the predicates are actually pushed into the HBase filters
- the amount of data going across the network needs to be restricted – the network is your enemy – any time you need to do anything over the network, costs are significant – the cost based optimizer penalizes for anything that needs to be performed over the network
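A hypothetical illustration of pushing a predicate into an HBase filter, using the plain HBase 0.96/0.98-era client API (table and column names invented): the restriction travels with the Scan, so the region servers drop non-matching rows before anything crosses the network.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class PredicatePushdownExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "orders");
        try {
            Scan scan = new Scan();
            // Equivalent of WHERE status = 'SHIPPED', evaluated region-side.
            SingleColumnValueFilter filter = new SingleColumnValueFilter(
                    Bytes.toBytes("d"), Bytes.toBytes("status"),
                    CompareOp.EQUAL, Bytes.toBytes("SHIPPED"));
            filter.setFilterIfMissing(true);
            scan.setFilter(filter);

            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            } finally {
                scanner.close();
            }
        } finally {
            table.close();
        }
    }
}
```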
- Apache Derby aggregations are serially computed, whereas with Splice Machine aggregations are pushed to shards and spliced back together (hence the name of the company)
- Apache Derby has 2 join plans – basically hash, and everything else is a nested loop, which doesn't scale well – so 5 join plans were implemented for Splice Machine: a classic broadcast join, where a small table is pushed to a big table; a sort-merge join, where keys are created on the fly to join 2 tables, requiring the tables to be reshuffled so the merge can be done; a merge join; a nested loop join; and a batch nested loop join (implemented in Oracle 12, and also recently implemented in MySQL), which is much faster than a plain nested loop join – a schematic broadcast join appears below
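The broadcast plan is the easiest to picture; this schematic (not Splice Machine's generated operator code) assumes the small table has already been shipped to every node as an in-memory hash map, so each big-table row joins locally with no per-row network traffic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class BroadcastJoinSketch {

    // bigTableRows: {customerId, orderId} pairs streamed from the local region;
    // smallTable: the broadcast side, a full copy held in memory on every node.
    static List<String> broadcastHashJoin(List<long[]> bigTableRows,
                                          Map<Long, String> smallTable) {
        List<String> joined = new ArrayList<>();
        for (long[] row : bigTableRows) {
            String customerName = smallTable.get(row[0]);
            if (customerName != null) {
                joined.add("order " + row[1] + " -> " + customerName);
            }
        }
        return joined;
    }
}
```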
- while Leach was providing an overview of some HBase features, he called out the fact that the team spent a lot of time on serialization
- all data is stored in byte sorted order and packed into one column, stored basically the same as a relational database
- why is this done? if you take say a 50-column table defined with varchar(20) fields that takes up 1TB in Oracle, it would take up 10TB in HBase compressed
- in Oracle you have a bitset indicating which fields are set, with the values packed together
- in HBase you have a key-value for each column, repeating the family name, column name, and timestamp – you now have an I/O structure that doesn't work
- customers have complained about HBase being slow – Leach inquires as to how they serialized the data, and they basically say that they took whatever was in their relational database and jammed it into each column in HBase
- it's slow because it takes a long time to scan 10TB of data – you can't fix I/O – once it's broken, it's broken
- 3x compression and 3x replication roughly cancel each other out, so Splice Machine stores that 1TB of data in about 1TB on disk
- this is one area of Hadoop architecture that isn't discussed very often – if you want low latency, I/O matters – you need to try to use as little data on disk as possible, because you have to read it
- Leach reiterated that both compression and serialization are important here
- storing the data in sortable bytes provides the ability to perform byte level comparisons
- if you are going to build a database on top of Hadoop, serialization is a really important piece – how big is the data on disk? how are predicates being applied? – byte level comparisons or conversion to objects, in which case you will have heap problems because you are creating a lot of objects
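A simplified sketch of the "sortable bytes" idea (not Splice Machine's actual encoder): each value is encoded so that unsigned byte comparison of the encoded form matches the natural ordering of the value, and all of a row's columns are packed into a single byte[] rather than one HBase key-value per column.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SortableBytesExample {

    // Flip the sign bit so lexicographic unsigned comparison of the 4 bytes
    // matches numeric ordering of the int.
    static byte[] encodeInt(int v) {
        int flipped = v ^ 0x80000000;
        return new byte[] {
            (byte) (flipped >>> 24), (byte) (flipped >>> 16),
            (byte) (flipped >>> 8),  (byte) flipped
        };
    }

    // Pack several encoded columns into one value, using 0x00 as a separator
    // (a real encoder would escape separators; this only shows the idea).
    static byte[] packRow(byte[]... encodedColumns) {
        int len = 0;
        for (byte[] col : encodedColumns) len += col.length + 1;
        byte[] packed = new byte[len];
        int pos = 0;
        for (byte[] col : encodedColumns) {
            System.arraycopy(col, 0, packed, pos, col.length);
            pos += col.length;
            packed[pos++] = 0x00; // field separator
        }
        return packed;
    }

    public static void main(String[] args) {
        byte[] row = packRow(encodeInt(42),
                             "chicago".getBytes(StandardCharsets.UTF_8));
        // Predicates can now be applied with byte-level comparisons, without
        // deserializing into Java objects.
        System.out.println(Arrays.toString(row));
    }
}
```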
- in response to a question from the audience, Leach explained that the way databases are moving forward, they are no longer OLTP or OLAP – they are hybrids – data comes in row-based and stored in some sort of columnar structure – the best paper on this subject is probably the one on C-Store by Stonebraker, which was commercialized by Vertica
- Leach explains that he cringes when people talk about row-based and column-based databases, because a coherent database should really be both
- it's not even columnar any more – it's whether you've already computed the data, already joined the data
- if you think about ETL that is done at night, followed by reads during the day, the data should really be converted to a columnar format if it's not being written to – it provides better compression, better I/O, faster speeds
- so what did the Splice Machine team do with HBase?
- the biggest was the asynchronous write pipeline
- writes to HBase are "terrible" – they block
- so if you read 1k records and want to send them across the wire, while you send them you can't read any more records
- this is just not workable – a database needs an asynchronous write pipeline, where you can read, say, 1m records as fast as possible while 10 concurrent writes are in flight, and the writes are flushable
- it is really important to understand some of these low level details when the slowness of a database starts being questioned
- another thing that the Splice Machine team did was to implement concurrent writes of data, indexes, and index constraints
- Leach also reiterated that writes are batched in chunks, an important feature – so if you're writing 100 records in HBase, make sure it does one WALEdit versus 100 fsyncs, because that would be a huge penalty
- someone from the audience inquired as to whether Splice Machine provides support for YARN resource management, and Leach indicated that this subject often comes up in sales deals
- yes, you can run HBase on YARN – but the Splice Machine resource manager is different, it is modeled after the Linux scheduling concept, a task stealing queue
- in YARN, you say you need this much memory, this much CPU – Splice Machine scheduling is much more about how many tasks can run in the DML queue, how many tasks can run in the DDL queue, and a queue always needs to be open for dictionary operations
- so if I'm running a 10b record x 10b record join, and Splice Machine is configured to use the complete capacity of the cluster, when you connect you should still be able to show tables, show indexes, show the dictionary
- Splice Machine supports Apache Mesos and YARN, but it has a governor for writes
- one scenario involved a consultant wanting to build a 10b record index, and the node just kept failing – sure a lot can be written, but HBase just fell over because it was too much – they tried to write something like 300k writes per second, and HBase couldn't handle that on each of the nodes
- so a governor needed to be written as part of the Splice Machine resource manager that indicates the most writes that a node can accept, and if it's more than what has been configured, tell it to back off and send a message back to the pipeline to provide a momentary delay and then try to resend
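The governor could look something like the sketch below (names and numbers are invented, not the actual Splice Machine component): each node tracks in-flight writes and, once a configured ceiling would be exceeded, rejects the batch with a retryable signal so the write pipeline backs off and resends.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class WriteGovernor {

    private final int maxInFlightWrites;
    private final AtomicInteger inFlight = new AtomicInteger();

    public WriteGovernor(int maxInFlightWrites) {
        this.maxInFlightWrites = maxInFlightWrites;
    }

    /** Returns true if the batch is admitted; false tells the pipeline to delay and resend. */
    public boolean tryAdmit(int batchSize) {
        while (true) {
            int current = inFlight.get();
            if (current + batchSize > maxInFlightWrites) {
                return false; // over the configured limit: back off
            }
            if (inFlight.compareAndSet(current, current + batchSize)) {
                return true;
            }
        }
    }

    public void release(int batchSize) {
        inFlight.addAndGet(-batchSize);
    }
}
```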
- one focus of the Splice Machine team has been stability – the only governor HBase provided was the IPC server threads – essentially a resource pool that, once filled, meant you could no longer do anything with HBase other than queue and wait
- for Splice Machine, a message instead gets sent back to the client when a limit is reached to indicate what should be done
- when Splice Machine comes up, 76 dictionary tables are created in HBase
- in response to a question from the audience, Leach reminded that in HBase you never update, you insert with a timestamp, and with snapshot isolation you can always see your data at a point in time – it's a temporal database – Oracle has this capability as well, called Flashback – everything is tombstone based, so if you drop a column you are really just adding tombstones
- once you have snapshot isolation, incremental backups are really easy because you can go back to a point in time to see what the data looked like
- someone from the audience commented that this is based on how many versions you tell HBase to save, and Leach replied that with Splice Machine all are saved – after discussing some related topics, he also mentioned a "manual vacuum command" that you can call that will cleanse a table back to the specified timestamp
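Underneath, this relies on HBase's timestamped cell versions; the illustrative scan below (table name and timestamp are placeholders) reads a table "as of" a point in time by bounding the time range.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class PointInTimeScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        long asOf = 1406764800000L; // e.g. 2014-07-31 00:00:00 UTC, in epoch millis

        HTable table = new HTable(conf, "orders");
        try {
            Scan scan = new Scan();
            scan.setTimeRange(0, asOf);   // only versions written before "asOf"
            scan.setMaxVersions(1);       // newest surviving version per cell
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            } finally {
                scanner.close();
            }
        } finally {
            table.close();
        }
    }
}
```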
- Leach indicated that he thinks this is a terrible slide
- as a general rule, the Splice Machine team doesn't share low-level details of the product
- but there are some great papers out there on which the product is based
- links to the research mentioned here are provided at the top of this blog post
- Leach also mentioned the paper "A Critique of Snapshot Isolation" by Daniel Gómez Ferro, who now works for Splice Machine
- Chen Zhang worked on a project called HBaseSI, and now works for Splice Machine as well
- Leach thought that this slide might come across as too salesy
- if Leach goes into a deal and he's competing against Impala, he simply realizes that it isn't Splice Machine's space
- the Splice Machine space is more of an Oracle, PostgreSQL, and DB2 space – where you're updating data – generally "big iron" companies
- Leach indicated that the way he looks at the database market is a bit different
- there are analytics-based approaches, and operational-based approaches, with the latter likely consisting of both OLTP and OLAP
- there is also the SQL-on-Hadoop space and the SQL-on-HBase space that includes products like Phoenix
- Leach suggests checking back with products in the SQL-on-Hadoop space every 3 to 6 months, due to the drastic changes these products are experiencing over time – "it's definitely an arms race"
- the higher-cost products are from established players – but the one nice thing about them is that they work
- we all laugh, but they're good products – Oracle works, for example – and it will scale up – but it's expensive
- Leach also mentioned Teradata, which is very expensive, but he recalls talking to a Teradata customer who really likes the product as well
- there are a lot of products out there that are not on Hadoop but work very well and are low risk
- the biggest challenge for all the lower-cost players at the top of the grid in this slide is to be able to demonstrate effectiveness and get to more sophisticated capabilities, whether on the analytics or transactional side
- in closing, Leach ran a very brief demo, during which time he posed the question – "what's so special about Splice Machine?"
- to help provide an answer, he ran some commands at the Splice Machine command line, which deleted records and then rolled back
- Leach explained that although running these commands looks very simple, what it shows is the ability to run an application against the database, thanks to the transactions that it provides
- in response to a question from the audience, Leach brought up Oracle Flashback again, which has 6 or 7 different features, one of which is to run a query from a specific point in time, and another which shows all data versions and changes – this functionality hasn't been a priority for Splice Machine as of yet
- Leach indicated that there is only one customer so far – a SaaS company involved in sales trending – which really wants this type of functionality, although he has seen use cases for it, such as in trading and insurance
- Leach would like to see a better way to do temporal data, but he hasn't seen it yet
- Splice Machine will have Flashback functionality by the end of the year (2014), although Leach indicated that it will be interesting to evaluate performance under different scenarios