Chicago Area Hadoop User Group (CHUG):A Night of SQL on Hadoop (June 21, 2016)
June 21, 2016
5:30 PM – 7:00 PM
From the promotional materials:
'SQL on Hadoop'
As more corporations are moving to mainstream their Hadoop clusters, there is an increased interest in providing Business Intelligence (BI) functionality. SQL on Hadoop is one of the top interests as corporations try to gain more value from their data lakes.
Please join us for a talk that introduces us to the concept of ‘SQL on Hadoop’, the various vendors and their query engines, and how to be effective producing OLAP cubes on Hadoop.
About the Speaker:
Chris Oshiro is a 17 year veteran in the BI space. Chris brings a unique view of the software landscape as a result of wearing several hats during his career. Mr. Oshiro has worked for 5 different analytics companies and currently runs the pre-sales organization for AtScale Inc.
- Chris Oshiro didn't make it, as his client pulled him away
- Bob Kelly, also from AtScale, was his replacement for the talk, and has been doing business intelligence for 23 years
- Bob started using Crystal Reports, then "Forest and Trees"
- he joked that you could create drop-downs using SQL, a big deal back then
- after starting some companies, he started working for other companies such as Platfora and AtScale
- originally, they were creating OLAP cubes outside the database, creating redundant data
- then they started using materialized views, pushing data back into the database
- he also remembers using "InfoBeacon" (apparently later renamed "DecisionBase") at some point
- Bob is a field engineer, not a sales engineer like Chris
- he has been working with "real" big data use cases in the field for companies like Uber
- according to Bob, the agenda for this talk is (1) Hadoop adoption (a reality check), (2) SQL on Hadoop engines benchmark, and (3) best practices on software design for Hadoop adoption
- Bob evaluates clients as to what stage they are at with regard to Hadoop adoption – clients are typically not as far along as they think
- the original Hadoop work being done really wasn't business intelligence – it was trying to address specific use cases
- now, companies are really moving to putting business intelligence onto Hadoop – it's really just starting to get popular
- in response to a question from Boris Lublinsky, one of the CHUG organizers, around the definition of "business intelligence", Bob clarified that to him, business intelligence involves some aspect of users hitting the data directly with a user interface such as Tableau – the ability to issue real-time queries is also an important aspect
- Boris responded by saying that 90% of users just use Hive – that's it – everyone seems to be stuck on SQL queries and they don't want to do anything else
- Bob went on to say that he has worked with 20 clients over the last 3 years, and sided with Chris (the original writer of the presentation) that business intelligence is still a new focus
- Boris continued to argue, saying that Spark 2.0 books are available etc – and Bob countered by saying that clients are not actually using these types of tools in production
- Bob clarified his stance by saying that companies have been using Hadoop for data science and ETL, but not business intelligence
- according to Bob, data engineers and data scientists have been the primary users of Hadoop – one of the reasons he moved from Platfora to AtScale
- as Hadoop clusters are being used longer, companies are looking to move to self-service
- the number one problem seems to be user skillset
- how can the market correct this situation? by letting users use existing skillsets
- no single SQL on Hadoop solution is the best for all queries
- what business intelligence folks immediately want to do is create star schemas with multiple joins etc – this causes Hadoop performance issues
- Spark SQL performs best for multi-join queries
- processing on Spark displays significant performance gains at the expense of needing high amounts of memory
- AtScale makes an effort at keeping data where it is
- the iterative nature of the way data warehousing works minimizes the need for data extracts etc
- most Hadoop clusters out there are a mess – people do not know what they are doing
- when a query comes in, AtScale creates an aggregate in its cache
- two things come into play: the facts that should be included in the aggregate, and do not create the aggregate at too high a level
- in response to a question from the audience, the speaker clarified by saying that the "adaptive cache" is really just a set of aggregates that sit in HDFS
- it uses a series of API calls to refresh the aggregates which take place at the end of the ETL process
- in response to an argument from Boris, Bob responded by saying that you can kick off rebuilds (refreshes) of all aggregates, but it's much easier to be selective when you know the data
- Bob demonstrated what is called the "AtScale Design Center"
- the process that takes place here – mapping, building cubes etc – essentially happens once
- he commented that the demonstration he was running was over his cell phone
- according to Bob, building models here is similar to using ERwin etc
- AtScale only accesses data in the Hive warehouse
- the primary push right now is to support non-Hadoop databases – today, AtScale only works with Hive in Hadoop
- what he has seen is that customers want to access data in other databases (e.g. Teradata etc) and join this data with Hadoop data
- someone in the audience asked whether the Hadoop cache is used by AtScale
- in response, Bob indicated that the first question to ask is where are the aggregates to be stored, and the second question to ask is where the joins are to be executed
- one of the core AtScale constructs is to leave the data where it is, but he commented that the data needs to be moved to perform operations such as this
- according to a discussion Bob had with the AtScale CTO, it should be possible to hit any data source with AtScale
- Bob demonstrated hitting a cube from Tableau
- according to Bob, one of the "neat" features of AtScale is the "side car" – essentially a separate window in Chrome that shows the actual query being executed
- in the demonstration, he emphasized how the query hit an aggregate, so it executed more quickly than it would have otherwise
- he clarified by saying that AtScale will often clean statement issues from Tableau etc – such as getting rid of CASE stements etc
- according to Bob, the aggregate rebuild is really the only maintenance step – he clarified by saying that if a refresh is done, only the pertinent portion is rebuilt
- the product used to have a REST API, but hitting it didn't indicate whether it was successful – just that the request was received – so he advocates use of their Java API instead
- in addition, AtScale prefers that all calculations make use of cubes – mistakes are often made with Tableau
- he referred to a database table at American Express that had 275k columns in it, but it was for used for data science – this is not the type of work toward which AtScale is geared
- someone in the audience mentioned a scenario involving quants, and Bob responded by saying that the data science folks use R and Pig
- AtScale doesn't make use of EMR at all – it would just run on an edge box
- the EMR that Hadoop makes use of doesn't go away, but AtScale doesn't provide hooks into it
- Bob commented that they haven't figured out how to join between database products yet
- at Sears, he saw that their goal was to completely replace all usage of Teradata with Hadoop
- running on Tableau is still a new notion – according to Bob, it didn't do so very well until AtScale
- he commented that it is common for folks moving away from well tuned RDBMS's to Hadoop to complain about performance
- with the API provided by AtScale, all metadata can be accessed – an attempt is being made to be as agnostic as possible so that they don't end up customizing for products like Tableau
- someone in the audience asked about support for Microsoft Excel, and he said they support it, unusual because he doesn't see this particular support provided very often
- someone else asked about Azure support, and he responded by saying "You know how many clients I've been to who use Azure? Zero!" – then he backtracked and said "One"
- in response, this audience member got defensive and said that his product is available on both AWS and Azure
- Bob commented that Azure credits are literally being given away since AWS has such a stronghold, but perhaps it might not be the same situation down the road
- after Bob said that "Excel is the number one most popular business intelligence tool in the world today", I just shook my head
- he followed up by saying that "If I were starting a business today I would not be advocating use of Excel"
- Boris commented by saying that it is "so scary" that we are limiting ourselves to tools 30 years old