Chicago Area Hadoop User Group (CHUG): A Night of SQL on Hadoop (June 21, 2016)

HERE

June 21, 2016
5:30 PM – 7:00 PM


[Photo: Chicago Area Hadoop User Group (CHUG) meetup, June 21, 2016]


From the promotional materials:

'SQL on Hadoop'

As more corporations are moving to mainstream their Hadoop clusters, there is an increased interest in providing Business Intelligence (BI) functionality. SQL on Hadoop is one of the top interests as corporations try to gain more value from their data lakes.

Please join us for a talk that introduces us to the concept of ‘SQL on Hadoop’, the various vendors and their query engines, and how to be effective at producing OLAP cubes on Hadoop.

About the Speaker:

Chris Oshiro is a 17-year veteran in the BI space. Chris brings a unique view of the software landscape as a result of wearing several hats during his career. Mr. Oshiro has worked for 5 different analytics companies and currently runs the pre-sales organization for AtScale Inc.


  • Chris Oshiro didn't make it, as his client pulled him away
  • Bob Kelly, also from AtScale, was his replacement for the talk, and has been doing business intelligence for 23 years
  • Bob started using Crystal Reports, then "Forest and Trees"
  • he joked that you could create drop-downs using SQL, a big deal back then
  • after founding some companies of his own, he went on to work for others such as Platfora and AtScale
  • originally, they were building OLAP cubes outside the database, creating redundant data
  • then they started using materialized views, pushing data back into the database
  • he also remembers using "InfoBeacon" (apparently later renamed "DecisionBase") at some point
  • Bob is a field engineer, not a sales engineer like Chris
  • he has been working with "real" big data use cases in the field for companies like Uber
  • according to Bob, the agenda for this talk is (1) Hadoop adoption (a reality check), (2) SQL on Hadoop engines benchmark, and (3) best practices on software design for Hadoop adoption
  • Bob evaluates clients on how far along they are with Hadoop adoption – clients are typically not as far along as they think
  • the original Hadoop work being done really wasn't business intelligence – it was trying to address specific use cases
  • now, companies are really moving to putting business intelligence onto Hadoop – it's really just starting to get popular
  • in response to a question from Boris Lublinsky, one of the CHUG organizers, around the definition of "business intelligence", Bob clarified that to him, business intelligence involves some aspect of users hitting the data directly with a user interface such as Tableau – the ability to issue real-time queries is also an important aspect
  • Boris responded by saying that 90% of users just use Hive – that's it – everyone seems to be stuck on SQL queries and they don't want to do anything else
  • Bob went on to say that he has worked with 20 clients over the last 3 years, and sided with Chris (the original writer of the presentation) that business intelligence is still a new focus
  • Boris continued to argue, saying that Spark 2.0 books are available etc – and Bob countered by saying that clients are not actually using these types of tools in production
  • Bob clarified his stance by saying that companies have been using Hadoop for data science and ETL, but not business intelligence
  • according to Bob, data engineers and data scientists have been the primary users of Hadoop – one of the reasons he moved from Platfora to AtScale
  • as Hadoop clusters are being used longer, companies are looking to move to self-service
  • the number one problem seems to be user skillset
  • how can the market correct this situation? by letting users apply their existing skillsets
  • no single SQL on Hadoop solution is the best for all queries
  • what business intelligence folks immediately want to do is create star schemas with multiple joins etc – this causes Hadoop performance issues
  • Spark SQL performs best for multi-join queries (see the sketch after this list)
  • processing on Spark displays significant performance gains at the expense of needing high amounts of memory
  • AtScale makes an effort at keeping data where it is
  • the iterative nature of the way data warehousing works minimizes the need for data extracts etc
  • most Hadoop clusters out there are a mess – people do not know what they are doing
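
To make the star-schema point concrete, here is a minimal PySpark sketch of the kind of multi-join rollup query being discussed. The schema (a sales fact table plus dates, products, and customers dimensions) is hypothetical rather than anything shown in the talk, and the tables are assumed to already exist in the Hive warehouse:

    # Minimal PySpark sketch (hypothetical schema): a star-schema rollup with
    # multiple joins, the query pattern said to cause Hadoop performance
    # issues, and one that Spark SQL handles relatively well.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("star-schema-demo")
             .enableHiveSupport()
             .getOrCreate())

    result = spark.sql("""
        SELECT d.year,
               p.category,
               c.region,
               SUM(s.amount) AS total_sales
        FROM   sales s
        JOIN   dates d     ON s.date_key     = d.date_key
        JOIN   products p  ON s.product_key  = p.product_key
        JOIN   customers c ON s.customer_key = c.customer_key
        GROUP  BY d.year, p.category, c.region
    """)

    result.show()

Each additional dimension join typically means another shuffle over the large fact table, which is the crux of the performance issue noted above.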

[Photo: AtScale architecture slide]


  • when a query comes in, AtScale creates an aggregate in its cache
  • two things come into play: which facts should be included in the aggregate, and not creating the aggregate at too high a level
  • in response to a question from the audience, the speaker clarified by saying that the "adaptive cache" is really just a set of aggregates that sit in HDFS (see the sketch after this list)
  • a series of API calls, made at the end of the ETL process, is used to refresh the aggregates
  • in response to an argument from Boris, Bob said that you can kick off rebuilds (refreshes) of all aggregates, but it's much easier to be selective when you know the data
  • Bob demonstrated what is called the "AtScale Design Center"
  • the process that takes place here – mapping, building cubes etc – essentially happens once
  • he commented that the demonstration he was running was over his cell phone
  • according to Bob, building models here is similar to using ERwin etc
  • AtScale only accesses data in the Hive warehouse
  • the primary push right now is to support non-Hadoop databases – today, AtScale only works with Hive in Hadoop
  • what he has seen is that customers want to access data in other databases (e.g. Teradata etc) and join this data with Hadoop data
  • someone in the audience asked whether the Hadoop cache is used by AtScale
  • in response, Bob indicated that the first question to ask is where the aggregates are to be stored, and the second is where the joins are to be executed
  • one of the core AtScale constructs is to leave the data where it is, but he commented that the data needs to be moved to perform operations such as this
  • according to a discussion Bob had with the AtScale CTO, it should be possible to hit any data source with AtScale
  • Bob demonstrated hitting a cube from Tableau
  • according to Bob, one of the "neat" features of AtScale is the "side car" – essentially a separate window in Chrome that shows the actual query being executed
  • in the demonstration, he emphasized how the query hit an aggregate, so it executed more quickly than it would have otherwise
  • he clarified by saying that AtScale will often clean up statement issues from Tableau etc – such as getting rid of CASE statements
  • according to Bob, the aggregate rebuild is really the only maintenance step – he clarified by saying that if a refresh is done, only the pertinent portion is rebuilt
  • the product used to have a REST API, but hitting it didn't indicate whether it was successful – just that the request was received – so he advocates use of their Java API instead
  • in addition, AtScale prefers that all calculations make use of cubes – mistakes are often made with Tableau
  • he referred to a database table at American Express that had 275k columns in it, but it was used for data science – this is not the type of work toward which AtScale is geared
  • someone in the audience mentioned a scenario involving quants, and Bob responded by saying that the data science folks use R and Pig
  • AtScale doesn't make use of EMR at all – it would just run on an edge box
  • the EMR that Hadoop makes use of doesn't go away, but AtScale doesn't provide hooks into it
  • Bob commented that they haven't figured out how to join between database products yet
  • at Sears, he saw that their goal was to completely replace all usage of Teradata with Hadoop
  • running Tableau against Hadoop is still a new notion – according to Bob, it didn't work very well until AtScale
  • he commented that it is common for folks moving from well-tuned RDBMSs to Hadoop to complain about performance
  • with the API provided by AtScale, all metadata can be accessed – an attempt is being made to be as agnostic as possible so that they don't end up customizing for products like Tableau
  • someone in the audience asked about support for Microsoft Excel, and he said they support it – unusual, because he doesn't see this particular support provided very often
  • someone else asked about Azure support, and he responded by saying "You know how many clients I've been to who use Azure? Zero!" – then he backtracked and said "One"
  • in response, this audience member got defensive and said that his product is available on both AWS and Azure
  • Bob commented that Azure credits are literally being given away since AWS has such a stronghold, but perhaps it might not be the same situation down the road
  • after Bob said that "Excel is the number one most popular business intelligence tool in the world today", I just shook my head
  • he followed up by saying that "If I were starting a business today I would not be advocating use of Excel"
  • Boris commented by saying that it is "so scary" that we are limiting ourselves to tools 30 years old
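
As a rough illustration of the aggregate idea behind the "adaptive cache" discussed above (this is not AtScale's actual implementation, and the table names, columns, and HDFS path are hypothetical), here is a minimal PySpark sketch that pre-computes a rollup into HDFS, answers a matching query from the aggregate instead of the fact table, and selectively rebuilds a single partition:

    # Rough sketch of the aggregate/"adaptive cache" idea (not AtScale's code):
    # pre-compute a rollup of a hypothetical sales fact table, persist it to
    # HDFS, and serve queries at or above that grain from the aggregate.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("aggregate-cache-demo")
             .enableHiveSupport()
             .getOrCreate())

    AGG_PATH = "hdfs:///warehouse/aggregates/sales_by_year_category"  # hypothetical path

    # Build the aggregate once, partitioned by year so refreshes can be selective
    (spark.table("sales")
          .join(spark.table("dates"), "date_key")
          .join(spark.table("products"), "product_key")
          .groupBy("year", "category")
          .agg(F.sum("amount").alias("total_sales"))
          .write.mode("overwrite")
          .partitionBy("year")
          .parquet(AGG_PATH))

    # A query at or above the aggregate's grain reads the small aggregate,
    # not the large fact table
    agg = spark.read.parquet(AGG_PATH)
    agg.groupBy("category").agg(F.sum("total_sales").alias("total_sales")).show()

    # Selective refresh: recompute and overwrite only the 2016 partition
    # after new data for that year lands
    (spark.table("sales")
          .join(spark.table("dates"), "date_key")
          .join(spark.table("products"), "product_key")
          .filter(F.col("year") == 2016)
          .groupBy("category")
          .agg(F.sum("amount").alias("total_sales"))
          .write.mode("overwrite")
          .parquet(AGG_PATH + "/year=2016"))

The two design points from the talk show up directly: choosing which facts go into the aggregate, and not rolling it up to too high a level (here, year and category) so that it can still answer the queries users actually ask.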

[Photo: river view from the Nokia offices]

