Chicago Apache Flink Meetup (CHAF): Hands-on Apache Flink Workshop (February 21, 2017)
From the promotional materials:
In this workshop, we will go over a hands-on tutorial highlighting how to build a Flink stream processor.
The workshop will show you how to:
1. Set up a stream consisting of Chicago Transit Authority (CTA) bus tracker data
2. Use different types of windowing techniques to generate metrics on the CTA
3. Sink the metrics to a queryable data sink
4. Turn all of the above into a deployable job that can be scaled up.
You will see all of Flink's major components in action: streams, sinks, windows, triggers, watermarks, etc.
A preconfigured virtual machine containing the workshop materials will be available for download prior to the meetup.
Pizza and Soda will be provided!
Joe Olson is a local area developer specializing in applying emerging technology to solve business problems, and is currently focusing on understanding the rapidly developing streaming space.
My personal notes:
- Joe has been working with Flink for 1 to 1.5 years now
- the only way to get a community going is to get folks involved
- most meetup sessions consist of PowerPoint slides
- streaming is hot right now…big data, IoT, and real-time ML are applying the pressure
- there are currently several emerging frameworks and philosophies that will likely converge at some point
- Flink is where Hadoop was back around 2009
- one way to possibly jump start adoption is through some hackathons
- what we are about to do:
- – ingest CTA bus tracker historical data set
- – use the Flink streaming API to transform this data set into a Kafka topic…this will provide a good overview of the basic streaming model
- – view the data inside Kafka
- – use a separate Flink job to process the data
- – load the data into keyed windows
- – fire the window on count or on timeout
- – aggregate the data when fired
- – sink the aggregated data
- it took Joe a long time to figure all of this out, so he wants to pass all of this on to folks
- the VM is Ubuntu 16.04, Flink 1.2, Kafka 0.9, Kafka Tool, IntelliJ IDEA 2016.3 Community Edition, Scala 2.11, Java 8, Gradle 3.2
- the data consists of one day's worth of CTA data in JSON format
- Joe commented that it is important to know the version of Scala being used when using Flink (the Flink artifacts are built per Scala version), otherwise it won't work
- the CTA has a public API that returns minute fixes of all of its buses
- a fix: {"vid": 1958, "tmstmp": "20150211 23:59", "lat": 41.880638122558594, "lon": -87.738800048828125, "hdg": 267, "pid": 949, "rt": "20", "des": "Austin", "pdist": 3429, "tablockid": "N20 -893", "tatripid": 1040830, "zone": "null”}
- just under 1m fixes are created per day…about 250MB uncompressed
- about 2 years of archived daily data also exists: https://github.com/jolson7168/ctaData
- a great many Flink jobs fall into this pattern (rough sketch after the list):
- – connect to a data source and set up a stream
- – do processing on each element in the stream
- – aggregating, filtering, buffering
- – sink the results of the processing (or start another stream)
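- (my own aside: a minimal sketch of that source → process → sink shape using the Flink Scala streaming API…the in-memory source and the print sink are placeholders I picked, not anything from the workshop)

```scala
import org.apache.flink.streaming.api.scala._

object PatternSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // 1. connect to a data source and set up a stream (a real job would read a
    //    file, a socket, or a Kafka topic here)
    val lines: DataStream[String] = env.fromElements("one fix", "", "another fix")

    // 2. do processing on each element: filtering, aggregating, buffering
    val cleaned = lines
      .filter(_.nonEmpty)
      .map(_.toUpperCase)

    // 3. sink the results of the processing (or feed them into another stream)
    cleaned.print()

    env.execute("source -> process -> sink pattern")
  }
}
```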
- at this point, Joe transitioned to demonstrations by first bringing up IntelliJ
- the code is in GitHub…first do a git pull to refresh the VM, as he made some changes after creating the image
- there are 2 sections of the project…a loader and a processor…the top of the project loads the data into Kafka and the bottom processes it
- Joe commented that a lot of stuff will get thrown at you in the Flink documentation
- typical steps for a job are to get data from Kafka, process the data, and then sink it
- FixProcessor is a basic job
- all of the build files are set up if you want to compile this into a JAR
- the notes provide the Gradle build command and Flink start command that are needed
- the Flink job manager can be accessed multiple ways, including via browser
- Kafka is strange in that Zookeeper needs to be started separately prior to starting Kafka
- Joe said that we can email or call him for issues, assistance, etc. with regard to this project
- he noted that everything is installed to the Ubuntu VM home directory
- a graphical way to see contents of Kafka is by using something called Kafka Tool
- create a Kafka topic for output
- there's not a lot you can do with the Flink UI…it's pretty primitive right now
- deploy the JAR to the Flink server via the Flink UI after it is done building
- Boris (one of the CHUG organizers) mentioned that zookeeperutils can be used to simplify
- the JSON file is in the home directory
- the CTA pushes out "half-assed XML files"…the provided JSON file is much better than the XML
- Flink is kind of unique in that it works like SQL in the sense that it determines an optimal plan as to how to run each job
- Joe commented that this is a very trivial job, but it will help you start understanding how Flink processes data
- JSON fields that are not being used are discarded on the way to Kafka
- letting this job run for a couple of minutes will get you up to 1m records
- to get this job to run faster, "just throw more processors at it"…cluster it
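- (my own aside: in Flink terms, "throw more processors at it" means raising the parallelism so each operator runs as several parallel instances across the cluster's task slots…the numbers below are arbitrary picks of mine)

```scala
import org.apache.flink.streaming.api.scala._

object ParallelismSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(4)                    // default parallelism for every operator in the job

    env.fromElements("a", "b", "c")
      .map(_.toUpperCase).setParallelism(8)  // or scale a single hot operator on its own
      .print()

    env.execute("parallelism sketch")
  }
}
```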
- job #1 takeaways (a rough sketch of the job follows this list):
- – very simple Flink job
- – connect to source (data file on OS)
- – filter (make sure valid JSON)
- – sink valid fixes to Kafka topic
- – compile into a JAR
- – deploy JAR on server
- – observe execution
- – check out the log files
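- (my own rough sketch of job #1 from memory, not Joe's actual code: read the CTA JSON file, drop lines that don't look like fixes, and push the rest to a Kafka topic…the file path, topic name, and the crude validity check are all assumptions of mine)

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer09
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

object LoadFixesIntoKafka {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // source: the day's worth of CTA fixes on the VM (path is a guess)
    val rawFixes = env.readTextFile("/home/flink/cta-fixes.json")

    // filter: keep only lines that look like JSON objects (a real job would run
    // them through an actual JSON parser such as Jackson or json4s)
    val validFixes = rawFixes.filter { line =>
      val t = line.trim
      t.startsWith("{") && t.endsWith("}")
    }

    // sink: push each valid fix into a Kafka topic as a plain string
    validFixes.addSink(new FlinkKafkaProducer09[String](
      "localhost:9092",         // broker list
      "cta-fixes",              // topic name (my own choice)
      new SimpleStringSchema()))

    env.execute("load CTA fixes into Kafka")
  }
}
```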
- Flink's logs can be loaded into Kibana to search them…don't mess around with the Flink log files directly
- Flink controls the connection to the source and the sink…the logging it does there is better than anything you'll be able to do yourself
- it will save you a lot of heartache in the long run if you just get used to how Flink logs
- Kafka takeaways (a consumer sketch follows this list):
- – start/stop Kafka and Zookeeper
- – observe what is in a Kafka topic
- – use Kafka for both the source and the sink for Flink
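- (my own aside: the source side of that last point could look like the sketch below…broker address, consumer group, and topic name are assumptions matching the loader sketch above)

```scala
import java.util.Properties
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

object ReadFixesFromKafka {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092")
    props.setProperty("group.id", "cta-fix-processor")   // my own group name

    // the topic written by the loader job becomes the input stream here
    val fixJson: DataStream[String] = env.addSource(
      new FlinkKafkaConsumer09[String]("cta-fixes", new SimpleStringSchema(), props))

    fixJson.print()   // placeholder; the real job parses and windows the fixes

    env.execute("read CTA fixes from Kafka")
  }
}
```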
- the new version of Kafka apparently adds timestamps to messages (alongside offsets), which should be more useful
- using Flink 1.2, which just came out…it supports Kafka 0.10
- state management is better in 1.2
- back to IntelliJ
- Joe then progressed to the next job
- need to get timestamp of each event
- Joe commented that he initially didn't respect the boundary between event time and processing time…when data comes in, you need to tell Flink which field holds the timestamp and what format it is in
- once the timestamp is obtained, you have an anchor of when the event took place
- Joe didn't understand the intricacies of events that took place in the past, so he didn't get this job fully working
- there are always 2 boundaries in Flink…the watermark is the point in event time up to which Flink assumes it has seen all of the data
- he didn't know how to simulate this with historical data
- this job shows how to get the timestamp and how to set the watermark, but doesn't work completely as of yet
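- (my own sketch of what "get the timestamp and set the watermark" can look like, not the workshop code: parse the fix's tmstmp field and emit watermarks that lag the newest timestamp by a fixed slack to tolerate out-of-order fixes…the 2-minute slack is an arbitrary choice, and Fix is the case class sketched earlier)

```scala
import java.text.SimpleDateFormat
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.windowing.time.Time

class FixTimestampExtractor
    extends BoundedOutOfOrdernessTimestampExtractor[Fix](Time.minutes(2)) {

  // turn "20150211 23:59" into epoch millis; this becomes the event time
  override def extractTimestamp(fix: Fix): Long = {
    val fmt = new SimpleDateFormat("yyyyMMdd HH:mm")
    fmt.parse(fix.tmstmp).getTime
  }
}

// wiring it in (fixStream is a DataStream[Fix]):
//   env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
//   val withTimestamps = fixStream.assignTimestampsAndWatermarks(new FixTimestampExtractor)
```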
- there are 3 different window types and different strategies as to how you can use them
- the windowing feature in Flink is one of its greatest strengths
- Kafka is starting to go in the same direction as Flink, but Flink has the jump on this by far
- Boris said Kafka went in the opposite, simpler direction because it was too complex
- we don't know who is going to win this contest, but the ideas will remain
- try to follow all of the open source products in this space as to what they're trying to do, and you will get a good idea as to where the next 10 years will go
- someone said that setting up Kafka was the hardest part…Boris argued that it is the easiest part (and I concur with Boris that setting up Kafka is simple)
- the 2 things to take away are that there is a trigger that decides when the window fires, and that you can override it with a trigger of your own
- it took Joe a long time to figure out how to do this
- there are different ways to handle late data, out of order data…Flink handles these types of cases well
- for example, you could create a 1 hour window, but keep it open a few more minutes to catch any stragglers
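- (my own sketch, not the workshop's code, of a custom trigger that fires a keyed event-time window either when it has buffered a given number of elements or when the watermark passes the end of the window…the class name, the count threshold, and the lateness value in the wiring comment are all mine)

```scala
import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.api.common.state.ReducingStateDescriptor
import org.apache.flink.api.common.typeutils.base.LongSerializer
import org.apache.flink.streaming.api.windowing.triggers.{Trigger, TriggerResult}
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

// fire on element count OR on event-time timeout, whichever comes first
class CountOrTimeoutTrigger[T](maxCount: Long) extends Trigger[T, TimeWindow] {

  // per-key, per-window element counter kept in Flink-managed (partitioned) state
  private val countDesc = new ReducingStateDescriptor[java.lang.Long](
    "element-count",
    new ReduceFunction[java.lang.Long] {
      override def reduce(a: java.lang.Long, b: java.lang.Long): java.lang.Long = a + b
    },
    LongSerializer.INSTANCE)

  override def onElement(element: T, timestamp: Long, window: TimeWindow,
                         ctx: Trigger.TriggerContext): TriggerResult = {
    // timeout path: fire when the watermark reaches the end of the window
    ctx.registerEventTimeTimer(window.maxTimestamp())
    val count = ctx.getPartitionedState(countDesc)
    count.add(1L)
    if (count.get() >= maxCount) {
      count.clear()
      TriggerResult.FIRE_AND_PURGE           // count path: threshold reached
    } else {
      TriggerResult.CONTINUE
    }
  }

  override def onEventTime(time: Long, window: TimeWindow,
                           ctx: Trigger.TriggerContext): TriggerResult =
    TriggerResult.FIRE_AND_PURGE             // the timeout fired

  override def onProcessingTime(time: Long, window: TimeWindow,
                                ctx: Trigger.TriggerContext): TriggerResult =
    TriggerResult.CONTINUE                   // purely event-time driven

  override def clear(window: TimeWindow, ctx: Trigger.TriggerContext): Unit = {
    ctx.getPartitionedState(countDesc).clear()
    ctx.deleteEventTimeTimer(window.maxTimestamp())
  }
}

// wiring it in, including a grace period for stragglers:
//   keyedFixes
//     .window(TumblingEventTimeWindows.of(Time.hours(1)))
//     .allowedLateness(Time.minutes(5))
//     .trigger(new CountOrTimeoutTrigger[Fix](1000))
```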
- next month's meetup will be grabbing CTA data in real time and applying windows against it
- Flink will give you guarantees
- if you are managing state, you can use RocksDB…it is optimized for this…RocksDB is the heavy hitter for managing state (sketch a couple of notes below)
- Boris commented that RocksDB is local to the node it runs on, and it confused the heck out of him…people want to call this state, but it isn't state
- you will run into a lot of trouble if you don't play by the rules that Flink sets, or if you try to use it for something it wasn't designed to do…they joked that this is because it's German
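- (my own aside: pointing a job at RocksDB is just a state-backend setting…the checkpoint URI and interval below are placeholders of mine, and the flink-statebackend-rocksdb dependency has to be on the classpath)

```scala
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.streaming.api.scala._

object RocksDbBackendSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // keyed/window state lives in an embedded RocksDB instance on each
    // TaskManager; checkpoints of that state go to the URI below
    env.setStateBackend(new RocksDBStateBackend("file:///tmp/flink-checkpoints"))
    env.enableCheckpointing(60000)   // checkpoint every 60s (arbitrary interval)
    // ... build the pipeline as usual ...
  }
}
```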
- Data Artisans wants to make Flink the database itself…a queryable state feature
- the table is a stream…the stream is a table
- Boris joked that nobody knows what this means, but they all want to do it…Boris said that the stream is not a database, it's a log
- Boris said that his "head spins around" when folks start talking about "infinite streaming" etc
- for debugging purposes, you can just run within IntelliJ using Gradle
- someone asked about unit testing strategies, and he said that they are struggling with this right now
- Boris disagreed…he said that people struggled with Hadoop a few years ago in a similar way
- job #2 takeaways (an end-to-end sketch follows this list):
- – another simple Flink job
- – connect to source (JSON in Kafka)
- – create a fix job from JSON
- – identify the timestamp from the fix
- – key the stream
- – send fixes into windows based on key
- – trigger the window by count or by timeout
- – aggregate items in window on triggering
- – sink aggregate results to PostgreSQL
- – using a custom window
- – using a custom trigger…preserving state
- – using a custom aggregation
- – sinking to a relational database
- – run from within IntelliJ…useful for rapid debugging
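- (my own end-to-end sketch of how job #2's pieces could plug together, reusing the Fix case class, FixTimestampExtractor, and CountOrTimeoutTrigger sketched earlier…the JSON parsing is left as a placeholder, and the key choice (route), the aggregate (fix count per route), the table, and the connection details are all assumptions of mine)

```scala
import java.sql.{Connection, DriverManager, PreparedStatement}
import java.util.Properties
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

object FixProcessorSketch {

  // placeholder: a real job would parse the JSON with a library (Jackson, json4s, ...)
  def parseFix(json: String): Fix = ???

  // simple JDBC sink writing one row per fired window to PostgreSQL
  // (needs the PostgreSQL driver on the classpath; table and credentials are made up)
  class PostgresSink extends RichSinkFunction[(String, Long)] {
    private var conn: Connection = _
    private var stmt: PreparedStatement = _

    override def open(parameters: Configuration): Unit = {
      conn = DriverManager.getConnection("jdbc:postgresql://localhost:5432/cta", "cta", "cta")
      stmt = conn.prepareStatement("INSERT INTO route_counts (route, fix_count) VALUES (?, ?)")
    }

    override def invoke(value: (String, Long)): Unit = {
      stmt.setString(1, value._1)
      stmt.setLong(2, value._2)
      stmt.executeUpdate()
    }

    override def close(): Unit = {
      if (stmt != null) stmt.close()
      if (conn != null) conn.close()
    }
  }

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092")
    props.setProperty("group.id", "cta-fix-processor")

    env
      .addSource(new FlinkKafkaConsumer09[String]("cta-fixes", new SimpleStringSchema(), props))
      .map(line => parseFix(line))                              // JSON -> Fix
      .assignTimestampsAndWatermarks(new FixTimestampExtractor) // identify the timestamp, set watermarks
      .map(fix => (fix.rt, 1L))                                 // key the stream by route
      .keyBy(_._1)
      .window(TumblingEventTimeWindows.of(Time.hours(1)))       // send fixes into windows per key
      .trigger(new CountOrTimeoutTrigger[(String, Long)](1000)) // fire by count or by timeout
      .reduce((a, b) => (a._1, a._2 + b._2))                    // aggregate the items in the window
      .addSink(new PostgresSink)                                // sink the aggregates to PostgreSQL

    env.execute("CTA fix processor (sketch)")
  }
}
```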
- Joe has Flink in production…doesn't know anyone else that does
- Joe encouraged us to present at future meetup sessions, even if it is to present something small for 5 or 10 minutes