Chicago Apache Flink Meetup (CHAF): Hands-on Apache Flink Workshop (February 21, 2017)

From the promotional materials: 

In this workshop, we will go through a hands-on tutorial on building a Flink stream processor.

The workshop will show you how to:

1. Set up a stream consisting of Chicago Transit Authority (CTA) bus tracker data

2. Use different types of windowing techniques to generate metrics on the CTA

3. Sink the metrics to a queryable data sink

4. Turn all of the above into a deployable job that can be scaled up.

You will see all of Flink's major components in action: streams, sinks, windows, triggers, watermarks, etc.

A preconfigured virtual machine containing the workshop materials will be available for download prior to the meetup.

Pizza and soda will be provided!

Joe Olson is a local area developer specializing in applying emerging technology to solve business problems, and is currently focusing on understanding the rapidly developing streaming space.


My personal notes:

  • Joe has been working with Flink for 1 to 1.5 years now
  • the only way to get a community going is to get folks involved
  • most meetup groups consist of PowerPoint slides
  • streaming is hot right now…big data, IoT, and real-time ML are applying the pressure
  • there are currently several emerging frameworks and philosophies that will likely converge at some point
  • Flink is where Hadoop was back around 2009
  • one way to possibly jump start adoption is through some hackathons
  • what we are about to do:
  • – ingest CTA bus tracker historical data set
  • – use the Flink streaming API to transform this data set into a Kafka topic…this will provide a good overview of the basic streaming model
  • – view the data inside Kafka
  • – use a separate Flink job to process the data
  • – load the data into keyed windows
  • – fire the window on count or on timeout
  • – aggregate the data when fired
  • – sink the aggregated data
  • it took Joe a long time to figure all of this out, so he wants to pass all of this on to folks
  • the VM is Ubuntu 16.04, Flink 1.2, Kafka 0.9, Kafkatool, IntelliJ 2016.3 Community Edition, Scala 2.11, Java 8, Gradle 3.2
  • the data consists of one day's worth of CTA data in JSON format
  • Joe commented that it is important to know the version of Scala being used when using Flink, otherwise it won't work
  • the CTA has a public API that returns minute fixes of all of its buses
  • a fix: {"vid": 1958, "tmstmp": "20150211 23:59", "lat": 41.880638122558594, "lon": -87.738800048828125, "hdg": 267, "pid": 949, "rt": "20", "des": "Austin", "pdist": 3429, "tablockid": "N20 -893", "tatripid": 1040830, "zone": "null"}
  • just under 1m fixes are created per day…about 250MB uncompressed
  • about 2 years of archived daily data also exists: https://github.com/jolson7168/ctaData
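
As an aside, a fix like the one above maps neatly onto a small Scala case class. Here's a sketch of my own (the class name and the field comments are my guesses at the semantics, not necessarily how Joe's repo models it):

```scala
// Hypothetical case class mirroring one CTA bus-tracker fix.
// Field names follow the JSON keys shown in the sample above.
case class Fix(
  vid: Int,          // vehicle id
  tmstmp: String,    // "yyyyMMdd HH:mm" timestamp of the fix
  lat: Double,       // latitude
  lon: Double,       // longitude
  hdg: Int,          // heading in degrees
  pid: Int,          // pattern id
  rt: String,        // route, e.g. "20"
  des: String,       // destination, e.g. "Austin"
  pdist: Int,        // distance traveled along the pattern
  tablockid: String, // block id
  tatripid: Long,    // trip id
  zone: String       // zone, often "null" in this data
)
```
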
  • a good many Flink jobs fall into this pattern (sketched in code just after this list):
  • – connect to a data source and set up a stream
  • – do processing on each element in the stream
  • – aggregating, filtering, buffering
  • – sink results of processing (or start another stream)
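
To make that shape concrete, here's a minimal sketch against the Flink 1.2 streaming API (the file path and the trivial processing step are placeholders of mine, not workshop code):

```scala
import org.apache.flink.streaming.api.scala._

object PatternSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // 1. connect to a data source and set up a stream
    val lines: DataStream[String] = env.readTextFile("/path/to/fixes.json") // placeholder path

    // 2. process each element in the stream (filtering, transforming, buffering, ...)
    val cleaned: DataStream[String] = lines
      .filter(_.nonEmpty)
      .map(_.trim)

    // 3. sink the results (here just stdout; a real job sinks to Kafka, a database, etc.)
    cleaned.print()

    env.execute("source -> process -> sink sketch")
  }
}
```
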
  • at this point, Joe transitioned to demonstrations by first bringing up IntelliJ
  • the code is on GitHub…first do a git pull to refresh the VM, as he made some changes after creating the image
  • there are 2 sections of the project…a loader and a processor…the top of the project loads Kafka and the bottom processes the data
  • Joe commented that a lot of stuff will get thrown at you in the Flink documentation
  • typical steps for a job are to get data from Kafka, process the data, and then sink it
  • FixProcessor is a basic job
  • all of the build files are set up if you want to compile this into a JAR
  • the notes provide the Gradle build command and Flink start command that are needed
  • the Flink job manager can be accessed multiple ways, including via browser
  • Kafka is strange in that Zookeeper needs to be started separately prior to starting Kafka
  • Joe said that we can email or call him with issues, assistance etc with regard to this project
  • he noted that everything is installed to the Ubuntu VM home directory
  • a graphical way to see contents of Kafka is by using something called Kafka Tool
  • create a Kafka topic for output
  • there's not a lot you can do with the Flink UI…it's pretty primitive right now
  • deploy the JAR to the Flink server via the Flink UI after it is done building
  • Boris (one of the CHUG organizers) mentioned that zookeeperutils can be used to simplify
  • the JSON file is in the home directory
  • the CTA pushes out "half-assed XML files"…the provided JSON file is much better than the XML
  • Flink is kind of unique in that it works like SQL in the sense that it determines an optimal plan as to how to run each job
  • Joe commented that this is a very trivial job, but it will help you start understanding how Flink processes data
  • JSON fields that are not being used are discarded on the way to Kafka
  • letting this job run for a couple minutes will get up to 1m records
  • to get this job to run faster, "just throw more processors at it"…cluster it
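
On the "throw more processors at it" point: scaling mostly comes down to raising the job's parallelism, either for the whole job or per operator. A quick sketch (the numbers and the path are arbitrary):

```scala
import org.apache.flink.streaming.api.scala._

object ParallelismSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // default parallelism for every operator in this job (4 is an arbitrary example)
    env.setParallelism(4)

    env.readTextFile("/path/to/fixes.json") // placeholder path
      .map(_.trim).setParallelism(8)        // or override it on a single operator
      .print()

    env.execute("parallelism sketch")
  }
}
```
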
  • job #1 takeaways:
  • – very simple Flink job
  • – connect to source (data file on OS)
  • – filter (make sure valid JSON)
  • – sink valid fixes to Kafka topic
  • – compile into a JAR
  • – deploy JAR on server
  • – observe execution
  • – check out the log files
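
Putting those takeaways together, my stripped-down reading of job #1 looks roughly like this (Flink 1.2 with the Kafka 0.9 connector; the file path, broker address, topic name, and the crude "looks like JSON" check are assumptions of mine, not Joe's actual code):

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer09
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

object LoadFixesIntoKafka {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // source: the one-day CTA JSON file on the VM (placeholder path)
    val lines = env.readTextFile("/home/ubuntu/fixes.json")

    // filter: a crude "looks like JSON" check -- a real job would actually parse it
    val validFixes = lines.filter { l =>
      val t = l.trim
      t.startsWith("{") && t.endsWith("}")
    }

    // sink: write each valid fix to a Kafka 0.9 topic (broker and topic name are placeholders)
    validFixes.addSink(
      new FlinkKafkaProducer09[String]("localhost:9092", "cta-fixes", new SimpleStringSchema())
    )

    env.execute("Load CTA fixes into Kafka")
  }
}
```
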
  • Flink produces logs that you can load into Kibana to search…don't mess around with the Flink log files directly
  • Flink controls the connection to the source and the sink…the logging it's doing is better than anything you'll be able to do
  • it will save you a lot of heartache in the long run if you just get used to how Flink logs
  • Kafka takeaways:
  • – start/stop Kafka and Zookeeper
  • – observe what is in a Kafka topic
  • – use Kafka for both the source and the sink for Flink
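
For the "Kafka as both source and sink" takeaway, the consumer side just needs a small Properties bag; the producer side is the same FlinkKafkaProducer09 call as in the job #1 sketch. Again, the broker, group id, and topic are placeholders of mine:

```scala
import java.util.Properties

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

object ReadFixesFromKafka {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092") // placeholder broker
    props.setProperty("group.id", "cta-workshop")            // placeholder consumer group

    // source: the topic that job #1 filled; adding a FlinkKafkaProducer09 sink
    // on the other end would make Kafka both the source and the sink
    env.addSource(new FlinkKafkaConsumer09[String]("cta-fixes", new SimpleStringSchema(), props))
      .print()

    env.execute("Read CTA fixes from Kafka")
  }
}
```
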
  • the new version of Kafka apparently replaces offsets with timestamps, which should be more useful
  • using Flink 1.2, which just came out…it supports Kafka 0.10
  • state management is better in 1.2
  • back to IntelliJ
  • Joe then progressed to the next job
  • need to get timestamp of each event
  • Joe commented that he initially didn't respect the boundary between event time and processing time…when data comes in, you need to tell Flink the timestamp and the format it is in
  • once the timestamp is obtained, you have an anchor of when the event took place
  • Joe didn't understand the intricacies of events that took place in the past, so he didn't get this job done
  • there are always two boundaries in Flink…the watermark is the latest point in time that Flink knows about
  • he didn't know how to simulate this with historical data
  • this job shows how to get the timestamp and how to set the watermark, but doesn't work completely as of yet
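
The timestamp-and-watermark wiring he was describing looks roughly like this in the Flink 1.2 Scala API. This is my sketch, not the workshop job: the cut-down Fix case class, the timestamp format, and the 5-minute out-of-orderness bound are all assumptions:

```scala
import java.text.SimpleDateFormat

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object EventTimeSketch {
  case class Fix(vid: Int, rt: String, tmstmp: String) // cut-down hypothetical fix

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // tell Flink we care about event time, not processing time
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    // a single sample element so the sketch runs stand-alone
    val fixes: DataStream[Fix] = env.fromElements(Fix(1958, "20", "20150211 23:59"))

    // extract the event timestamp from each fix and emit watermarks that trail the
    // largest timestamp seen by 5 minutes, tolerating out-of-order data
    val withTimestamps = fixes.assignTimestampsAndWatermarks(
      new BoundedOutOfOrdernessTimestampExtractor[Fix](Time.minutes(5)) {
        override def extractTimestamp(fix: Fix): Long =
          new SimpleDateFormat("yyyyMMdd HH:mm").parse(fix.tmstmp).getTime
      }
    )

    withTimestamps.print()
    env.execute("event time + watermark sketch")
  }
}
```
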
  • there are 3 different window types and different strategies as to how you can use them
  • the windowing feature in Flink is one of its greatest strengths
  • Kafka is starting to go in the same direction as Flink, but Flink has the jump on this by far
  • Boris said Kafka went in the opposite, simpler direction because it was too complex
  • we don't know who is going to win this contest, but the ideas will remain
  • try to follow all of the open source products in this space as to what they're trying to do, and you will get a good idea as to where the next 10 years will go
  • someone said that setting up Kafka was the hardest part…Boris argued that it is the easiest part (and I concur with Boris that setting up Kafka is simple)
  • the two things to take away are that there is a trigger that determines when the window fires, and that you can override it with your own trigger
  • it took Joe a long time to figure out how to do this
  • there are different ways to handle late data, out of order data…Flink handles these types of cases well
  • for example, you could create a 1 hour window, but keep it open a few more minutes to catch any stragglers
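
That "one-hour window held open a few extra minutes for stragglers" example maps directly onto Flink's allowedLateness setting. A sketch of mine, reusing the cut-down Fix type from above (the stream is assumed to already have timestamps and watermarks assigned; the 5-minute figure is arbitrary):

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

object LateDataSketch {
  case class Fix(vid: Int, rt: String, tmstmp: String) // cut-down hypothetical fix

  // count fixes per route in one-hour event-time windows, tolerating late data
  def fixesPerRoutePerHour(fixes: DataStream[Fix]): DataStream[(String, Int)] =
    fixes
      .map(fix => (fix.rt, 1))                             // (route, 1) pairs
      .keyBy(_._1)                                         // key the stream by route
      .window(TumblingEventTimeWindows.of(Time.hours(1)))  // one-hour event-time windows
      .allowedLateness(Time.minutes(5))                    // keep each window around 5 extra minutes for stragglers
      .sum(1)                                              // count fixes per route per window
}
```
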
  • next month's meetup will be grabbing CTA data in real time and applying windows against it
  • Flink will give you guarantees
  • if you are managing state, you can use RocksDB…it is optimized for this…RocksDB is the heavy hitter for managing state
  • Boris commented that RocksDB is local to the CPU, and it confused the heck out of him…people want to call this state, but it isn't state
  • you will run into a lot of trouble if you don't play by the rules that Flink sets, or try to use it for what it wasn't designed to do…they joked that this is because it's German
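
For reference, switching a job over to the RocksDB state backend is only a couple of lines on the environment (a sketch; the checkpoint path is a placeholder, and the flink-statebackend-rocksdb dependency has to be on the classpath):

```scala
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.streaming.api.scala._

object RocksDbSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // keep operator state in RocksDB, with durable checkpoint copies under this URI (placeholder)
    env.setStateBackend(new RocksDBStateBackend("file:///home/ubuntu/flink-checkpoints"))

    // checkpoint every 60 seconds so the backend actually has something to persist
    env.enableCheckpointing(60000)

    // ... define sources, transformations, sinks, and call env.execute() as usual ...
  }
}
```
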
  • Data Artisans wants to make Flink the database itself…a queryable state feature
  • the table is a stream…the stream is a table
  • Boris joked that nobody knows what this means, but they all want to do it…Boris said that the stream is not a database, it's a log
  • Boris said that his "head spins around" when folks start talking about "infinite streaming" etc
  • for debugging purposes, you can just run within IntelliJ using Gradle
  • someone asked about unit testing strategies, and he said that they are struggling with this right now
  • Boris disagreed…he said that people struggled with Hadoop a few years ago in a similar way
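
Going back to the "just run it from IntelliJ" point a couple of bullets up: launching a job's main method from the IDE runs it on an in-process mini cluster, and you can also ask for one explicitly. A tiny sketch:

```scala
import org.apache.flink.streaming.api.scala._

object LocalDebugSketch {
  def main(args: Array[String]): Unit = {
    // an explicit in-process environment -- handy for breakpoints and quick iterations
    val env = StreamExecutionEnvironment.createLocalEnvironment()

    env.fromElements("one fix", "another fix")
      .map(_.toUpperCase)
      .print()

    env.execute("local debug run")
  }
}
```
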
  • job #2 takeaways:
  • – another simple Flink job
  • – connect to source (JSON in Kafka)
  • – create a fix job from JSON
  • – identify the timestamp from the fix
  • – key the stream
  • – send fixes into windows based on key
  • – trigger the window by count or by timeout
  • – aggregate items in window on triggering
  • – sink aggregate results to PostgreSQL
  • – using a custom window
  • – using a custom trigger…preserving state
  • – using a custom aggregation
  • – sinking to a relational database
  • – run from within IntelliJ…useful for rapid debugging
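
To make the "fire on count or on timeout" and "sink to PostgreSQL" pieces concrete, here's my own much-simplified sketch of a custom trigger and a bare-bones JDBC sink. The class names, the count threshold, the table and column names, and the connection details are all assumptions, not Joe's code:

```scala
import java.sql.{Connection, DriverManager, PreparedStatement}

import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.api.common.state.ReducingStateDescriptor
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction
import org.apache.flink.streaming.api.windowing.triggers.{Trigger, TriggerResult}
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

// Fires (and purges) the window either when maxCount elements have arrived or when
// the window's own event-time timer goes off, whichever comes first.
class CountOrTimeoutTrigger[T](maxCount: Long) extends Trigger[T, TimeWindow] {

  private val countDesc = new ReducingStateDescriptor[java.lang.Long](
    "count",
    new ReduceFunction[java.lang.Long] {
      override def reduce(a: java.lang.Long, b: java.lang.Long): java.lang.Long = a + b
    },
    classOf[java.lang.Long])

  override def onElement(element: T, timestamp: Long, window: TimeWindow,
                         ctx: Trigger.TriggerContext): TriggerResult = {
    ctx.registerEventTimeTimer(window.maxTimestamp()) // the "timeout": the end of the window
    val count = ctx.getPartitionedState(countDesc)
    count.add(1L)
    if (count.get() >= maxCount) {
      count.clear()
      TriggerResult.FIRE_AND_PURGE // fire early on count
    } else {
      TriggerResult.CONTINUE
    }
  }

  override def onEventTime(time: Long, window: TimeWindow,
                           ctx: Trigger.TriggerContext): TriggerResult =
    if (time == window.maxTimestamp()) TriggerResult.FIRE_AND_PURGE // fire on timeout
    else TriggerResult.CONTINUE

  override def onProcessingTime(time: Long, window: TimeWindow,
                                ctx: Trigger.TriggerContext): TriggerResult =
    TriggerResult.CONTINUE

  override def clear(window: TimeWindow, ctx: Trigger.TriggerContext): Unit = {
    ctx.getPartitionedState(countDesc).clear()
    ctx.deleteEventTimeTimer(window.maxTimestamp())
  }
}

// Minimal PostgreSQL sink: one INSERT per aggregated (route, count) record.
class PostgresSink(url: String, user: String, password: String)
    extends RichSinkFunction[(String, Int)] {

  @transient private var conn: Connection = _
  @transient private var stmt: PreparedStatement = _

  override def open(parameters: Configuration): Unit = {
    conn = DriverManager.getConnection(url, user, password)
    stmt = conn.prepareStatement("INSERT INTO route_counts (rt, fixes) VALUES (?, ?)") // assumed table
  }

  override def invoke(value: (String, Int)): Unit = {
    stmt.setString(1, value._1)
    stmt.setInt(2, value._2)
    stmt.executeUpdate()
  }

  override def close(): Unit = {
    if (stmt != null) stmt.close()
    if (conn != null) conn.close()
  }
}
```

Wiring this up would then look like the windowed stream from the earlier lateness sketch with .trigger(new CountOrTimeoutTrigger[(String, Int)](1000)) inserted before the aggregation, and .addSink(new PostgresSink(url, user, password)) on the aggregated result; the 1000-element threshold is made up, and a real sink would batch its inserts rather than issue one per record.
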
  • Joe has Flink in production…doesn't know anyone else that does
  • Joe encouraged us to present at future meetup sessions, even if it is to present something small for 5 or 10 minutes
