Chicago Apache Flink Meetup (CHAF): Hands-on Apache Flink Workshop (February 21, 2017)

From the promotional materials: 

In this workshop, we will go through a hands-on tutorial on building a Flink stream processor.

The workshop will show you how to:

1. Set up a stream consisting of Chicago Transit Authority (CTA) bus tracker data

2. Use different types of windowing techniques to generate metrics on the CTA

3. Sink the metrics to a queryable data sink

4. Turn all of the above into a deployable job that can be scaled up.

You will see all of Flink's major components in action: streams, sinks, windows, triggers, watermarks, etc.

A preconfigured virtual machine containing the workshop materials will be available for download prior to the meetup.

Pizza and soda will be provided!

Joe Olson is a local area developer specializing in applying emerging technology to solve business problems, and is currently focusing on understanding the rapidly developing streaming space.


My personal notes:

  • Joe has been working with Flink for 1 to 1.5 years now
  • the only way to get a community going is to get folks involved
  • most meetup groups consist of PowerPoint slides
  • streaming is hot right now…big data, IoT, and real-time ML are applying the pressure
  • there are currently several emerging frameworks and philosophies that will likely converge at some point
  • Flink is where Hadoop was back around 2009
  • one way to possibly jump start adoption is through some hackathons
  • what we are about to do:
  • – ingest CTA bus tracker historical data set
  • – use the Flink streaming API to transform this data set into a Kafka topic…this will provide a good overview of the basic streaming model
  • – view the data inside Kafka
  • – use a separate Flink job to process the data
  • – load the data into keyed windows
  • – fire the window on count or on timeout
  • – aggregate the data when fired
  • – sink the aggregated data
  • it took Joe a long time to figure all of this out, so he wants to pass all of this on to folks
  • the VM is Ubuntu 16.04, Flink 1.2, Kafka 0.9, Kafkatool, IntelliJ 2016.3 Community Edition, Scala 2.11, Java 8, Gradle 3.2
  • the data consists of one day's worth of CTA data in JSON format
  • Joe commented that it is important to know the version of Scala being used when using Flink, otherwise it won't work
  • the CTA has a public API that returns minute fixes of all of its buses
  • a fix: {"vid": 1958, "tmstmp": "20150211 23:59", "lat": 41.880638122558594, "lon": -87.738800048828125, "hdg": 267, "pid": 949, "rt": "20", "des": "Austin", "pdist": 3429, "tablockid": "N20 -893", "tatripid": 1040830, "zone": "null"}
  • just under 1m fixes are created per day…about 250MB uncompressed
  • about 2 years of archived daily data also exists: https://github.com/jolson7168/ctaData
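
As an aside, a fix like the one above maps neatly onto a small Scala case class. Here's a sketch of my own (the class name and the field comments are my guesses at the semantics, not necessarily how Joe's repo models it):

```scala
// Hypothetical case class mirroring one CTA bus-tracker fix.
// Field names follow the JSON keys shown in the sample above.
case class Fix(
  vid: Int,          // vehicle id
  tmstmp: String,    // "yyyyMMdd HH:mm" timestamp of the fix
  lat: Double,       // latitude
  lon: Double,       // longitude
  hdg: Int,          // heading in degrees
  pid: Int,          // pattern id
  rt: String,        // route, e.g. "20"
  des: String,       // destination, e.g. "Austin"
  pdist: Int,        // distance traveled along the pattern
  tablockid: String, // block id
  tatripid: Long,    // trip id
  zone: String       // zone, often "null" in this data
)
```
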
  • a good many Flink jobs fall into this pattern (sketched in code just after this list):
  • – connect to a data source and set up a stream
  • – do processing on each element in the stream
  • – aggregating, filtering, buffering
  • – sink results of processing (or start another stream)
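
To make that shape concrete, here's a minimal sketch against the Flink 1.2 streaming API (the file path and the trivial processing step are placeholders of mine, not workshop code):

```scala
import org.apache.flink.streaming.api.scala._

object PatternSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // 1. connect to a data source and set up a stream
    val lines: DataStream[String] = env.readTextFile("/path/to/fixes.json") // placeholder path

    // 2. process each element in the stream (filtering, transforming, buffering, ...)
    val cleaned: DataStream[String] = lines
      .filter(_.nonEmpty)
      .map(_.trim)

    // 3. sink the results (here just stdout; a real job sinks to Kafka, a database, etc.)
    cleaned.print()

    env.execute("source -> process -> sink sketch")
  }
}
```
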
  • at this point, Joe transitioned to demonstrations by first bringing up IntelliJ
  • the code is on GitHub…first do a git pull to refresh the VM, as he made some changes after creating the image
  • there are 2 sections of the project…a loader and a processor…the top of the project loads Kafka and the bottom processes the data
  • Joe commented that a lot of stuff will get thrown at you in the Flink documentation
  • typical steps for a job are to get data from Kafka, process the data, and then sink it
  • FixProcessor is a basic job
  • all of the build files are set up if you want to compile this into a JAR
  • the notes provide the Gradle build command and Flink start command that are needed
  • the Flink job manager can be accessed multiple ways, including via browser
  • Kafka is strange in that Zookeeper needs to be started separately prior to starting Kafka
  • Joe said that we can email or call him with issues, assistance etc with regard to this project
  • he noted that everything is installed to the Ubuntu VM home directory
  • a graphical way to see contents of Kafka is by using something called Kafka Tool
  • create a Kafka topic for output
  • there's not a lot you can do with the Flink UI…it's pretty primitive right now
  • deploy the JAR to the Flink server via the Flink UI after it is done building
  • Boris (one of the CHUG organizers) mentioned that zookeeperutils can be used to simplify
  • the JSON file is in the home directory
  • the CTA pushes out "half-assed XML files"…the provided JSON file is much better than the XML
  • Flink is kind of unique in that it works like SQL in the sense that it determines an optimal plan as to how to run each job
  • Joe commented that this is a very trivial job, but it will help you start understanding how Flink processes data
  • JSON fields that are not being used are discarded on the way to Kafka
  • letting this job run for a couple minutes will get up to 1m records
  • to get this job to run faster, "just throw more processors at it"…cluster it
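
On the "throw more processors at it" point: scaling mostly comes down to raising the job's parallelism, either for the whole job or per operator. A quick sketch (the numbers and the path are arbitrary):

```scala
import org.apache.flink.streaming.api.scala._

object ParallelismSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // default parallelism for every operator in this job (4 is an arbitrary example)
    env.setParallelism(4)

    env.readTextFile("/path/to/fixes.json") // placeholder path
      .map(_.trim).setParallelism(8)        // or override it on a single operator
      .print()

    env.execute("parallelism sketch")
  }
}
```
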
  • job #1 takeaways:
  • – very simple Flink job
  • – connect to source (data file on OS)
  • – filter (make sure valid JSON)
  • – sink valid fixes to Kafka topic
  • – compile into a JAR
  • – deploy JAR on server
  • – observe execution
  • – check out the log files
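
Putting those takeaways together, my stripped-down reading of job #1 looks roughly like this (Flink 1.2 with the Kafka 0.9 connector; the file path, broker address, topic name, and the crude "looks like JSON" check are assumptions of mine, not Joe's actual code):

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer09
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

object LoadFixesIntoKafka {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // source: the one-day CTA JSON file on the VM (placeholder path)
    val lines = env.readTextFile("/home/ubuntu/fixes.json")

    // filter: a crude "looks like JSON" check -- a real job would actually parse it
    val validFixes = lines.filter { l =>
      val t = l.trim
      t.startsWith("{") && t.endsWith("}")
    }

    // sink: write each valid fix to a Kafka 0.9 topic (broker and topic name are placeholders)
    validFixes.addSink(
      new FlinkKafkaProducer09[String]("localhost:9092", "cta-fixes", new SimpleStringSchema())
    )

    env.execute("Load CTA fixes into Kafka")
  }
}
```
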
  • Flink produces logs that you can load into Kibana to search…don't mess around with the Flink log files directly
  • Flink controls the connection to the source and the sink…the logging it's doing is better than anything you'll be able to do
  • it will save you a lot of heartache in the long run if you just get used to how Flink logs
  • Kafka takeaways:
  • – start/stop Kafka and Zookeeper
  • – observe what is in a Kafka topic
  • – use Kafka for both the source and the sink for Flink
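
For the "Kafka as both source and sink" takeaway, the consumer side just needs a small Properties bag; the producer side is the same FlinkKafkaProducer09 call as in the job #1 sketch. Again, the broker, group id, and topic are placeholders of mine:

```scala
import java.util.Properties

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

object ReadFixesFromKafka {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092") // placeholder broker
    props.setProperty("group.id", "cta-workshop")            // placeholder consumer group

    // source: the topic that job #1 filled; adding a FlinkKafkaProducer09 sink
    // on the other end would make Kafka both the source and the sink
    env.addSource(new FlinkKafkaConsumer09[String]("cta-fixes", new SimpleStringSchema(), props))
      .print()

    env.execute("Read CTA fixes from Kafka")
  }
}
```
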
  • the new version of Kafka apparently replaces offsets with timestamps, which should be more useful
  • using Flink 1.2, which just came out…it supports Kafka 0.10
  • state management is better in 1.2
  • back to IntelliJ
  • Joe then progressed to the next job
  • need to get timestamp of each event
  • Joe commented that he initially didn't respect the boundary between event time and processing time…when data comes in, you need to tell Flink the timestamp and the format it is in
  • once the timestamp is obtained, you have an anchor of when the event took place
  • Joe didn't understand the intricacies of events that took place in the past, so he didn't get this job done
  • there are always two boundaries in Flink…the watermark is the latest point in time that Flink knows about
  • he didn't know how to simulate this with historical data
  • this job shows how to get the timestamp and how to set the watermark, but doesn't work completely as of yet
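
The timestamp-and-watermark wiring he was describing looks roughly like this in the Flink 1.2 Scala API. This is my sketch, not the workshop job: the cut-down Fix case class, the timestamp format, and the 5-minute out-of-orderness bound are all assumptions:

```scala
import java.text.SimpleDateFormat

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object EventTimeSketch {
  case class Fix(vid: Int, rt: String, tmstmp: String) // cut-down hypothetical fix

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // tell Flink we care about event time, not processing time
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    // a single sample element so the sketch runs stand-alone
    val fixes: DataStream[Fix] = env.fromElements(Fix(1958, "20", "20150211 23:59"))

    // extract the event timestamp from each fix and emit watermarks that trail the
    // largest timestamp seen by 5 minutes, tolerating out-of-order data
    val withTimestamps = fixes.assignTimestampsAndWatermarks(
      new BoundedOutOfOrdernessTimestampExtractor[Fix](Time.minutes(5)) {
        override def extractTimestamp(fix: Fix): Long =
          new SimpleDateFormat("yyyyMMdd HH:mm").parse(fix.tmstmp).getTime
      }
    )

    withTimestamps.print()
    env.execute("event time + watermark sketch")
  }
}
```
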
  • there are 3 different window types and different strategies as to how you can use them
  • the windowing feature in Flink is one of its greatest strengths
  • Kafka is starting to go in the same direction as Flink, but Flink has the jump on this by far
  • Boris said Kafka went in the opposite, simpler direction because it was too complex
  • we don't know who is going to win this contest, but the ideas will remain
  • try to follow all of the open source products in this space as to what they're trying to do, and you will get a good idea as to where the next 10 years will go
  • someone said that setting up Kafka was the hardest part…Boris argued that it is the easiest part (and I concur with Boris that setting up Kafka is simple)
  • the two things to take away are that there is a trigger that determines when the window fires, and that you can override it with your own trigger
  • it took Joe a long time to figure out how to do this
  • there are different ways to handle late data, out of order data…Flink handles these types of cases well
  • for example, you could create a 1 hour window, but keep it open a few more minutes to catch any stragglers
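
That "one-hour window held open a few extra minutes for stragglers" example maps directly onto Flink's allowedLateness setting. A sketch of mine, reusing the cut-down Fix type from above (the stream is assumed to already have timestamps and watermarks assigned; the 5-minute figure is arbitrary):

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

object LateDataSketch {
  case class Fix(vid: Int, rt: String, tmstmp: String) // cut-down hypothetical fix

  // count fixes per route in one-hour event-time windows, tolerating late data
  def fixesPerRoutePerHour(fixes: DataStream[Fix]): DataStream[(String, Int)] =
    fixes
      .map(fix => (fix.rt, 1))                             // (route, 1) pairs
      .keyBy(_._1)                                         // key the stream by route
      .window(TumblingEventTimeWindows.of(Time.hours(1)))  // one-hour event-time windows
      .allowedLateness(Time.minutes(5))                    // keep each window around 5 extra minutes for stragglers
      .sum(1)                                              // count fixes per route per window
}
```
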
  • next month's meetup will be grabbing CTA data in real time and applying windows against it
  • Flink will give you guarantees
  • if you are managing state, you can use RocksDB…it is optimized for this…RocksDB is the heavy hitter for managing state
  • Boris commented that RocksDB is local to the CPU, and it confused the heck out of him…people want to call this state, but it isn't state
  • you will run into a lot of trouble if you don't play by the rules that Flink sets, or try to use it for what it wasn't designed to do…they joked that this is because it's German
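
For reference, switching a job over to the RocksDB state backend is only a couple of lines on the environment (a sketch; the checkpoint path is a placeholder, and the flink-statebackend-rocksdb dependency has to be on the classpath):

```scala
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.streaming.api.scala._

object RocksDbSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // keep operator state in RocksDB, with durable checkpoint copies under this URI (placeholder)
    env.setStateBackend(new RocksDBStateBackend("file:///home/ubuntu/flink-checkpoints"))

    // checkpoint every 60 seconds so the backend actually has something to persist
    env.enableCheckpointing(60000)

    // ... define sources, transformations, sinks, and call env.execute() as usual ...
  }
}
```
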
  • Data Artisans wants to make Flink the database itself…a queryable state feature
  • the table is a stream…the stream is a table
  • Boris joked that nobody knows what this means, but they all want to do it…Boris said that the stream is not a database, it's a log
  • Boris said that his "head spins around" when folks start talking about "infinite streaming" etc
  • for debugging purposes, you can just run within IntelliJ using Gradle
  • someone asked about unit testing strategies, and he said that they are struggling with this right now
  • Boris disagreed…he said that people struggled with Hadoop a few years ago in a similar way
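
Going back to the "just run it from IntelliJ" point a couple of bullets up: launching a job's main method from the IDE runs it on an in-process mini cluster, and you can also ask for one explicitly. A tiny sketch:

```scala
import org.apache.flink.streaming.api.scala._

object LocalDebugSketch {
  def main(args: Array[String]): Unit = {
    // an explicit in-process environment -- handy for breakpoints and quick iterations
    val env = StreamExecutionEnvironment.createLocalEnvironment()

    env.fromElements("one fix", "another fix")
      .map(_.toUpperCase)
      .print()

    env.execute("local debug run")
  }
}
```
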
  • job #2 takeaways:
  • – another simple Flink job
  • – connect to source (JSON in Kafka)
  • – create a fix job from JSON
  • – identify the timestamp from the fix
  • – key the stream
  • – send fixes into windows based on key
  • – trigger the window by count or by timeout
  • – aggregate items in window on triggering
  • – sink aggregate results to PostgreSQL
  • – using a custom window
  • – using a custom trigger…preserving state
  • – using a custom aggregation
  • – sinking to a relational database
  • – run from within IntelliJ…useful for rapid debugging
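
To make the "fire on count or on timeout" and "sink to PostgreSQL" pieces concrete, here's my own much-simplified sketch of a custom trigger and a bare-bones JDBC sink. The class names, the count threshold, the table and column names, and the connection details are all assumptions, not Joe's code:

```scala
import java.sql.{Connection, DriverManager, PreparedStatement}

import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.api.common.state.ReducingStateDescriptor
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction
import org.apache.flink.streaming.api.windowing.triggers.{Trigger, TriggerResult}
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

// Fires (and purges) the window either when maxCount elements have arrived or when
// the window's own event-time timer goes off, whichever comes first.
class CountOrTimeoutTrigger[T](maxCount: Long) extends Trigger[T, TimeWindow] {

  private val countDesc = new ReducingStateDescriptor[java.lang.Long](
    "count",
    new ReduceFunction[java.lang.Long] {
      override def reduce(a: java.lang.Long, b: java.lang.Long): java.lang.Long = a + b
    },
    classOf[java.lang.Long])

  override def onElement(element: T, timestamp: Long, window: TimeWindow,
                         ctx: Trigger.TriggerContext): TriggerResult = {
    ctx.registerEventTimeTimer(window.maxTimestamp()) // the "timeout": the end of the window
    val count = ctx.getPartitionedState(countDesc)
    count.add(1L)
    if (count.get() >= maxCount) {
      count.clear()
      TriggerResult.FIRE_AND_PURGE // fire early on count
    } else {
      TriggerResult.CONTINUE
    }
  }

  override def onEventTime(time: Long, window: TimeWindow,
                           ctx: Trigger.TriggerContext): TriggerResult =
    if (time == window.maxTimestamp()) TriggerResult.FIRE_AND_PURGE // fire on timeout
    else TriggerResult.CONTINUE

  override def onProcessingTime(time: Long, window: TimeWindow,
                                ctx: Trigger.TriggerContext): TriggerResult =
    TriggerResult.CONTINUE

  override def clear(window: TimeWindow, ctx: Trigger.TriggerContext): Unit = {
    ctx.getPartitionedState(countDesc).clear()
    ctx.deleteEventTimeTimer(window.maxTimestamp())
  }
}

// Minimal PostgreSQL sink: one INSERT per aggregated (route, count) record.
class PostgresSink(url: String, user: String, password: String)
    extends RichSinkFunction[(String, Int)] {

  @transient private var conn: Connection = _
  @transient private var stmt: PreparedStatement = _

  override def open(parameters: Configuration): Unit = {
    conn = DriverManager.getConnection(url, user, password)
    stmt = conn.prepareStatement("INSERT INTO route_counts (rt, fixes) VALUES (?, ?)") // assumed table
  }

  override def invoke(value: (String, Int)): Unit = {
    stmt.setString(1, value._1)
    stmt.setInt(2, value._2)
    stmt.executeUpdate()
  }

  override def close(): Unit = {
    if (stmt != null) stmt.close()
    if (conn != null) conn.close()
  }
}
```

Wiring this up would then look like the windowed stream from the earlier lateness sketch with .trigger(new CountOrTimeoutTrigger[(String, Int)](1000)) inserted before the aggregation, and .addSink(new PostgresSink(url, user, password)) on the aggregated result; the 1000-element threshold is made up, and a real sink would batch its inserts rather than issue one per record.
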
  • Joe has Flink in production…doesn't know anyone else that does
  • Joe encouraged us to present at future meetup sessions, even if it is to present something small for 5 or 10 minutes
