Workshop Notes (August 27, 2016): End-to-End Streaming ML Recommendation Pipeline (Spark 2.0, Kafka, TensorFlow)
August 27, 2016
9:00 AM – 5:00 PM
From the promotional materials:
END-TO-END STREAMING ML RECOMMENDATION PIPELINE WORKSHOP
Learn to build an end-to-end, streaming recommendations pipeline using the latest streaming analytics tools inside a portable, take-home Docker Container in the cloud!
DATE Saturday, August 27, 2016
TIME 9:00 AM Central Standard Time
VENUE 2.03 Classroom
You’ll learn how to:
- Create a complete, end-to-end streaming data analytics pipeline
- Interactively analyze, approximate, and visualize streaming data
- Generate machine learning, graph & NLP recommendation models
- Productionize our ML models to serve real-time recommendations
- Perform a hybrid on-premise and cloud deployment using Docker
- Customize this workshop environment to your specific use cases
Agenda:
Part 1 (Analytics and Visualizations)
- Analytics and Visualizations Overview (Live Demo!)
- Verify Environment Setup (Docker)
- Notebooks (Zeppelin, Jupyter/iPython)
- Interactive Data Analytics (Spark SQL, Hive, Presto)
- Graph Analytics (Spark Graph, NetworkX, TitanDB)
- Time-series Analytics (Cassandra)
- Visualizations (Kibana, Matplotlib, D3)
- Approximate Queries (Spark SQL, Redis, Algebird)
- Workflow Management (AirFlow)
Part 2 (Streaming and Recommendations)
- Streaming and Recommendations Overview (Live Demo!)
- Streaming (NiFi, Kafka, Spark Streaming, Flink)
- Cluster-based Recommendation (Spark ML, Scikit-Learn)
- Graph-based Recommendation (Spark ML, Spark Graph)
- Collaborative-based Recommendation (Spark ML)
- NLP-based Recommendation (CoreNLP, NLTK)
- Geo-based Recommendation (ElasticSearch)
- Hybrid On-Premise+Cloud Auto-Scale Deploy (Docker)
- Customize the Workshop Environment for Your Use Cases
Target Audience:
- Interest in learning more about the streaming data pipelines that power their real-time machine learning models and visualizations
- Interest in building more intuition about machine learning, graph processing, natural language processing, statistical approximation techniques, and visualizations
- Interest in learning the practical applications of a modern, streaming data analytics and recommendations pipeline
- Anyone who wants to try 3D-printed PANCAKES!!
Prerequisites:
- Basic familiarity with Unix/Linux commands
- Experience in SQL, Java, Scala, Python, or R
- Basic familiarity with linear algebra concepts (dot product)
- Laptop with an ssh client and modern browser
- Every attendee will get their own fully-configured cloud instance running the entire environment
- At the end of the workshop, you will be able to save and download your environment to your local laptop in the form of a Docker image
My personal notes:
- the workshop description can be found at this link
- there were between 80 and 90 attendees at this event
- Chris had originally booked another location, but moved to 1871 due to the demand
- yes…I realize that the photo of Chris above isn't very clear…I was in the back row using my Android phone camera
- attendee backgrounds were mixed…some were from startups…the individuals in my area of the workshop work at Cloudera and Uptake
- Chris will be keynoting at Big Data Spain this coming November
- the PANCAKE STACK comes from folks in the Bay area, and replaces/updates an earlier stack
- about 35 technologies will be introduced during this workshop
- Chris asked everyone if they understood the limitations of Apache HiveQL
- Apache Spark now supports the SQL:2003 specification
- the workshop helpers today are Christina, Mark, and Ross
- Mark and Ross are former colleagues of Chris
- due to workshop time constraints, specific use cases will not be covered
- will talk about clustering (unsupervised; without labels) and classification (supervised; with labels)
- the whole point of building machine learning models is to provide recommendations, etc.
- Chris worked for Netflix from 2010-2012…work was all about recommendations and collaborative filtering
- the Netflix star rating system is useless due to all of the bias etc involved
- companies have now moved to implicit models…turning knobs to figure out the best models
- Chris recently added streaming recommendations to this workshop
- Netflix runs huge batch jobs at night right now, and then does small tweaks every 15 minutes throughout the day
- Chris sees Spark usage moving to SQL
- Netflix uses Spark as a faster Hadoop
- Spark Streaming does mini-batches over time…it is "definitely one of the crustier libraries in Spark" because it gets difficult to reason through
- Spark Streaming usually runs for about 7 hours or so before crashing
- some are starting to think of Spark as an application server like WebSphere, which is wrong!
- Chris used a 60-node cluster at Netflix…but it would crash after 4 days, and he would lose a lot of work
- this cluster was bare metal and wasn't reliable like the cloud
- Netflix was "fooled into thinking they could use Spark Streaming"
- they had a huge infrastructure to handle something that could just be done offline
- each Spark job takes at least 1 core
- if you keep adding servers, you can continue to add users…otherwise you need to wait for a core
- the highlight: don't think of Spark like an application server
- Chris just started using Spring Boot…I made a verbal comment that it's awesome, and he repeated what I said 🙂
- the cool thing about dating sites is that they do a lot of machine learning
- bernie.ai focuses on images of human faces
- comparing faces at the pixel level takes way too long due to the huge vectors involved…the trick is using PCA (principal component analysis)
- we'll use vectors of length 8 rather than vectors of length 2,500
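- a minimal sketch of the PCA trick, using scikit-learn (an assumption on my part…the workshop itself may use Spark ML's PCA; the 2,500 and 8 dimensions come from the bullets above):
import numpy as np
from sklearn.decomposition import PCA

# hypothetical data: 1,000 face images, each flattened to a 2,500-element pixel vector (50x50)
faces = np.random.rand(1000, 2500)

# project every face down to 8 principal components
pca = PCA(n_components=8)
faces_reduced = pca.fit_transform(faces)

# similarity comparisons now run on 8-dimensional vectors instead of 2,500-dimensional ones
print(faces_reduced.shape)  # (1000, 8)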
- we'll go through 5 different ways to do recommendations using the same data set…
- TopK (where you don't know anything about the user), collaborative filtering, user-to-user similarity, item-to-item similarity, etc.
- machine learning actually uses fairly fundamental concepts…turning letters etc to numbers
- the stack we're using is a simplified one…the full stack would take 4 days to work through rather than 1 day
- clicking on a face writes to Apache Kafka
- Kafka doesn't auto scale very well, so a lot of empty cycles throughout the day at Netflix
- Netflix is still a heavy Apache Cassandra user, but one thing you'll start seeing is Redis
- something called Dynomite (based on the Dynamo paper) is used via the Dyno client…this in-memory data is treated like a database (see the Redis sketch below)
- the TopK (top 10) movies are burned into the AMI in case things go wrong…it's not personalized due to the degraded state, but there's something
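- a minimal sketch of the degraded-state fallback idea using redis-py (the workshop path goes through Dynomite/Dyno, which is Java; the key name and scores here are hypothetical):
import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)

# write the non-personalized fallback as a sorted set of itemid -> score (redis-py 3.x zadd signature)
r.zadd('topk:items', {'10012': 42.0, '10013': 37.0, '90001': 35.0})

# degraded-state read path: top 10 items by score, highest first
print(r.zrevrange('topk:items', 0, 9, withscores=True))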
- Chris worked on the edge…on the streaming team at Netflix called "the edge service team"…the code that devices would talk to
- their streaming team depended on 30 or 40 other services…starting a movie etc
- recommendations was one of the internal services
- on the edge, you can actually batch multiple requests rather than do 1 call at a time
- these matrices get passed into a GPU to do matrix multiplication
- on the way back, you can then get vectors that are solutions to these matrices
- this was built on top of the batching capabilities of Netflix open source
- to do a search for something in GitHub, press "t"…in this case, search for "pipeline/kafkacassandra"
- Spark Streaming is just waiting for data coming in from Kafka
- Databricks stuck with Amazon Kinesis due to the issues with Kafka
- Chris has been trying to get people to think of Elasticsearch as a database rather than just a search engine
- Elasticsearch is very user friendly…it creates tables for you etc…but don't make a typo, or you'll be looking for your data
- Netflix is using Elasticsearch just for searches
- Elasticsearch should be thought of as an analytics engine
- the company renamed itself to just "elastic", dropping "search" from the name because of the strong association with search
- you can actually co-locate Elasticsearch and Spark partitions so that processing is done locally…called data locality
- Spark SQL has a Cassandra adapter
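- a minimal sketch of that adapter from PySpark, assuming the spark-cassandra-connector package is on the classpath (it is in the workshop image) and using the advancedspark.item_ratings table that shows up in the cqlsh session later in these notes:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().set("spark.cassandra.connection.host", "127.0.0.1")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# read the Cassandra table as a DataFrame via the connector's data source
ratings = (sqlContext.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="advancedspark", table="item_ratings")
    .load())

ratings.registerTempTable("item_ratings")
sqlContext.sql("SELECT itemid, COUNT(*) AS cnt FROM item_ratings GROUP BY itemid").show()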
- because Spark SQL is a generic framework, it needs to be fault tolerant
- the point is that Presto just does queries…like Apache Impala…super optimized for reading
- if a query is distributed across a bunch of nodes and one dies, the whole query is lost with tools like Impala
- "Spark SQL lives life in paranoia…it's always expecting things to go down"
- Spark SQL isn't as heavyweight as Hadoop
- with Spark SQL, there's a way to follow lineage back to the source…so it can recover from node failure
- Chris said he doesn't know Presto well…he knows how to tune Spark SQL very well
- Netflix uses Presto and "swears by it"…and it "completely blows away Spark SQL"…so this is why Chris included Presto in this workshop
- Chris will demo batch and real-time paths
- Chris joked that "errors are fine…don't get too scared of errors"
- demo.pipeline.io – "Spark After Dark", the application we are going to build
- if you leave this page and come back, the user ID will get regenerated…for our Docker instances, we need to swap in the numbers Chris assigns us
- Chris commented that he's the worst UI developer…someone had asked about the 3 actress / 3 actor limitation…and there's a bug with Firefox, so use Chrome
- tuples of user ID, actress/actor ID, rating, and IP address are passed in from Kafka (see the Spark Streaming sketch below)
- Spark Streaming code is "awful"…the "least intuitive code"
- Spark is batch…Spark Streaming is batch…it's just a small batch
- you might design for 500ms, but someone else might add another call…and now you're in an unstable state
- don't change existing jobs…just create new ones when changes are needed
- you really need to monitor and collect metrics when working with Spark Streaming
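- a minimal sketch of the Kafka to Spark Streaming path in PySpark 1.6 (the topic name and broker address are my assumptions, and I'm assuming each message value is the "userid,itemid,rating,ip" tuple described above):
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="ratings-stream")
ssc = StreamingContext(sc, 5)  # 5-second mini-batches…Spark Streaming is still batch, just small batches

stream = KafkaUtils.createDirectStream(
    ssc, ["item_ratings"], {"metadata.broker.list": "localhost:9092"})

# parse each message value into (userid, itemid, rating, ip)
rating_tuples = stream.map(lambda kv: kv[1].split(","))

def show_batch(rdd):
    for rec in rdd.take(5):
        print(rec)

rating_tuples.foreachRDD(show_batch)

ssc.start()
ssc.awaitTermination()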
- one big limitation with Spark Streaming is that it has 1 concept of time
- Apache Flink has 3 concepts of time…creation (event) time, time received (ingestion time), and "low watermark" time (as of such-and-such a time, you have received all messages)…Spark Streaming is far removed from this capability
- Flink is much like Apache Storm in that it can handle one record at a time…it also supports CEP (complex event processing)…there is no way to do this with Spark Streaming
- there are so many shops that have started down the Spark road…Flink is trying to do more ML and batch, and Spark is trying to do more real-time
- people want to use Flink, but they don't want to learn a new API etc
- Spark Streaming doesn't even have libraries for CEP etc
- "keep an eye on Project Beam"…"they're coming through the Valley quickly"
- Chris thought Project Beam was just a streaming API…it's a specification from Google, which has been slowly open sourcing projects
- Chris asked Google why they're doing this…open sourcing TensorFlow etc
- Google used to just release papers like the MapReduce paper
- Google typically releases specifications, but then someone releases a "shitty" implementation…so they just decided to start releasing code of their own
- Google has cloud services that they want you to start using…so if they can convince you to start using their higher level API, they can potentially slowly migrate you to other things
- "they're basically giving you pathways to migrate to their stuff"
- for example, Google will say to just use their Kubernetes product…and then they say, oh well, it's tied to TensorFlow etc
- Apache Zeppelin is accessible at http://demo.pipeline.io:3123
- this Zeppelin notebook is pulling data out of Cassandra
- Chris first talked through notebook "Live Recs/01: TopK and Summary Statistics (SQL, DataFrames)"
- he then pointed out the Dynomite call, and commented that he could have used Redis or Memcached
- there is no metadata about the items (actors / actresses)
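- a minimal sketch of the kind of TopK query notebook 01 covers, assuming the "ratings" DataFrame loaded from Cassandra in the earlier sketch:
# count ratings per item and keep the 10 most-rated items (no personalization, no metadata needed)
topk = (ratings
    .groupBy("itemid")
    .count()
    .orderBy("count", ascending=False)
    .limit(10))

topk.show()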
- …first break…
- after the break, Chris started by talking through notebook "Live Recs/02: User-to-Item Collaborative Filter Recs (ALS)"
- taking the matrix on the left…picking some value "k"…k means that you don't know what you're looking for…for example, k-means etc
- what we're doing here is low-rank matrix factorization
- we think there might be k latent features…features need to be numeric…we're taking something human (ratings) and turning it into numbers
- ALS (alternating least squares)
- all of these are approximations…taking 2 matrices that multiply together and get close enough to a snapshot
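- a minimal sketch of ALS in Spark ML, again assuming the "ratings" DataFrame from the Cassandra sketch (rank=8 stands in for the k latent features; implicitPrefs reflects the earlier point about implicit models):
from pyspark.ml.recommendation import ALS

als = ALS(rank=8, maxIter=10, regParam=0.1,
          userCol="userid", itemCol="itemid", ratingCol="rating",
          implicitPrefs=True)

# factor the (user x item) matrix into two low-rank matrices whose product approximates it
model = als.fit(ratings)

# score (userid, itemid) pairs with the learned factors
predictions = model.transform(ratings.select("userid", "itemid"))
predictions.show(5)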
- every Spark job that runs spins up a new cluster, so there is no multi-tenant thing
- these are different types of EC2 instances…these are CPU heavy etc
- each job has its own cluster…if there are errors, they can rule out other users
- forget what you hear…Netflix is all AWS…no Google!
- Netflix stopped playing the game of pitting cloud providers against each other…they're now fully committed to AWS
- Netflix came late to Spark…they came to a point where they wanted to have something faster than Hadoop…they wanted to use Python etc
- Netflix is leapfrogging over Spark Streaming
- Netflix has a central Kafka cluster…instead of Spark Streaming they're using Apache Samza
- Samza was a sister product to Kafka at LinkedIn
- Samza sits behind Kafka solely for routing…one of the destinations from Samza is Spark Streaming, used by support teams trying to get it to work
- at a Kafka conference, Chris heard a speaker say that they don't use Spark Streaming because "it's a piece of shit"
- Chris doesn't know anyone else who uses Samza other than Netflix and LinkedIn
- one of the workshop attendees asked about Apache NiFi, after Chris invited questions
- NiFi came from the NSA…Chris talks to the NiFi folks all the time…they are in Washington DC where his sister lives…Hortonworks acquired the company behind NiFi
- people don't understand where to use NiFi…it's intended as a producer to Kafka
- say you're trying to find the best route from point A to point B…it's like Google Maps, but for really high security such as what government officials might need
- the downside with NiFi is that once you leave the environment and move to Kafka, you lose data provenance
- cosine similarity…picture a 4-dimensional space with a vector for each user…by the angle between vectors you can figure out which users are similar to each other
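- a minimal numpy sketch of cosine similarity between two users' rating vectors (the 4 dimensions and the values are hypothetical):
import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between the vectors: 1.0 = same direction, 0.0 = orthogonal
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

user_a = np.array([5.0, 0.0, 3.0, 1.0])
user_b = np.array([4.0, 0.0, 4.0, 0.0])

print(cosine_similarity(user_a, user_b))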
- one way to recommend is to group items…for this example, we just know what was chosen…we don't know age, name etc
- it takes a human to continuously revise k and map the resulting clusters into categories…such as movie categories
- a lot of clusters end up being small (for example, Canadian horror films aimed at children between 10 and 13 years of age)…and you can filter these out in Netflix
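- a minimal sketch of clustering items with Spark ML KMeans, reusing the sqlContext from the Cassandra sketch (the item IDs come from the cqlsh output later in these notes, but the feature vectors and k are hypothetical; on the Spark 1.6 image the Vectors import is pyspark.mllib.linalg, in Spark 2.x it moves to pyspark.ml.linalg):
from pyspark.mllib.linalg import Vectors
from pyspark.ml.clustering import KMeans

# hypothetical item feature vectors (e.g. latent factors from ALS, shortened here)
items = sqlContext.createDataFrame([
    (10012, Vectors.dense([0.9, 0.1, 0.3])),
    (10013, Vectors.dense([0.8, 0.2, 0.2])),
    (90001, Vectors.dense([0.1, 0.9, 0.7])),
    (90013, Vectors.dense([0.2, 0.8, 0.6])),
], ["itemid", "features"])

kmeans = KMeans(k=2, seed=42)       # a human still has to choose (and keep revising) k
model = kmeans.fit(items)
model.transform(items).show()       # adds a "prediction" column = cluster assignment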
- the brute force way to compare items to items is by basically doing a Cartesian product…very computationally expensive, although also very parallelizable
- locality-sensitive hashing (LSH)…a pull request has been open against Spark for about 2 years, and it looks like it will finally land in Spark 2.1
- "the math is tricky" and Chris "doesn't fully understand it"…so he won't go through the math, just the practical applications of some of these algorithms
- Amazon Redshift HyperLogLog uses something like 16k to store approximate counts with only something like a 0.18% error rate
- the best machine learning folks are physicists, because they are not exact
- start thinking in terms of error bounds and time bounds
- BlinkDB…search for the word "partial" in the Spark source code…there is a whole package dedicated to time bound queries
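- a minimal sketch of the error-bound mindset in Spark SQL, using approxCountDistinct (HyperLogLog under the covers); it assumes the "ratings" DataFrame from earlier, and rsd is the target relative error:
from pyspark.sql import functions as F

ratings.agg(
    F.approxCountDistinct("userid", rsd=0.05).alias("approx_unique_users")).show()

# the exact equivalent, for comparison…much more expensive at scale
ratings.agg(F.countDistinct("userid").alias("exact_unique_users")).show()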
- …getting set up for Docker project…
- when the Docker script starts up, it does a git pull to grab the latest code…the image is 3 weeks old, but the latest code is from this morning
- https://github.com/fluxcapacitor/pipeline/wiki/Start-Docker-Environment
- Chris explained the following docker command: sudo docker run -itd --privileged --name pipeline --net=host -m 50g -e "SPRING_PROFILES_ACTIVE=local" fluxcapacitor/pipeline
- with "--privileged", you get additional access
- with "--net=host"…"docker networking is the most fucking confusing thing"…from a performance perspective this is extremely important…"don't tell the ops guys about any of this (root access etc), just do it!"
- "Spring is not cool…but Spring Boot is cool"
- "I feel like an idiot explaining tailing, but I've spoken in Seattle to a bunch of Windows guys before"
- GitHub has a 100MB limit for files, after which you need to switch to Git LFS (Large File Storage)
- the "jps" command is the same as "ps", but it only returns Java processes that are running
- an output snippet (some content removed) from my SSH session that shows versions of products being used…
...
declare -x AKKA_VERSION="2.3.11"
declare -x ALGEBIRD_VERSION="0.11.0"
declare -x ANKUR_PART_VERSION="0.1"
...
declare -x ATLAS_VERSION="1.4.5"
...
declare -x BAZEL_VERSION="0.2.2"
declare -x BETTER_FILES_VERSION="2.14.0"
...
declare -x CASSANDRA_VERSION="2.2.6"
declare -x CODAHALE_METRICS_VERSION="3.1.2"
declare -x COMMONS_DAEMON_VERSION="1.0.15"
...
declare -x CONFLUENT_VERSION="3.0.0"
...
declare -x DYNO_VERSION="1.4.6"
...
declare -x ELASTICSEARCH_VERSION="2.3.0"
declare -x FINAGLE_VERSION="6.34.0"
...
declare -x FLINK_VERSION="1.0.0"
declare -x GENSORT_VERSION="1.5"
declare -x GRAPHFRAMES_VERSION="0.1.0-spark1.6"
declare -x GUAVA_VERSION="14.0.1"
...
declare -x HADOOP_VERSION="2.6.0"
...
declare -x HIVE_VERSION="1.2.1"
...
declare -x HYSTRIX_DASHBOARD_VERSION="1.5.3"
declare -x HYSTRIX_VERSION="1.5.3"
declare -x INDEXEDRDD_VERSION="0.3"
declare -x JANINO_VERSION="2.7.8"
declare -x JAVA_HOME="/usr/lib/jvm/java-8-oracle"
declare -x JAVA_OPTS="-Xmx10G -XX:+CMSClassUnloadingEnabled"
declare -x JBLAS_VERSION="1.2.4"
declare -x JEDIS_VERSION="2.7.3"
...
declare -x JMETER_VERSION="3.0"
declare -x JPMML_SPARKML_VERSION="1.0.4"
declare -x JSON4S_VERSION="3.3.0"
declare -x KAFKA_CLIENT_VERSION="0.10.0.0"
...
declare -x KIBANA_VERSION="4.5.0"
...
declare -x LOGSTASH_VERSION="2.3.0"
...
declare -x MAXMIND_GEOIP_VERSION="2.5.0"
...
declare -x NIFI_HOME="/root/nifi-0.6.1"
declare -x NIFI_VERSION="0.6.1"
...
declare -x PATH="/root/pipeline/myapps/serving:/root/pipeline/myapps/serving/prediction:/root/dynomite:/root/apache-jmeter-3.0/bin:/root/titan-1.0.0-hadoop1/bin:/root/presto-server-0.137/bin:/root/airflow/bin:/root/flink-1.0.0/bin:/root/zeppelin-0.6.0/bin:/root/sbt/bin:/root/nifi-0.6.1/bin:/root/webdis:/root/redis-3.0.5/bin:/root/apache-hive-1.2.1-bin/bin:/root/hadoop-2.6.0/bin:/root/kibana-4.5.0-linux-x64/bin:/root/logstash-2.3.0/bin:/root/elasticsearch-2.3.0/bin:/root/confluent-3.0.0/bin:/root/confluent-3.0.0/bin:/root/spark-1.6.1-bin-fluxcapacitor/tachyon/bin:/root/spark-1.6.1-bin-fluxcapacitor/bin:/root/spark-1.6.1-bin-fluxcapacitor/sbin:/root/apache-cassandra-2.2.6/bin:/root/pipeline/bin/cli:/root/pipeline/bin/cluster:/root/pipeline/bin/docker:/root/pipeline/bin/initial:/root/pipeline/bin/kafka:/root/pipeline/bin/rest:/root/pipeline/bin/service:/root/pipeline/bin/util:/root/bazel-0.2.2/bin:/usr/lib/jvm/java-8-oracle/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
...
declare -x PMML_EVALUATOR_VERSION="1.2.14"
declare -x PMML_MODEL_METRO_VERSION="1.2.15"
declare -x PMML_MODEL_VERSION="1.2.15"
...
declare -x PRESTO_VERSION="0.137"
...
declare -x REDIS_VERSION="3.0.5"
declare -x SBT_ASSEMBLY_PLUGIN_VERSION="0.14.0"
...
declare -x SBT_OPTS="-Xmx10G -XX:+CMSClassUnloadingEnabled"
declare -x SBT_SPARK_PACKAGES_PLUGIN_VERSION="0.2.3"
declare -x SBT_VERSION="0.13.9"
declare -x SCALATEST_VERSION="2.2.4"
declare -x SCALA_MAJOR_VERSION="2.10"
declare -x SCALA_VERSION="2.10.5"
...
declare -x SHLVL="1"
declare -x SPARK_AVRO_CONNECTOR_VERSION="2.0.1"
declare -x SPARK_CASSANDRA_CONNECTOR_VERSION="1.4.0"
declare -x SPARK_CSV_CONNECTOR_VERSION="1.4.0"
declare -x SPARK_ELASTICSEARCH_CONNECTOR_VERSION="2.3.0.BUILD-SNAPSHOT"
declare -x SPARK_EXAMPLES_JAR="/root/spark-1.6.1-bin-fluxcapacitor/lib/spark-examples-1.6.1-hadoop2.6.0.jar"
...
declare -x SPARK_NIFI_CONNECTOR_VERSION="0.6.1"
declare -x SPARK_OTHER_VERSION="2.0.1-SNAPSHOT"
declare -x SPARK_REDIS_CONNECTOR_VERSION="0.2.0"
declare -x SPARK_REPOSITORIES="http://dl.bintray.com/spark-packages/maven,https://oss.sonatype.org/content/repositories/snapshots,https://repository.apache.org/content/groups/snapshots"
declare -x SPARK_SUBMIT_JARS="/root/pipeline/myapps/spark/redis/lib/spark-redis_2.10-0.2.0.jar,/usr/share/java/mysql-connector-java.jar,/root/pipeline/myapps/spark/ml/lib/spark-corenlp_2.10-0.1.jar,/root/pipeline/myapps/spark/ml/lib/stanford-corenlp-3.6.0-models.jar,/root/pipeline/myapps/spark/ml/target/scala-2.10/ml_2.10-1.0.jar,/root/pipeline/myapps/spark/sql/target/scala-2.10/sql_2.10-1.0.jar,/root/pipeline/myapps/spark/core/target/scala-2.10/core_2.10-1.0.jar,/root/pipeline/myapps/spark/streaming/target/scala-2.10/streaming_2.10-1.0.jar,/root/pipeline/myapps/serving/spark/target/scala-2.10/spark-serving_2.10-1.0.jar"
declare -x SPARK_SUBMIT_PACKAGES="tjhunter:tensorframes:0.2.2-s_2.10,com.maxmind.geoip2:geoip2:2.5.0,com.netflix.dyno:dyno-jedis:1.4.6,org.json4s:json4s-jackson_2.10:3.3.0,amplab:spark-indexedrdd:0.3,org.apache.spark:spark-streaming-kafka-assembly_2.10:1.6.1,org.elasticsearch:elasticsearch-spark_2.10:2.3.0.BUILD-SNAPSHOT,com.datastax.spark:spark-cassandra-connector_2.10:1.4.0,redis.clients:jedis:2.7.3,com.twitter:algebird-core_2.10:0.11.0,com.databricks:spark-avro_2.10:2.0.1,com.databricks:spark-csv_2.10:1.4.0,org.apache.nifi:nifi-spark-receiver:0.6.1,com.madhukaraphatak:java-sizeof_2.10:0.1,com.databricks:spark-xml_2.10:0.3.1,edu.stanford.nlp:stanford-corenlp:3.6.0,org.jblas:jblas:1.2.4,graphframes:graphframes:0.1.0-spark1.6"
declare -x SPARK_VERSION="1.6.1"
declare -x SPARK_XML_VERSION="0.3.1"
declare -x SPRING_BOOT_VERSION="1.3.5.RELEASE"
declare -x SPRING_CLOUD_VERSION="1.1.2.RELEASE"
declare -x SPRING_CORE_VERSION="4.3.0.RELEASE"
declare -x SPRING_PROFILES_ACTIVE="local"
declare -x STANFORD_CORENLP_VERSION="3.6.0"
...
declare -x TENSORFLOW_SERVING_VERSION="0.4.1"
declare -x TENSORFLOW_VERSION="0.9.0"
declare -x TENSORFRAMES_VERSION="0.2.2"
...
declare -x TITAN_VERSION="1.0.0-hadoop1"
...
declare -x ZEPPELIN_VERSION="0.6.0"
...
- Zeppelin says that it supports Spark 2.0, but it's broken!
- Spark defaults to port 4040, and will keep incrementing by 1 until it finds an open port…you'll get a warning each time…if there are 4 Spark applications running already, there will be 4 warning messages
- every Spark SQL query needs to go to the Hive metastore, which is what Apache Derby is being used for (a guy attending a previous workshop session left after Derby was mentioned)
- Chris mentioned a tool called "nbconvert" that "nobody knows about"…it gets past a huge blocking point for data scientists
- ZeppelinHub is based on JupyterHub
- since Zeppelin doesn't support Spark 2.0, just stick with Python…the Scala stuff is always prioritized for Spark
- you can now save and load ML models from Python…this was only possible from Scala before (see the sketch below)
- every notebook has a problem with Scala where inner classes get converted into long generated file names…there is a filename length limit
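- a minimal sketch of the Python save/load point above, assuming Spark 2.0 plus the ALS "model" and "ratings" DataFrame from the earlier sketches (the path is hypothetical):
from pyspark.ml.recommendation import ALSModel

model.save("/tmp/als-model")

# later (or from another notebook), reload it without retraining
reloaded = ALSModel.load("/tmp/als-model")
reloaded.transform(ratings.select("userid", "itemid")).show(5)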
- …lunch break…
- a workshop attendee asked about Apache Airflow…it is one of the least developed areas of the project we are working through during this workshop
- https://github.com/fluxcapacitor/pipeline/wiki/Demo-Code-Layout
- codegen uses Janino, which is used internally by Spark (Tungsten arrived in Spark 1.5 / 1.6…it knows how large your CPU caches are, etc.)
- Chris commented that Spark guys don't like PMML (predictive model markup language)…a workshop attendee had said that they used PPL in production, which Chris was surprised about since it's so new
- another workshop attendee asked about Kafka security…Chris said that Confluent implemented this in v0.9 or v0.10…if you're not using the Confluent version, you're missing a lot
- people at Netflix need to over-provision Kafka because it doesn't auto scale, and they have to essentially double this for if one region fails
- how do Kafka and Spark Streaming partitions work together? they don't!
- Chris is a big fan of Kafka Streams
- the "MLlib" piece of the architecture diagram doesn't use spark ML (which requires data frames)
- people have tended not to move to Kafka Streams because they are too dependent on Spark tooling…they have painted themselves into a corner
- http://keystone-ml.org/
- Chris is wary of untyped DataFrames (in contrast to typed Datasets) because type issues aren't found until runtime
- Zeppelin stores notebooks as JSON, which GitHub doesn't render…Jupyter notebooks do render
- notebooks are great for reproducibility because you can version the output as well
- https://github.com/fluxcapacitor/pipeline/wiki/Explore-Services
- rest-submit-sparkpi-job.sh…one of the most requested features…uses a "hidden" Spark REST API on port 6066…not publicly documented as part of the open source project, but it is there (see the sketch below)
- the official Databricks response to discussion of this API is that it is experimental
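- a hedged sketch of what rest-submit-sparkpi-job.sh presumably does against that hidden REST endpoint on port 6066; the field names follow the undocumented CreateSubmissionRequest format as I understand it, and the example jar path is the SPARK_EXAMPLES_JAR from the environment dump above:
import json
import requests

payload = {
    "action": "CreateSubmissionRequest",
    "clientSparkVersion": "1.6.1",
    "appResource": "/root/spark-1.6.1-bin-fluxcapacitor/lib/spark-examples-1.6.1-hadoop2.6.0.jar",
    "mainClass": "org.apache.spark.examples.SparkPi",
    "appArgs": ["10"],
    "environmentVariables": {"SPARK_ENV_LOADED": "1"},
    "sparkProperties": {
        "spark.master": "spark://127.0.0.1:7077",
        "spark.app.name": "SparkPi",
        "spark.jars": "/root/spark-1.6.1-bin-fluxcapacitor/lib/spark-examples-1.6.1-hadoop2.6.0.jar",
    },
}

resp = requests.post("http://127.0.0.1:6066/v1/submissions/create", data=json.dumps(payload))
submission_id = resp.json().get("submissionId")

# poll the (equally undocumented) status endpoint
print(requests.get("http://127.0.0.1:6066/v1/submissions/status/%s" % submission_id).json())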
- Spark Packages is managed by Databricks in a separate Maven repository…used alongside Apache Spark code
- another snippet from my SSH session, hitting Cassandra…
root@pipeline-training-v5-711x:~/pipeline# cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 2.2.6 | CQL spec 3.3.1 | Native protocol v4]
Use HELP for help.
cqlsh> use advancedspark;
cqlsh:advancedspark> SELECT userid, itemid, rating, timestamp, geocity FROM item_ratings where userid = 21619;
userid | itemid | rating | timestamp | geocity
--------+--------+--------+---------------+---------------
21619 | 10012 | 1 | 1465049318000 | San Francisco
21619 | 10013 | 1 | 1465049318000 | San Francisco
21619 | 10014 | 1 | 1465049318000 | San Francisco
21619 | 90001 | 1 | 1465049312000 | San Francisco
21619 | 90013 | 1 | 1465049314000 | San Francisco
21619 | 90015 | 1 | 1465049314000 | San Francisco
(6 rows)
cqlsh:advancedspark>
- …didn't take many notes after this point…getting exhausted…
- "pushing a canary" is essentially spinning up a single new instance amongst thousands of existing servers to monitor it and compare it before pushing out changes to other servers…only a small portion of users will be affected if it doesn't run well
- streaming algorithms are very difficult to deploy
- nobody knew about Janino until Spark 2.0 started using it…Janino is a super small, super fast Java compiler…a dynamic bytecode generator
- https://github.com/fluxcapacitor/pipeline/wiki/Streaming-Probabilistic-Algos
- with Tungsten, Spark debugging became much more complex…Spark knows its workload, so Tungsten bypasses Java garbage collection etc
- Algebird
- Zeppelin enables getting and putting variables between Scala and Python
- Chris said that "Machine Learning with Spark" is a good book (and it looks like a second edition will be released in November 2016)…he actually recruited the author for the IBM group that he is in right now
- Parquet data is stored by column rather than by row…ORC (Optimized Row Columnar) is another columnar storage format (see the Parquet sketch at the end of these notes)
- for some reason, Kafka is big on using Avro…it sounds like Avro messages don't include the actual schema, just a registered schema ID plus the data, which reduces message size
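- a minimal sketch of the columnar point above, writing the "ratings" DataFrame from the Cassandra sketch out as Parquet and reading a single column back (the path is hypothetical):
# write in columnar form
ratings.write.mode("overwrite").parquet("/tmp/item_ratings.parquet")

# a query that touches only "itemid" can skip the other columns entirely
parquet_df = sqlContext.read.parquet("/tmp/item_ratings.parquet")
parquet_df.select("itemid").distinct().show()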