Workshop Notes (August 27, 2016): End-to-End Streaming ML Recommendation Pipeline (Spark 2.0, Kafka, TensorFlow)
August 27, 2016
9:00 AM – 5:00 PM
From the promotional materials:
END-TO-END STREAMING ML RECOMMENDATION PIPELINE WORKSHOP
Learn to build an end-to-end, streaming recommendations pipeline using the latest streaming analytics tools inside a portable, take-home Docker Container in the cloud!
DATE Saturday, August 27, 2016
TIME 9:00 AM Central Standard Time
VENUE 2.03 Classroom
You’ll learn how to:
- Create a complete, end-to-end streaming data analytics pipeline
- Interactively analyze, approximate, and visualize streaming data
- Generate machine learning, graph & NLP recommendation models
- Productionize our ML models to serve real-time recommendations
- Perform a hybrid on-premise and cloud deployment using Docker
- Customize this workshop environment to your specific use cases
Agenda:
Part 1 (Analytics and Visualizations)
- Analytics and Visualizations Overview (Live Demo!)
- Verify Environment Setup (Docker)
- Notebooks (Zeppelin, Jupyter/iPython)
- Interactive Data Analytics (Spark SQL, Hive, Presto)
- Graph Analytics (Spark Graph, NetworkX, TitanDB)
- Time-series Analytics (Cassandra)
- Visualizations (Kibana, Matplotlib, D3)
- Approximate Queries (Spark SQL, Redis, Algebird)
- Workflow Management (AirFlow)
Part 2 (Streaming and Recommendations)
- Streaming and Recommendations Overview (Live Demo!)
- Streaming (NiFi, Kafka, Spark Streaming, Flink)
- Cluster-based Recommendation (Spark ML, Scikit-Learn)
- Graph-based Recommendation (Spark ML, Spark Graph)
- Collaborative-based Recommendation (Spark ML)
- NLP-based Recommendation (CoreNLP, NLTK)
- Geo-based Recommendation (ElasticSearch)
- Hybrid On-Premise+Cloud Auto-Scale Deploy (Docker)
- Customize the Workshop Environment for Your Use Cases
Target Audience:
- Interest in learning more about the streaming data pipelines that power their real-time machine learning models and visualizations
- Interest in building more intuition about machine learning, graph processing, natural language processing, statistical approximation techniques, and visualizations
- Interest in learning the practical applications of a modern, streaming data analytics and recommendations pipeline
- Anyone who wants to try 3D-printed PANCAKES!!
Prerequisites:
- Basic familiarity with Unix/Linux commands
- Experience in SQL, Java, Scala, Python, or R
- Basic familiarity with linear algebra concepts (dot product)
- Laptop with an ssh client and modern browser
- Every attendee will get their own fully-configured cloud instance running the entire environment
- At the end of the workshop, you will be able to save and download your environment to your local laptop in the form of a Docker image
My personal notes:
- the workshop description can be found at this link
- there were between 80 and 90 attendees at this event
- Chris had originally booked another location, but moved to 1871 due to the demand
- yes…I realize that the photo of Chris above isn't very clear…I was in the back row using my Android phone camera
- attendee backgrounds were mixed…some were from startups…the individuals in my area of the workshop work at Cloudera and Uptake
- Chris will be keynoting at Big Data Spain this coming November
- the PANCAKE STACK comes from folks in the Bay area, and replaces/updates an earlier stack
- about 35 technologies will be introduced during this workshop
- Chris asked everyone if they understood the limitations of Apache HiveQL
- Apache Spark now supports the SQL:2003 specification
- the workshop helpers today are Christina, Mark, and Ross
- Mark and Ross are former colleagues of Chris
- due to workshop time constraints, specific use cases will not be covered
- will talk about clustering (unsupervised; without labels) and classification (supervised; with labels)
- the whole point of building machine learning models is to provide recommendations, etc.
- Chris worked for Netflix from 2010-2012…work was all about recommendations and collaborative filtering
- the Netflix star rating system is useless due to all of the bias etc involved
- companies have now moved to implicit models…turning knobs to figure out the best models
- Chris recently added streaming recommendations to this workshop
- Netflix runs huge batch jobs at night right now, and then does small tweaks every 15 minutes throughout the day
- Chris sees Spark usage moving to SQL
- Netflix uses Spark as a faster Hadoop
- Spark Streaming does mini-batches over time…it is "definitely one of the crustier libraries in Spark" because it gets difficult to reason through
- Spark Streaming usually runs for about 7 hours or so before crashing
- some are starting to think of Spark as an application server like WebSphere, which is wrong!
- Chris used a 60-node cluster at Netflix…but it would crash after 4 days, and he would lose a lot of work
- this cluster was bare metal and wasn't reliable like the cloud
- Netflix was "fooled into thinking they could use Spark Streaming"
- they had a huge infrastructure to handle something that could just be done offline
- each Spark job takes at least 1 core
- if you keep adding servers, you can continue to add users…otherwise you need to wait for a core
- the highlight: don't think of Spark like an application server
- Chris just started using Spring Boot…I made a verbal comment that it's awesome, and he repeated what I said 🙂
- the cool thing about dating sites is that they do a lot of machine learning
- bernie.ai focuses on images of human faces
- comparing faces at the pixel level takes way too long due to the huge vectors involved…the trick is using PCA (principal component analysis)
- we'll use vectors of length 8 rather than vectors of length 2,500
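- a minimal sketch of the PCA trick, using scikit-learn (an assumption on my part…the workshop itself may use Spark ML's PCA; the 2,500 and 8 dimensions come from the bullets above):
import numpy as np
from sklearn.decomposition import PCA

# hypothetical data: 1,000 face images, each flattened to a 2,500-element pixel vector (50x50)
faces = np.random.rand(1000, 2500)

# project every face down to 8 principal components
pca = PCA(n_components=8)
faces_reduced = pca.fit_transform(faces)

# similarity comparisons now run on 8-dimensional vectors instead of 2,500-dimensional ones
print(faces_reduced.shape)  # (1000, 8)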
- we'll go through 5 different ways to do recommendations using the same data set…
- TopK (where you don't know anything about the user), collaborative filtering, user-to-user similarity, item-to-item similarity, etc.
- machine learning actually uses fairly fundamental concepts…turning letters etc to numbers
- the stack we're using is a simplified one…the full stack would take 4 days to work through rather than 1 day
- clicking on a face writes to Apache Kafka
- Kafka doesn't auto scale very well, so a lot of empty cycles throughout the day at Netflix
- Netflix is still a heavy Apache Cassandra user, but one thing you'll start seeing is Redis
- something called Dynomite (based on the Dynamo paper) is used via the Dyno client…this in-memory data is treated like a database (see the Redis sketch below)
- the TopK (top 10) movies are burned into the AMI in case things go wrong…it's not personalized due to the degraded state, but there's something
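- a minimal sketch of the degraded-state fallback idea using redis-py (the workshop path goes through Dynomite/Dyno, which is Java; the key name and scores here are hypothetical):
import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)

# write the non-personalized fallback as a sorted set of itemid -> score (redis-py 3.x zadd signature)
r.zadd('topk:items', {'10012': 42.0, '10013': 37.0, '90001': 35.0})

# degraded-state read path: top 10 items by score, highest first
print(r.zrevrange('topk:items', 0, 9, withscores=True))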
- Chris worked on the edge…on the streaming team at Netflix called "the edge service team"…the code that devices would talk to
- their streaming team depended on 30 or 40 other services…starting a movie etc
- recommendations was one of the internal services
- on the edge, you can actually batch multiple requests rather than do 1 call at a time
- these matrices get passed into a GPU to do matrix multiplication
- on the way back, you can then get vectors that are solutions to these matrices
- this was built on top of the batching capabilities of Netflix open source
- to do a search for something in GitHub, press "t"…in this case, search for "pipeline/kafkacassandra"
- Spark Streaming is just waiting for data coming in from Kafka
- Databricks stuck with Amazon Kinesis due to the issues with Kafka
- Chris has been trying to get people to think of Elasticsearch as a database rather than just a search engine
- Elasticsearch is very user friendly…it creates tables for you etc…but don't make a typo, or you'll be looking for your data
- Netflix is using Elasticsearch just for searches
- Elasticsearch should be thought of as an analytics engine
- the company renamed itself to just "elastic", dropping "search" from the name because of the strong association with search
- you can actually co-locate Elasticsearch and Spark partitions so that processing is done locally…called data locality
- Spark SQL has a Cassandra adapter
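- a minimal sketch of that adapter from PySpark, assuming the spark-cassandra-connector package is on the classpath (it is in the workshop image) and using the advancedspark.item_ratings table that shows up in the cqlsh session later in these notes:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().set("spark.cassandra.connection.host", "127.0.0.1")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# read the Cassandra table as a DataFrame via the connector's data source
ratings = (sqlContext.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="advancedspark", table="item_ratings")
    .load())

ratings.registerTempTable("item_ratings")
sqlContext.sql("SELECT itemid, COUNT(*) AS cnt FROM item_ratings GROUP BY itemid").show()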
- because Spark SQL is a generic framework, it needs to be fault tolerant
- the point is that Presto just does queries…like Apache Impala…super optimized for reading
- if a query is distributed across a bunch of nodes and one dies, the whole query is lost with tools like Impala
- "Spark SQL lives life in paranoia…it's always expecting things to go down"
- Spark SQL isn't as heavyweight as Hadoop
- with Spark SQL, there's a way to follow lineage back to the source…so it can recover from node failure
- Chris said he doesn't know Presto well…he knows how to tune Spark SQL very well
- Netflix uses Presto and "swears by it"…and it "completely blows away Spark SQL"…so this is why Chris included Presto in this workshop
- Chris will demo batch and real-time paths
- Chris joked that "errors are fine…don't get too scared of errors"
- demo.pipeline.io – "Spark After Dark", the application we are going to build
- if you leave this page and come back, the user ID will get regenerated…for our Docker instances, we need to swap in the numbers Chris assigns us
- Chris commented that he's the worst UI developer…someone had asked about the 3 actress / 3 actor limitation…and there's a bug with Firefox, so use Chrome
- tuples of user ID, actress/actor ID, rating, and IP address are passed in from Kafka (see the Spark Streaming sketch below)
- Spark Streaming code is "awful"…the "least intuitive code"
- Spark is batch…Spark Streaming is batch…it's just a small batch
- you might design for 500ms, but someone else might add another call…and now you're in an unstable state
- don't change existing jobs…just create new ones when changes are needed
- you really need to monitor and collect metrics when working with Spark Streaming
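- a minimal sketch of the Kafka to Spark Streaming path in PySpark 1.6 (the topic name and broker address are my assumptions, and I'm assuming each message value is the "userid,itemid,rating,ip" tuple described above):
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="ratings-stream")
ssc = StreamingContext(sc, 5)  # 5-second mini-batches…Spark Streaming is still batch, just small batches

stream = KafkaUtils.createDirectStream(
    ssc, ["item_ratings"], {"metadata.broker.list": "localhost:9092"})

# parse each message value into (userid, itemid, rating, ip)
rating_tuples = stream.map(lambda kv: kv[1].split(","))

def show_batch(rdd):
    for rec in rdd.take(5):
        print(rec)

rating_tuples.foreachRDD(show_batch)

ssc.start()
ssc.awaitTermination()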
- one big limitation with Spark Streaming is that it has 1 concept of time
- Apache Flink has 3 concepts of time…creation (event) time, time received (ingestion time), and "low watermark" time (as of such-and-such a time, you have received all messages)…Spark Streaming is far removed from this capability
- Flink is much like Apache Storm in that it can handle one record at a time…it also supports CEP (complex event processing)…there is no way to do this with Spark Streaming
- there are so many shops that have started down the Spark road…Flink is trying to do more ML and batch, and Spark is trying to do more real-time
- people want to use Flink, but they don't want to learn a new API etc
- Spark Streaming doesn't even have libraries for CEP etc
- "keep an eye on Project Beam"…"they're coming through the Valley quickly"
- Chris thought Project Beam was just a streaming API…it's a specification from Google, which has been slowly open sourcing projects
- Chris asked Google why they're doing this…open sourcing TensorFlow etc
- Google used to just release papers like the MapReduce paper
- Google typically releases specifications, but then someone releases a "shitty" implementation…so they just decided to start releasing code of their own
- Google has cloud services that they want you to start using…so if they can convince you to start using their higher level API, they can potentially slowly migrate you to other things
- "they're basically giving you pathways to migrate to their stuff"
- for example, Google will say to just use their Kubernetes product…and then they say, oh well, it's tied to TensorFlow etc
- Apache Zeppelin is accessible at http://demo.pipeline.io:3123
- this Zeppelin notebook is pulling data out of Cassandra
- Chris first talked through notebook "Live Recs/01: TopK and Summary Statistics (SQL, DataFrames)"
- he then pointed out the Dynomite call, and commented that he could have used Redis or Memcached
- there is no metadata about the items (actors / actresses)
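- a minimal sketch of the kind of TopK query notebook 01 covers, assuming the "ratings" DataFrame loaded from Cassandra in the earlier sketch:
# count ratings per item and keep the 10 most-rated items (no personalization, no metadata needed)
topk = (ratings
    .groupBy("itemid")
    .count()
    .orderBy("count", ascending=False)
    .limit(10))

topk.show()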
- …first break…
- after the break, Chris started by talking through notebook "Live Recs/02: User-to-Item Collaborative Filter Recs (ALS)"
- taking the matrix on the left…picking some value "k"…k means that you don't know what you're looking for…for example, k-means etc
- what we're doing here is low-rank matrix factorization
- we think there might be k latent features…features need to be numeric…we're taking something human (ratings) and turning it into numbers
- ALS (alternating least squares)
- all of these are approximations…taking 2 matrices that multiply together and get close enough to a snapshot
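- a minimal sketch of ALS in Spark ML, again assuming the "ratings" DataFrame from the Cassandra sketch (rank=8 stands in for the k latent features; implicitPrefs reflects the earlier point about implicit models):
from pyspark.ml.recommendation import ALS

als = ALS(rank=8, maxIter=10, regParam=0.1,
          userCol="userid", itemCol="itemid", ratingCol="rating",
          implicitPrefs=True)

# factor the (user x item) matrix into two low-rank matrices whose product approximates it
model = als.fit(ratings)

# score (userid, itemid) pairs with the learned factors
predictions = model.transform(ratings.select("userid", "itemid"))
predictions.show(5)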
- every Spark job that runs spins up a new cluster, so there is no multi-tenant thing
- these are different types of EC2 instances…these are CPU heavy etc
- each job has its own cluster…if there are errors, they can rule out other users
- forget what you hear…Netflix is all AWS…no Google!
- Netflix stopped playing the game of pitting cloud providers against each other…they're now fully committed to AWS
- Netflix came late to Spark…they came to a point where they wanted to have something faster than Hadoop…they wanted to use Python etc
- Netflix is leapfrogging over Spark Streaming
- Netflix has a central Kafka cluster…instead of Spark Streaming they're using Apache Samza
- Samza was a sister product to Kafka at LinkedIn
- Samza sits behind Kafka solely for routing…one of the destinations from Samza is Spark Streaming, used by support teams trying to get it to work
- at a Kafka conference, Chris heard a speaker say that they don't use Spark Streaming because "it's a piece of shit"
- Chris doesn't know anyone else who uses Samza other than Netflix and LinkedIn
- one of the workshop attendees asked about Apache NiFi, after Chris invited questions
- NiFi came from the NSA…Chris talks to the NiFi folks all the time…they are in Washington DC where his sister lives…Hortonworks acquired the company behind NiFi
- people don't understand where to use NiFi…it's intended as a producer to Kafka
- say you're trying to find the best route from point A to point B…it's like Google Maps, but for really high security such as what government officials might need
- the downside with NiFi is that once you leave the environment and move to Kafka, you lose data provenance
- cosine similarity…picture a 4-dimensional space with a vector for each user…by the angle between vectors you can figure out which users are similar to each other
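- a minimal numpy sketch of cosine similarity between two users' rating vectors (the 4 dimensions and the values are hypothetical):
import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between the vectors: 1.0 = same direction, 0.0 = orthogonal
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

user_a = np.array([5.0, 0.0, 3.0, 1.0])
user_b = np.array([4.0, 0.0, 4.0, 0.0])

print(cosine_similarity(user_a, user_b))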
- one way to recommend is to group items…for this example, we just know what was chosen…we don't know age, name etc
- it takes a human to continuously revise k and map the resulting clusters into categories…such as movie categories
- a lot of clusters end up being small (for example, Canadian horror films aimed at children between 10 and 13 years of age)…and you can filter these out in Netflix
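- a minimal sketch of clustering items with Spark ML KMeans, reusing the sqlContext from the Cassandra sketch (the item IDs come from the cqlsh output later in these notes, but the feature vectors and k are hypothetical; on the Spark 1.6 image the Vectors import is pyspark.mllib.linalg, in Spark 2.x it moves to pyspark.ml.linalg):
from pyspark.mllib.linalg import Vectors
from pyspark.ml.clustering import KMeans

# hypothetical item feature vectors (e.g. latent factors from ALS, shortened here)
items = sqlContext.createDataFrame([
    (10012, Vectors.dense([0.9, 0.1, 0.3])),
    (10013, Vectors.dense([0.8, 0.2, 0.2])),
    (90001, Vectors.dense([0.1, 0.9, 0.7])),
    (90013, Vectors.dense([0.2, 0.8, 0.6])),
], ["itemid", "features"])

kmeans = KMeans(k=2, seed=42)       # a human still has to choose (and keep revising) k
model = kmeans.fit(items)
model.transform(items).show()       # adds a "prediction" column = cluster assignment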
- the brute force way to compare items to items is by basically doing a Cartesian product…very computationally expensive, although also very parallelizable
- locality-sensitive hashing (LSH)…a pull request has been open against Spark for about 2 years, and it looks like it will finally land in Spark 2.1
- "the math is tricky" and Chris "doesn't fully understand it"…so he won't go through the math, just the practical applications of some of these algorithms
- Amazon Redshift HyperLogLog uses something like 16k to store approximate counts with only something like a 0.18% error rate
- the best machine learning folks are physicists, because they are not exact
- start thinking in terms of error bounds and time bounds
- BlinkDB…search for the word "partial" in the Spark source code…there is a whole package dedicated to time bound queries
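- a minimal sketch of the error-bound mindset in Spark SQL, using approxCountDistinct (HyperLogLog under the covers); it assumes the "ratings" DataFrame from earlier, and rsd is the target relative error:
from pyspark.sql import functions as F

ratings.agg(
    F.approxCountDistinct("userid", rsd=0.05).alias("approx_unique_users")).show()

# the exact equivalent, for comparison…much more expensive at scale
ratings.agg(F.countDistinct("userid").alias("exact_unique_users")).show()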
- …getting set up for Docker project…
- when the Docker script starts up, it does a git pull to grab the latest code…the image is 3 weeks old, but the latest code is from this morning
- https://github.com/fluxcapacitor/pipeline/wiki/Start-Docker-Environment
- Chris explained the following docker command: sudo docker run -itd --privileged --name pipeline --net=host -m 50g -e "SPRING_PROFILES_ACTIVE=local" fluxcapacitor/pipeline
- with "--privileged", you get additional access
- with "--net=host"…"docker networking is the most fucking confusing thing"…from a performance perspective this is extremely important…"don't tell the ops guys about any of this (root access etc), just do it!"
- "Spring is not cool…but Spring Boot is cool"
- "I feel like an idiot explaining tailing, but I've spoken in Seattle to a bunch of Windows guys before"
- GitHub has a 100MB limit for files, after which you need to switch to Git LFS (Large File Storage)
- the "jps" command is the same as "ps", but it only returns Java processes that are running
- an output snippet (some content removed) from my SSH session that shows versions of products being used…
...
declare -x AKKA_VERSION="2.3.11"
declare -x ALGEBIRD_VERSION="0.11.0"
declare -x ANKUR_PART_VERSION="0.1"
...
declare -x ATLAS_VERSION="1.4.5"
...
declare -x BAZEL_VERSION="0.2.2"
declare -x BETTER_FILES_VERSION="2.14.0"
...
declare -x CASSANDRA_VERSION="2.2.6"
declare -x CODAHALE_METRICS_VERSION="3.1.2"
declare -x COMMONS_DAEMON_VERSION="1.0.15"
...
declare -x CONFLUENT_VERSION="3.0.0"
...
declare -x DYNO_VERSION="1.4.6"
...
declare -x ELASTICSEARCH_VERSION="2.3.0"
declare -x FINAGLE_VERSION="6.34.0"
...
declare -x FLINK_VERSION="1.0.0"
declare -x GENSORT_VERSION="1.5"
declare -x GRAPHFRAMES_VERSION="0.1.0-spark1.6"
declare -x GUAVA_VERSION="14.0.1"
...
declare -x HADOOP_VERSION="2.6.0"
...
declare -x HIVE_VERSION="1.2.1"
...
declare -x HYSTRIX_DASHBOARD_VERSION="1.5.3"
declare -x HYSTRIX_VERSION="1.5.3"
declare -x INDEXEDRDD_VERSION="0.3"
declare -x JANINO_VERSION="2.7.8"
declare -x JAVA_HOME="/usr/lib/jvm/java-8-oracle"
declare -x JAVA_OPTS="-Xmx10G -XX:+CMSClassUnloadingEnabled"
declare -x JBLAS_VERSION="1.2.4"
declare -x JEDIS_VERSION="2.7.3"
...
declare -x JMETER_VERSION="3.0"
declare -x JPMML_SPARKML_VERSION="1.0.4"
declare -x JSON4S_VERSION="3.3.0"
declare -x KAFKA_CLIENT_VERSION="0.10.0.0"
...
declare -x KIBANA_VERSION="4.5.0"
...
declare -x LOGSTASH_VERSION="2.3.0"
...
declare -x MAXMIND_GEOIP_VERSION="2.5.0"
...
declare -x NIFI_HOME="/root/nifi-0.6.1"
declare -x NIFI_VERSION="0.6.1"
...
declare -x PATH="/root/pipeline/myapps/serving:/root/pipeline/myapps/serving/prediction:/root/dynomite:/root/apache-jmeter-3.0/bin:/root/titan-1.0.0-hadoop1/bin:/root/presto-server-0.137/bin:/root/airflow/bin:/root/flink-1.0.0/bin:/root/zeppelin-0.6.0/bin:/root/sbt/bin:/root/nifi-0.6.1/bin:/root/webdis:/root/redis-3.0.5/bin:/root/apache-hive-1.2.1-bin/bin:/root/hadoop-2.6.0/bin:/root/kibana-4.5.0-linux-x64/bin:/root/logstash-2.3.0/bin:/root/elasticsearch-2.3.0/bin:/root/confluent-3.0.0/bin:/root/confluent-3.0.0/bin:/root/spark-1.6.1-bin-fluxcapacitor/tachyon/bin:/root/spark-1.6.1-bin-fluxcapacitor/bin:/root/spark-1.6.1-bin-fluxcapacitor/sbin:/root/apache-cassandra-2.2.6/bin:/root/pipeline/bin/cli:/root/pipeline/bin/cluster:/root/pipeline/bin/docker:/root/pipeline/bin/initial:/root/pipeline/bin/kafka:/root/pipeline/bin/rest:/root/pipeline/bin/service:/root/pipeline/bin/util:/root/bazel-0.2.2/bin:/usr/lib/jvm/java-8-oracle/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
...
declare -x PMML_EVALUATOR_VERSION="1.2.14"
declare -x PMML_MODEL_METRO_VERSION="1.2.15"
declare -x PMML_MODEL_VERSION="1.2.15"
...
declare -x PRESTO_VERSION="0.137"
...
declare -x REDIS_VERSION="3.0.5"
declare -x SBT_ASSEMBLY_PLUGIN_VERSION="0.14.0"
...
declare -x SBT_OPTS="-Xmx10G -XX:+CMSClassUnloadingEnabled"
declare -x SBT_SPARK_PACKAGES_PLUGIN_VERSION="0.2.3"
declare -x SBT_VERSION="0.13.9"
declare -x SCALATEST_VERSION="2.2.4"
declare -x SCALA_MAJOR_VERSION="2.10"
declare -x SCALA_VERSION="2.10.5"
...
declare -x SHLVL="1"
declare -x SPARK_AVRO_CONNECTOR_VERSION="2.0.1"
declare -x SPARK_CASSANDRA_CONNECTOR_VERSION="1.4.0"
declare -x SPARK_CSV_CONNECTOR_VERSION="1.4.0"
declare -x SPARK_ELASTICSEARCH_CONNECTOR_VERSION="2.3.0.BUILD-SNAPSHOT"
declare -x SPARK_EXAMPLES_JAR="/root/spark-1.6.1-bin-fluxcapacitor/lib/spark-examples-1.6.1-hadoop2.6.0.jar"
...
declare -x SPARK_NIFI_CONNECTOR_VERSION="0.6.1"
declare -x SPARK_OTHER_VERSION="2.0.1-SNAPSHOT"
declare -x SPARK_REDIS_CONNECTOR_VERSION="0.2.0"
declare -x SPARK_REPOSITORIES="http://dl.bintray.com/spark-packages/maven,https://oss.sonatype.org/content/repositories/snapshots,https://repository.apache.org/content/groups/snapshots"
declare -x SPARK_SUBMIT_JARS="/root/pipeline/myapps/spark/redis/lib/spark-redis_2.10-0.2.0.jar,/usr/share/java/mysql-connector-java.jar,/root/pipeline/myapps/spark/ml/lib/spark-corenlp_2.10-0.1.jar,/root/pipeline/myapps/spark/ml/lib/stanford-corenlp-3.6.0-models.jar,/root/pipeline/myapps/spark/ml/target/scala-2.10/ml_2.10-1.0.jar,/root/pipeline/myapps/spark/sql/target/scala-2.10/sql_2.10-1.0.jar,/root/pipeline/myapps/spark/core/target/scala-2.10/core_2.10-1.0.jar,/root/pipeline/myapps/spark/streaming/target/scala-2.10/streaming_2.10-1.0.jar,/root/pipeline/myapps/serving/spark/target/scala-2.10/spark-serving_2.10-1.0.jar"
declare -x SPARK_SUBMIT_PACKAGES="tjhunter:tensorframes:0.2.2-s_2.10,com.maxmind.geoip2:geoip2:2.5.0,com.netflix.dyno:dyno-jedis:1.4.6,org.json4s:json4s-jackson_2.10:3.3.0,amplab:spark-indexedrdd:0.3,org.apache.spark:spark-streaming-kafka-assembly_2.10:1.6.1,org.elasticsearch:elasticsearch-spark_2.10:2.3.0.BUILD-SNAPSHOT,com.datastax.spark:spark-cassandra-connector_2.10:1.4.0,redis.clients:jedis:2.7.3,com.twitter:algebird-core_2.10:0.11.0,com.databricks:spark-avro_2.10:2.0.1,com.databricks:spark-csv_2.10:1.4.0,org.apache.nifi:nifi-spark-receiver:0.6.1,com.madhukaraphatak:java-sizeof_2.10:0.1,com.databricks:spark-xml_2.10:0.3.1,edu.stanford.nlp:stanford-corenlp:3.6.0,org.jblas:jblas:1.2.4,graphframes:graphframes:0.1.0-spark1.6"
declare -x SPARK_VERSION="1.6.1"
declare -x SPARK_XML_VERSION="0.3.1"
declare -x SPRING_BOOT_VERSION="1.3.5.RELEASE"
declare -x SPRING_CLOUD_VERSION="1.1.2.RELEASE"
declare -x SPRING_CORE_VERSION="4.3.0.RELEASE"
declare -x SPRING_PROFILES_ACTIVE="local"
declare -x STANFORD_CORENLP_VERSION="3.6.0"
...
declare -x TENSORFLOW_SERVING_VERSION="0.4.1"
declare -x TENSORFLOW_VERSION="0.9.0"
declare -x TENSORFRAMES_VERSION="0.2.2"
...
declare -x TITAN_VERSION="1.0.0-hadoop1"
...
declare -x ZEPPELIN_VERSION="0.6.0"
...
- Zeppelin says that it supports Spark 2.0, but it's broken!
- Spark defaults to port 4040, and will keep incrementing by 1 until it finds an open port…you'll get a warning each time…if there are 4 Spark applications running already, there will be 4 warning messages
- every Spark SQL query needs to go to the Hive metastore, which is what Apache Derby is being used for (a guy attending a previous workshop session left after Derby was mentioned)
- Chris mentioned a tool called "nbconvert" that "nobody knows about"…it gets past a huge blocking point for data scientists
- ZeppelinHub is based on JupyterHub
- since Zeppelin doesn't support Spark 2.0, just stick with Python…the Scala stuff is always prioritized for Spark
- you can now save and load ML models from Python…this was only possible from Scala before (see the sketch below)
- every notebook has a problem with Scala where inner classes get converted into long generated file names…there is a filename length limit
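- a minimal sketch of the Python save/load point above, assuming Spark 2.0 plus the ALS "model" and "ratings" DataFrame from the earlier sketches (the path is hypothetical):
from pyspark.ml.recommendation import ALSModel

model.save("/tmp/als-model")

# later (or from another notebook), reload it without retraining
reloaded = ALSModel.load("/tmp/als-model")
reloaded.transform(ratings.select("userid", "itemid")).show(5)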
- …lunch break…
- a workshop attendee asked about Apache Airflow…it is one of the least developed areas of the project we are working through during this workshop
- https://github.com/fluxcapacitor/pipeline/wiki/Demo-Code-Layout
- codegen uses Janino, which is used internally by Spark (Tungsten arrived in Spark 1.5 / 1.6…it knows how large your CPU caches are, etc.)
- Chris commented that Spark guys don't like PMML (predictive model markup language)…a workshop attendee had said that they used PPL in production, which Chris was surprised about since it's so new
- another workshop attendee asked about Kafka security…Chris said that Confluent implemented this in v0.9 or v0.10…if you're not using the Confluent version, you're missing a lot
- people at Netflix need to over-provision Kafka because it doesn't auto scale, and they have to essentially double this for if one region fails
- how do Kafka and Spark Streaming partitions work together? they don't!
- Chris is a big fan of Kafka Streams
- the "MLlib" piece of the architecture diagram doesn't use spark ML (which requires data frames)
- people have tended not to move to Kafka Streams because they are too dependent on Spark tooling…they have painted themselves into a corner
- http://keystone-ml.org/
- Chris is wary of untyped DataFrames (in contrast to typed Datasets) because type issues aren't found until runtime
- Zeppelin stores notebooks as JSON, which GitHub doesn't render…Jupyter notebooks do render
- notebooks are great for reproducibility because you can version the output as well
- https://github.com/fluxcapacitor/pipeline/wiki/Explore-Services
- rest-submit-sparkpi-job.sh…one of the most requested features…uses a "hidden" Spark REST API on port 6066…not publicly documented as part of the open source project, but it is there (see the sketch below)
- the official Databricks response to discussion of this API is that it is experimental
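- a hedged sketch of what rest-submit-sparkpi-job.sh presumably does against that hidden REST endpoint on port 6066; the field names follow the undocumented CreateSubmissionRequest format as I understand it, and the example jar path is the SPARK_EXAMPLES_JAR from the environment dump above:
import json
import requests

payload = {
    "action": "CreateSubmissionRequest",
    "clientSparkVersion": "1.6.1",
    "appResource": "/root/spark-1.6.1-bin-fluxcapacitor/lib/spark-examples-1.6.1-hadoop2.6.0.jar",
    "mainClass": "org.apache.spark.examples.SparkPi",
    "appArgs": ["10"],
    "environmentVariables": {"SPARK_ENV_LOADED": "1"},
    "sparkProperties": {
        "spark.master": "spark://127.0.0.1:7077",
        "spark.app.name": "SparkPi",
        "spark.jars": "/root/spark-1.6.1-bin-fluxcapacitor/lib/spark-examples-1.6.1-hadoop2.6.0.jar",
    },
}

resp = requests.post("http://127.0.0.1:6066/v1/submissions/create", data=json.dumps(payload))
submission_id = resp.json().get("submissionId")

# poll the (equally undocumented) status endpoint
print(requests.get("http://127.0.0.1:6066/v1/submissions/status/%s" % submission_id).json())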
- Spark Packages is managed by Databricks in a separate Maven repository…used alongside Apache Spark code
- another snippet from my SSH session, hitting Cassandra…
root@pipeline-training-v5-711x:~/pipeline# cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 2.2.6 | CQL spec 3.3.1 | Native protocol v4]
Use HELP for help.
cqlsh> use advancedspark;
cqlsh:advancedspark> SELECT userid, itemid, rating, timestamp, geocity FROM item_ratings where userid = 21619;
userid | itemid | rating | timestamp | geocity
--------+--------+--------+---------------+---------------
21619 | 10012 | 1 | 1465049318000 | San Francisco
21619 | 10013 | 1 | 1465049318000 | San Francisco
21619 | 10014 | 1 | 1465049318000 | San Francisco
21619 | 90001 | 1 | 1465049312000 | San Francisco
21619 | 90013 | 1 | 1465049314000 | San Francisco
21619 | 90015 | 1 | 1465049314000 | San Francisco
(6 rows)
cqlsh:advancedspark>
- …didn't take many notes after this point…getting exhausted…
- "pushing a canary" is essentially spinning up a single new instance amongst thousands of existing servers to monitor it and compare it before pushing out changes to other servers…only a small portion of users will be affected if it doesn't run well
- streaming algorithms are very difficult to deploy
- nobody knew about Janino until Spark 2.0 started using it…Janino is a super small, super fast Java compiler…a dynamic bytecode generator
- https://github.com/fluxcapacitor/pipeline/wiki/Streaming-Probabilistic-Algos
- with Tungsten, Spark debugging became much more complex…Spark knows its workload, so Tungsten bypasses Java garbage collection etc
- Algebird
- Zeppelin enables getting and putting variables between Scala and Python
- Chris said that "Machine Learning with Spark" is a good book (and it looks like a second edition will be released in November 2016)…he actually recruited the author for the IBM group that he is in right now
- Parquet data is stored by column rather than by row…ORC (Optimized Row Columnar) is another columnar storage format (see the Parquet sketch at the end of these notes)
- for some reason, Kafka is big on using Avro…it sounds like Avro messages don't include the actual schema, just a registered schema ID plus the data, which reduces message size
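- a minimal sketch of the columnar point above, writing the "ratings" DataFrame from the Cassandra sketch out as Parquet and reading a single column back (the path is hypothetical):
# write in columnar form
ratings.write.mode("overwrite").parquet("/tmp/item_ratings.parquet")

# a query that touches only "itemid" can skip the other columns entirely
parquet_df = sqlContext.read.parquet("/tmp/item_ratings.parquet")
parquet_df.select("itemid").distinct().show()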