You, This Course and Us
  • You, This Course and Us
  • Course Materials
Introduction to Spark
  • What does Donald Rumsfeld have to do with data analysis?
  • Why is Spark so cool?
  • An introduction to RDDs - Resilient Distributed Datasets
  • Built-in libraries for Spark
  • Installing Spark
  • The PySpark Shell
  • Transformations and Actions
  • See it in Action : Munging Airlines Data with PySpark - I
  • [For Linux/Mac OS Shell Newbies] Path and other Environment Variables
Resilient Distributed Datasets
  • RDD Characteristics: Partitions and Immutability
  • RDD Characteristics: Lineage, RDDs know where they came from
  • What can you do with RDDs?
  • Create your first RDD from a file
  • Average distance travelled by a flight using map() and reduce() operations
  • Get delayed flights using filter(), cache data using persist()
  • Average flight delay in one step using aggregate()
  • Frequency histogram of delays using countByValue()
  • See it in Action : Analyzing Airlines Data with PySpark - II
Advanced RDDs: Pair Resilient Distributed Datasets
  • Special Transformations and Actions
  • Average delay per airport, use reduceByKey(), mapValues() and join()
  • Average delay per airport in one step using combineByKey()
  • Get the top airports by delay using sortBy()
  • Lookup airport descriptions using lookup(), collectAsMap(), broadcast()
  • See it in Action : Analyzing Airlines Data with PySpark - III
Advanced Spark: Accumulators, Spark Submit, MapReduce, Behind The Scenes
  • Get information from individual processing nodes using accumulators
  • See it in Action : Using an Accumulator variable
  • Long running programs using spark-submit
  • See it in Action : Running a Python script with Spark-Submit
  • Behind the scenes: What happens when a Spark script runs?
  • Running MapReduce operations
  • See it in Action : MapReduce with Spark
Java and Spark
  • The Java API and Function objects
  • Pair RDDs in Java
  • Running Java code
  • Installing Maven
  • See it in Action : Running a Spark Job with Java
PageRank: Ranking Search Results
  • What is PageRank?
  • The PageRank algorithm
  • Implement PageRank in Spark
  • Join optimization in PageRank using Custom Partitioning
  • See it in Action : The PageRank algorithm using Spark
Spark SQL
  • DataFrames: RDDs + Tables
  • See it in Action : DataFrames and Spark SQL
MLlib in Spark: Build a recommendations engine
  • Collaborative filtering algorithms
  • Latent Factor Analysis with the Alternating Least Squares method
  • Music recommendations using the Audioscrobbler dataset
  • Implement code in Spark using MLlib
Spark Streaming
  • Introduction to streaming
  • Implement stream processing in Spark using DStreams
  • Stateful transformations using sliding windows
  • See it in Action : Spark Streaming
Graph Libraries
  • The Marvel social network using Graphs