Introduction
  • Welcome
  • Downloading the Code
  • Module 1 - Introduction
  • Spark Architecture and RDDs
Getting Started
  • Warning - Java 9+ is not supported by Spark 2. You can optionally use Spark 3.
  • Installing Spark
Reduces on RDDs
  • Reduces on RDDs
Mapping and Outputting
  • Mapping Operations
  • Outputting Results to the Console
  • Counting Big Data Items
  • If you've had a "NotSerializableException" in Spark
Tuples
  • RDDs of Objects
  • Tuples and RDDs
PairRDDs
  • Overview of PairRDDs
  • Building a PairRDD
  • Coding a ReduceByKey
  • Using the Fluent API
  • Grouping By Key
FlatMaps and Filters
  • FlatMaps
  • Filters
Reading from Disk
  • Reading from Disk
Keyword Ranking Practical
  • Practical Requirements
  • Worked Solution
  • Worked Solution (continued) with Sorting
Sorts and Coalesce
  • Why do sorts not work with foreach in Spark?
  • Why Coalesce is the Wrong Solution
  • What is Coalesce used for in Spark?
Deploying to AWS EMR (Optional)
  • How to start an EMR Spark Cluster
  • Packing a Spark Jar for EMR
  • Running a Spark Job on EMR
  • Understanding the Job Progress Output
  • Calculating EMR costs and Terminating the cluster
Joins
  • Inner Joins
  • Left Outer Joins and Optionals
  • Right Outer Joins
  • Full Joins and Cartesians
Big Data Big Exercise
  • Introducing the Requirements
  • Warmup
  • Main Exercise Requirements
  • Walkthrough - Step 2
  • Walkthrough - Step 3
  • Walkthrough - Step 4
  • Walkthrough - Step 5
  • Walkthrough - Step 6
  • Walkthrough - Step 7
  • Walkthrough - Step 8
  • Walkthrough - Step 9, adding titles and using the Big Data file
RDD Performance
  • Transformations and Actions
  • The DAG and SparkUI
  • Narrow vs Wide Transformations
  • Shuffles
  • Dealing with Key Skews
  • Avoiding groupByKey and using map-side reduces instead
  • Caching and Persistence
Module 2 - Chapter 1 SparkSQL Introduction
  • Code for SQL/DataFrames Section
  • Introducing SparkSQL
SparkSQL Getting Started
  • SparkSQL Getting Started
Datasets
  • Dataset Basics
  • Filters using Expressions
  • Filters using Lambdas
  • Filters using Columns
The Full SQL Syntax
  • Using a Spark Temporary View for SQL
In Memory Data
  • In Memory Data
Groupings and Aggregations
  • Groupings and Aggregations
Date Formatting
  • Date Formatting
Multiple Groupings
  • Multiple Groupings
Ordering
  • Ordering
DataFrames API
  • SQL vs DataFrames
  • DataFrame Grouping
Pivot Tables
  • How does a Pivot Table work?
  • Coding a Pivot Table in Spark
More Aggregations
  • How to use the agg method in Spark
Practical Exercise
  • Building a Pivot Table with Multiple Aggregations