Introduction
  • CCA 175 Spark and Hadoop Developer - Curriculum
  • Using labs for preparation
  • Setup Development Environment (Windows 10) - Introduction
  • Setup Development Environment - Python and Spark - Pre-requisites
  • Setup Development Environment - Python Setup on Windows
  • Setup Development Environment - Configure Environment Variables
  • Setup Development Environment - Setup PyCharm for developing Python applications
  • Setup Development Environment - Pass run-time arguments or parameters
  • Setup Development Environment - Download the Spark compressed tarball
  • Setup Development Environment - Install 7z to uncompress and untar on Windows
  • Setup Development Environment - Setup Spark
  • Setup Development Environment - Install JDK
  • Setup Development Environment - Configure environment variables for Spark
  • Setup Development Environment - Install WinUtils - integrate Windows and HDFS
  • Setup Development Environment - Integrate PyCharm and Spark on Windows 10
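The setup module ends with pointing PyCharm at a locally extracted Spark so that pyspark code can be developed and run from the IDE. A minimal sketch of that wiring is below; the install locations (C:\spark-1.6.3-bin-hadoop2.6 and C:\winutils) are assumptions and should be replaced with wherever the Spark tarball and winutils.exe were actually extracted.

    # Make the pyspark and py4j packages shipped with Spark importable,
    # then spin up a local SparkContext as a smoke test of the setup.
    # Paths are assumptions - adjust to the actual install locations.
    import glob
    import os
    import sys

    os.environ.setdefault("SPARK_HOME", r"C:\spark-1.6.3-bin-hadoop2.6")
    os.environ.setdefault("HADOOP_HOME", r"C:\winutils")  # folder containing bin\winutils.exe

    spark_home = os.environ["SPARK_HOME"]
    sys.path.insert(0, os.path.join(spark_home, "python"))
    sys.path.extend(glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip")))

    from pyspark import SparkConf, SparkContext

    sc = SparkContext(conf=SparkConf().setMaster("local[*]").setAppName("SetupCheck"))
    print(sc.version)
    sc.stop()
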
Python Fundamentals
  • Introduction and Setting up Python
  • Basic Programming Constructs
  • Functions in Python
  • Python Collections
  • Map Reduce operations on Python Collections
  • Setting up Data Sets for Basic I/O Operations
  • Basic I/O operations and processing data using Collections
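A minimal sketch of the map, filter and reduce style operations on plain Python collections that this module covers, using a small hand-made list of (order_id, product_id, subtotal) tuples in place of the course data sets.

    from functools import reduce

    # (order_id, product_id, subtotal) records, standing in for order_items
    order_items = [
        (1, 957, 299.98),
        (2, 1073, 199.99),
        (2, 502, 250.00),
        (2, 403, 129.99),
        (4, 897, 49.98),
    ]

    # filter: keep the items for order id 2; map: project the subtotal
    subtotals = map(lambda item: item[2],
                    filter(lambda item: item[0] == 2, order_items))

    # reduce: add the subtotals up to get the revenue for the order
    revenue = reduce(lambda total, cur: total + cur, subtotals)
    print(round(revenue, 2))  # 579.98
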
Getting Started
  • Get revenue for given order id - as application
  • Setup Environment - Options
  • Setup Environment - Locally
  • Setup Environment - using Cloudera Quickstart VM
  • Using Itversity platforms - Big Data Developer labs and forum
  • Using Itversity's big data labs
  • Using Windows - PuTTY and WinSCP
  • Using Windows - Cygwin
  • HDFS Quick Preview
  • YARN Quick Preview
  • Setup Data Sets
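The "Get revenue for given order id - as application" lesson turns the same collection processing into a standalone script driven by run-time arguments. A minimal sketch, assuming a local copy of the comma-delimited retail_db order_items file with the order id in the second field and the subtotal in the fifth:

    import sys
    from functools import reduce

    def get_order_revenue(order_items_path, order_id):
        # Read the file into a collection, keep the items for the given
        # order id, project the subtotal and add the subtotals up
        with open(order_items_path) as fh:
            order_items = fh.read().splitlines()
        subtotals = [float(rec.split(",")[4])
                     for rec in order_items
                     if int(rec.split(",")[1]) == order_id]
        return reduce(lambda total, cur: total + cur, subtotals, 0.0)

    if __name__ == "__main__":
        # e.g. python get_order_revenue.py /data/retail_db/order_items/part-00000 2
        print(round(get_order_revenue(sys.argv[1], int(sys.argv[2])), 2))
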
Apache Spark 1.6 - Transform, Stage and Store
  • Introduction
  • Introduction to Spark
  • Setup Spark on Windows
  • Quick overview about Spark documentation
  • Connecting to the environment
  • Initializing Spark job using pyspark
  • Create RDD from HDFS files
  • Create RDD from collection - using parallelize
  • Read data from different file formats - using sqlContext
  • Row level transformations - String Manipulation
  • Row Level Transformations - map
  • Row Level Transformations - flatMap
  • Filtering data using filter
  • Joining Data Sets - Introduction
  • Joining Data Sets - Inner Join
  • Joining Data Sets - Outer Join
  • Aggregations - Introduction
  • Aggregations - count and reduce - Get revenue for order id
  • Aggregations - reduce - Get order item with minimum subtotal for order id
  • Aggregations - countByKey - Get order count by status
  • Aggregations - understanding combiner
  • Aggregations - groupByKey - Get revenue for each order id
  • Aggregations - groupByKey - Get order items sorted by order_item_subtotal for each order id
  • Aggregations - reduceByKey - Get revenue for each order id
  • Aggregations - aggregateByKey - Get revenue and count of items for each order id
  • Sorting - sortByKey - Sort data by product price
  • Sorting - sortByKey - Sort data by category id and then by price descending
  • Ranking - Introduction
  • Ranking - Global Ranking using sortByKey and take
  • Ranking - Global Ranking using takeOrdered or top
  • Ranking - By Key - Get top N products by price per category - Introduction
  • Ranking - By Key - Get top N products by price per category - Python collections
  • Ranking - By Key - Get top N products by price per category - using flatMap
  • Ranking - By Key - Get top N priced products - Introduction
  • Ranking - By Key - Get top N priced products - using Python collections API
  • Ranking - By Key - Get top N priced products - Create Function
  • Ranking - By Key - Get top N priced products - integrate with flatMap
  • Set Operations - Introduction
  • Set Operations - Prepare data
  • Set Operations - union and distinct
  • Set Operations - intersect and minus
  • Saving data into HDFS - text file format
  • Saving data into HDFS - text file format with compression
  • Saving data into HDFS using Data Frames - json
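A minimal pyspark sketch (Spark 1.6 RDD API) of the core pattern this module builds up to: a row-level transformation followed by a reduceByKey aggregation to get revenue per order id. The HDFS path and the comma-delimited retail_db field layout (order id in the second field, subtotal in the fifth) are assumptions.

    from pyspark import SparkConf, SparkContext

    # local[*] is for a local run; on a cluster the master comes from spark-submit
    conf = SparkConf().setAppName("Revenue per order id").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    order_items = sc.textFile("/public/retail_db/order_items")

    # Row-level transformation: (order_item_order_id, order_item_subtotal) pairs
    order_item_subtotals = order_items.map(
        lambda rec: (int(rec.split(",")[1]), float(rec.split(",")[4]))
    )

    # Aggregation: sum the subtotals for each order id
    revenue_per_order = order_item_subtotals.reduceByKey(lambda total, cur: total + cur)

    for order_id, revenue in revenue_per_order.sortByKey().take(10):
        print(order_id, round(revenue, 2))

    sc.stop()
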
Apache Spark 1.6 - Core Spark APIs - Get Daily Revenue Per Product
  • Problem Statement
  • Launching pyspark
  • Reading data from HDFS and filtering
  • Joining orders and order_items
  • Aggregate to get daily revenue per product id
  • Load products and convert into RDD
  • Join and sort the data
  • Save to HDFS and validate in text file format
  • Saving data in avro file format
  • Get data to local file system using get or copyToLocal
  • Develop as application to get daily revenue per product
  • Run as application on the cluster
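A minimal end-to-end sketch of the daily revenue per product problem with the Spark 1.6 RDD API: filter orders, join with order_items, aggregate per date and product id, join with products for the name, sort and save. The HDFS paths and the retail_db field layouts are assumptions; adjust them to the environment.

    from pyspark import SparkConf, SparkContext

    sc = SparkContext(conf=SparkConf().setAppName("Daily revenue per product").setMaster("local[*]"))

    orders = sc.textFile("/public/retail_db/orders")
    order_items = sc.textFile("/public/retail_db/order_items")
    products = sc.textFile("/public/retail_db/products")

    # Keep only COMPLETE or CLOSED orders -> (order_id, order_date)
    orders_filtered = orders. \
        filter(lambda rec: rec.split(",")[3] in ("COMPLETE", "CLOSED")). \
        map(lambda rec: (int(rec.split(",")[0]), rec.split(",")[1]))

    # (order_id, (product_id, subtotal))
    order_items_map = order_items. \
        map(lambda rec: (int(rec.split(",")[1]),
                         (int(rec.split(",")[2]), float(rec.split(",")[4]))))

    # Join, then aggregate revenue per (order_date, product_id)
    daily_revenue_per_product_id = orders_filtered.join(order_items_map). \
        map(lambda rec: ((rec[1][0], rec[1][1][0]), rec[1][1][1])). \
        reduceByKey(lambda total, cur: total + cur)

    # (product_id, product_name) for the final lookup
    products_map = products.map(lambda rec: (int(rec.split(",")[0]), rec.split(",")[2]))

    # (order_date, product_name, revenue), sorted by date and revenue descending
    daily_revenue_per_product = daily_revenue_per_product_id. \
        map(lambda rec: (rec[0][1], (rec[0][0], rec[1]))). \
        join(products_map). \
        map(lambda rec: (rec[1][0][0], rec[1][1], round(rec[1][0][1], 2))). \
        sortBy(lambda rec: (rec[0], -rec[2]))

    # Output path is an assumption - validate afterwards with hdfs dfs -cat
    daily_revenue_per_product. \
        map(lambda rec: ",".join(str(f) for f in rec)). \
        saveAsTextFile("/user/training/daily_revenue_per_product")

    sc.stop()
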
Apache Spark 1.6 - Data Analysis - Spark SQL or HiveQL using Spark Context
  • Different interfaces to run SQL - Hive, Spark SQL
  • Create database and tables in text file format - orders and order_items
  • Create database and tables in ORC file format - orders and order_items
  • Running SQL/Hive Commands using pyspark
  • Functions - Getting Started
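A minimal sketch of the workflow this module covers: creating a database and a text-format table and querying it through a HiveContext from pyspark (Spark 1.6 style). Database, table and path names are assumptions.

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(conf=SparkConf().setAppName("Spark SQL over Hive").setMaster("local[*]"))
    sqlContext = HiveContext(sc)

    sqlContext.sql("CREATE DATABASE IF NOT EXISTS retail_db_txt")
    sqlContext.sql("USE retail_db_txt")
    sqlContext.sql("""
        CREATE TABLE IF NOT EXISTS orders (
          order_id INT,
          order_date STRING,
          order_customer_id INT,
          order_status STRING
        ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    """)

    # Local path to the orders data is an assumption
    sqlContext.sql("LOAD DATA LOCAL INPATH '/data/retail_db/orders' OVERWRITE INTO TABLE orders")

    # Order count by status, as a quick validation query
    sqlContext.sql("""
        SELECT order_status, count(1) AS order_count
        FROM orders
        GROUP BY order_status
    """).show()
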