- CCA 175 Spark and Hadoop Developer - Curriculum
- Using labs for preparation
- Setup Development Environment (Windows 10) - Introduction
- Setup Development Environment - Python and Spark - Pre-requisites
- Setup Development Environment - Python Setup on Windows
- Setup Development Environment - Configure Environment Variables
- Setup Development Environment - Setup PyCharm for developing Python applications
- Setup Development Environment - Pass run-time arguments or parameters
- Setup Development Environment - Download Spark compressed tarball
- Setup Development Environment - Install 7z to uncompress and untar on Windows
- Setup Development Environment - Setup Spark
- Setup Development Environment - Install JDK
- Setup Development Environment - Configure environment variables for Spark
- Setup Development Environment - Install WinUtils - integrate Windows and HDFS
- Setup Development Environment - Integrate PyCharm and Spark on Windows 10
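
A quick way to verify the PyCharm/Spark wiring at the end of this setup is a small smoke test. A minimal sketch, assuming Spark was extracted to `C:\spark`, winutils.exe sits under `C:\hadoop\bin`, and the `findspark` package is used to put pyspark on the path; the paths and the use of findspark are illustrative, not the course's exact procedure:

```python
import os

# Illustrative paths - adjust to wherever Spark and winutils.exe were extracted
os.environ["SPARK_HOME"] = "C:\\spark"
os.environ["HADOOP_HOME"] = "C:\\hadoop"   # must contain bin\winutils.exe

import findspark   # assumption: installed via pip install findspark
findspark.init()   # adds pyspark from SPARK_HOME to sys.path

from pyspark import SparkContext

sc = SparkContext(master="local[*]", appName="PyCharmSmokeTest")
print(sc.parallelize(range(10)).sum())   # 45 if everything is wired up
sc.stop()
```
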
- Introduction and Setting up Python
- Basic Programming Constructs
- Functions in Python
- Python Collections
- Map Reduce operations on Python Collections
- Setting up Data Sets for Basic I/O Operations
- Basic I/O operations and processing data using Collections
- Get revenue for given order id - as application
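
The "revenue for a given order id" exercise ties the Python sections together: read a file into a collection, filter it, and aggregate, with the order id passed as a run-time argument. A minimal sketch, assuming the comma-delimited retail_db `order_items` layout (id, order_id, product_id, quantity, subtotal, product_price); the path and field positions are assumptions:

```python
import sys

def get_order_revenue(order_items_path, order_id):
    # Read the file into a plain Python collection
    with open(order_items_path) as fh:
        order_items = fh.read().splitlines()
    # Filter on order_id (field 1) and sum the subtotals (field 4)
    return sum(float(rec.split(",")[4])
               for rec in order_items
               if int(rec.split(",")[1]) == order_id)

if __name__ == "__main__":
    # e.g. python get_order_revenue.py /data/retail_db/order_items/part-00000 2
    print(get_order_revenue(sys.argv[1], int(sys.argv[2])))
```
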
- Setup Environment - Options
- Setup Environment - Locally
- Setup Environment - using Cloudera Quickstart VM
- Using ITVersity platforms - Big Data Developer labs and forum
- Using ITVersity's big data labs
- Using Windows - PuTTY and WinSCP
- Using Windows - Cygwin
- HDFS Quick Preview
- YARN Quick Preview
- Setup Data Sets
- Introduction
- Introduction to Spark
- Setup Spark on Windows
- Quick overview of Spark documentation
- Connecting to the environment
- Initializing Spark job using pyspark
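
The pyspark shell builds `sc` (and `sqlContext`) for you; a stand-alone application has to construct them itself. A minimal sketch of the initialization, with master and application name as illustrative values (Spark 1.x syntax, matching the course):

```python
from pyspark import SparkConf, SparkContext

# "yarn-client" targets a YARN cluster in Spark 1.x; use "local[*]" on a laptop
conf = SparkConf().setMaster("yarn-client").setAppName("Daily Revenue")
sc = SparkContext(conf=conf)
```
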
- Create RDD from HDFS files
- Create RDD from collection - using parallelize
- Read data from different file formats - using sqlContext
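
A minimal sketch of the three RDD creation paths above; the HDFS paths are illustrative (retail_db, as used throughout the course), and `sc` is the SparkContext from the previous step:

```python
from pyspark.sql import SQLContext

# From HDFS files: one element per line
orders = sc.textFile("/public/retail_db/orders")

# From a local collection, using parallelize
nums = sc.parallelize(range(1, 101))

# From other file formats, using sqlContext (json here)
sqlContext = SQLContext(sc)
ordersDF = sqlContext.read.json("/public/retail_db_json/orders")
ordersRDD = ordersDF.rdd   # DataFrame back to an RDD of Rows
```
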
- Row Level Transformations - String Manipulation
- Row Level Transformations - map
- Row Level Transformations - flatMap
- Filtering data using filter
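
A minimal sketch of the row-level transformations above on the orders data set, assuming the comma-delimited schema (order_id, order_date, order_customer_id, order_status):

```python
orders = sc.textFile("/public/retail_db/orders")

# map: exactly one output record per input record
orderStatuses = orders.map(lambda o: o.split(",")[3].lower())

# flatMap: zero or more output records per input record
fields = orders.flatMap(lambda o: o.split(","))

# filter: keep only the records satisfying a predicate
completeOrders = orders.filter(lambda o: o.split(",")[3] == "COMPLETE")
```
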
- Joining Data Sets - Introduction
- Joining Data Sets - Inner Join
- Joining Data Sets - Outer Join
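
A minimal sketch of inner and outer joins between orders and order_items, keyed by order id; field positions follow the assumed retail_db schemas:

```python
orders = sc.textFile("/public/retail_db/orders")
orderItems = sc.textFile("/public/retail_db/order_items")

ordersMap = orders.map(lambda o: (int(o.split(",")[0]), o.split(",")[1]))
orderItemsMap = orderItems.map(
    lambda oi: (int(oi.split(",")[1]), float(oi.split(",")[4])))

inner = ordersMap.join(orderItemsMap)            # only order ids present on both sides
outer = ordersMap.leftOuterJoin(orderItemsMap)   # all orders; None where no items exist
```
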
- Aggregations - Introduction
- Aggregations - count and reduce - Get revenue for order id
- Aggregations - reduce - Get order item with minimum subtotal for order id
- Aggregations - countByKey - Get order count by status
- Aggregations - understanding combiner
- Aggregations - groupByKey - Get revenue for each order id
- Aggregations - groupByKey - Get order items sorted by order_item_subtotal for each order id
- Aggregations - reduceByKey - Get revenue for each order id
- Aggregations - aggregateByKey - Get revenue and count of items for each order id
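
Minimal sketches of the aggregation patterns above on (order_id, subtotal) pairs. The key point about the combiner: reduceByKey and aggregateByKey combine values within each partition before shuffling, while groupByKey ships every value across the network. Paths and field positions are the assumed retail_db layout:

```python
orders = sc.textFile("/public/retail_db/orders")
orderItems = sc.textFile("/public/retail_db/order_items")
oiPairs = orderItems.map(
    lambda oi: (int(oi.split(",")[1]), float(oi.split(",")[4])))

# reduce (total aggregation): revenue for one order id, say 2
revenue2 = oiPairs.filter(lambda kv: kv[0] == 2) \
                  .map(lambda kv: kv[1]) \
                  .reduce(lambda a, b: a + b)

# countByKey: order count by status (returns a dict to the driver)
statusCounts = orders.map(lambda o: (o.split(",")[3], 1)).countByKey()

# groupByKey: revenue per order id without a combiner (shuffles all values)
revenueGBK = oiPairs.groupByKey().map(lambda kv: (kv[0], sum(kv[1])))

# reduceByKey: same result, with a combiner
revenuePerOrder = oiPairs.reduceByKey(lambda a, b: a + b)

# aggregateByKey: (revenue, item count) per order id in one pass
revAndCount = oiPairs.aggregateByKey(
    (0.0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),    # within a partition
    lambda a, b: (a[0] + b[0], a[1] + b[1]))    # across partitions
```
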
- Sorting - sortByKey - Sort data by product price
- Sorting - sortByKey - Sort data by category id and then by price descending
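
A minimal sketch of both sorts on products, assuming the schema (product_id, category_id, name, description, price, image); the filter drops the known record with an empty price:

```python
products = sc.textFile("/public/retail_db/products") \
             .filter(lambda p: p.split(",")[4] != "")

# Sort by product price ascending
byPrice = products.map(lambda p: (float(p.split(",")[4]), p)).sortByKey()

# Composite key: category id ascending, then price descending (negated)
byCatThenPrice = products \
    .map(lambda p: ((int(p.split(",")[1]), -float(p.split(",")[4])), p)) \
    .sortByKey() \
    .map(lambda kv: kv[1])
```
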
- Ranking - Introduction
- Ranking - Global Ranking using sortByKey and take
- Ranking - Global Ranking using takeOrdered or top
- Ranking - By Key - Get top N products by price per category - Introduction
- Ranking - By Key - Get top N products by price per category - Python collections
- Ranking - By Key - Get top N products by price per category - using flatMap
- Ranking - By Key - Get top N priced products - Introduction
- Ranking - By Key - Get top N priced products - using Python collections API
- Ranking - By Key - Get top N priced products - Create Function
- Ranking - By Key - Get top N priced products - integrate with flatMap
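
Minimal sketches of global and per-key ranking. `getTopNPricedProducts` is a hypothetical helper standing in for the "create function, integrate with flatMap" lessons, and this simplified version takes the first N records rather than handling price ties:

```python
products = sc.textFile("/public/retail_db/products") \
             .filter(lambda p: p.split(",")[4] != "")

# Global top 5 by price: takeOrdered with a negated key (top() also works)
top5 = products.takeOrdered(5, key=lambda p: -float(p.split(",")[4]))

# Per-key: rank within each category with plain Python, then flatten
def getTopNPricedProducts(productsIter, topN):
    return sorted(productsIter,
                  key=lambda p: float(p.split(",")[4]),
                  reverse=True)[:topN]

top3PerCategory = products \
    .map(lambda p: (int(p.split(",")[1]), p)) \
    .groupByKey() \
    .flatMap(lambda kv: getTopNPricedProducts(kv[1], 3))
```
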
- Set Operations - Introduction
- Set Operations - Prepare data
- Set Operations - union and distinct
- Set Operations - intersection and subtract
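
A minimal sketch of the set operations, using an illustrative exercise of comparing customers who ordered in 2013 August versus September; the date-prefix filter assumes the retail_db timestamp format:

```python
orders = sc.textFile("/public/retail_db/orders")

aug = orders.filter(lambda o: o.split(",")[1][:7] == "2013-08") \
            .map(lambda o: int(o.split(",")[2]))
sep = orders.filter(lambda o: o.split(",")[1][:7] == "2013-09") \
            .map(lambda o: int(o.split(",")[2]))

allCustomers = aug.union(sep).distinct()   # union keeps duplicates; distinct dedups
bothMonths = aug.intersection(sep)         # in both months (already deduplicated)
augOnly = aug.subtract(sep).distinct()     # "minus": August but not September
```
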
- Saving data into HDFS - text file format
- Saving data into HDFS - text file format with compression
- Saving data into HDFS using Data Frames - json
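
Minimal sketches of the three save paths above; output directories are illustrative and must not already exist, and the revenue-per-order RDD is rebuilt here so the block stands alone:

```python
from pyspark.sql import SQLContext

revenuePerOrder = sc.textFile("/public/retail_db/order_items") \
    .map(lambda oi: (int(oi.split(",")[1]), float(oi.split(",")[4]))) \
    .reduceByKey(lambda a, b: a + b)

# Plain text
revenuePerOrder.map(lambda kv: "{0},{1}".format(kv[0], kv[1])) \
    .saveAsTextFile("/user/training/revenue_per_order")

# Text with compression, passing the codec class by name
revenuePerOrder.map(lambda kv: "{0},{1}".format(kv[0], kv[1])) \
    .saveAsTextFile(
        "/user/training/revenue_per_order_gz",
        compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")

# json via a Data Frame
sqlContext = SQLContext(sc)
revenueDF = sqlContext.createDataFrame(revenuePerOrder, ["order_id", "revenue"])
revenueDF.write.json("/user/training/revenue_per_order_json")
```
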
- Problem Statement
- Launching pyspark
- Reading data from HDFS and filtering
- Joining orders and order_items
- Aggregate to get daily revenue per product id
- Load products and convert into RDD
- Join and sort the data
- Save to HDFS and validate in text file format
- Saving data in Avro file format
- Get data to local file system using get or copyToLocal
- Develop as application to get daily revenue per product
- Run as application on the cluster
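
Pulling the walkthrough together, a minimal end-to-end sketch of the application; paths and master come in as run-time arguments, `/public/retail_db` and the output directory are illustrative, and the Avro step is omitted since it needs the external spark-avro package:

```python
import sys
from pyspark import SparkConf, SparkContext

def main(master, inputBaseDir, outputDir):
    sc = SparkContext(
        conf=SparkConf().setMaster(master).setAppName("Daily Revenue Per Product"))

    # (order_id, order_date) for COMPLETE/CLOSED orders
    orders = sc.textFile(inputBaseDir + "/orders") \
        .filter(lambda o: o.split(",")[3] in ("COMPLETE", "CLOSED")) \
        .map(lambda o: (int(o.split(",")[0]), o.split(",")[1]))

    # (order_id, (product_id, subtotal))
    orderItems = sc.textFile(inputBaseDir + "/order_items") \
        .map(lambda oi: (int(oi.split(",")[1]),
                         (int(oi.split(",")[2]), float(oi.split(",")[4]))))

    # ((order_date, product_id), revenue)
    dailyRevenuePerProductId = orders.join(orderItems) \
        .map(lambda kv: ((kv[1][0], kv[1][1][0]), kv[1][1][1])) \
        .reduceByKey(lambda a, b: a + b)

    # (product_id, product_name)
    products = sc.textFile(inputBaseDir + "/products") \
        .map(lambda p: (int(p.split(",")[0]), p.split(",")[2]))

    # Join on product_id, sort by date ascending then revenue descending, save
    dailyRevenuePerProductId \
        .map(lambda kv: (kv[0][1], (kv[0][0], kv[1]))) \
        .join(products) \
        .map(lambda kv: ((kv[1][0][0], -kv[1][0][1]),
                         "{0},{1:.2f},{2}".format(
                             kv[1][0][0], kv[1][0][1], kv[1][1]))) \
        .sortByKey() \
        .map(lambda kv: kv[1]) \
        .saveAsTextFile(outputDir)

if __name__ == "__main__":
    # e.g. spark-submit daily_revenue.py yarn-client /public/retail_db /user/me/daily_revenue
    main(sys.argv[1], sys.argv[2], sys.argv[3])
```
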
- Different interfaces to run SQL - Hive, Spark SQL
- Create database and tables in text file format - orders and order_items
- Create database and tables in ORC file format - orders and order_items
- Running SQL/Hive Commands using pyspark
- Functions - Getting Started
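
Hive and Spark SQL can both be driven from pyspark; in Spark 1.x (as used here) `HiveContext` gives SQL access to Hive databases and functions. A minimal sketch, with the database name as an illustrative stand-in for the orders tables created above:

```python
from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)
sqlContext.sql("CREATE DATABASE IF NOT EXISTS retail_db_txt")
sqlContext.sql("USE retail_db_txt")
sqlContext.sql("""
    CREATE TABLE IF NOT EXISTS orders (
        order_id INT,
        order_date STRING,
        order_customer_id INT,
        order_status STRING
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
""")
sqlContext.sql(
    "SELECT order_status, count(*) FROM orders GROUP BY order_status").show()
```
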