- CCA 175 Spark and Hadoop Developer - Curriculum
- Introduction and Setting up of Scala
- Setup Scala on Windows
- Basic Programming Constructs
- Functions
- Object Oriented Concepts - Classes
- Object Oriented Concepts - Objects
- Object Oriented Concepts - Case Classes
- Collections - Seq, Set and Map
- Basic Map Reduce Operations
- Setting up Data Sets for Basic I/O Operations
- Basic I/O Operations and using Scala Collections APIs
- Tuples
- Development Cycle - Developing Source code
- Development Cycle - Compile source code to jar using SBT
- Development Cycle - Setup SBT on Windows
- Development Cycle - Compile changes and run jar with arguments
- Development Cycle - Setup IntelliJ with Scala
- Development Cycle - Develop Scala application using SBT in IntelliJ
- Introduction and Curriculum
- Setup Environment - Options
- Setup Environment - Locally
- Setup Environment - using Cloudera Quickstart VM
- Using Windows - Putty and WinSCP
- Using Windows - Cygwin
- HDFS Quick Preview
- YARN Quick Preview
- Setup Data Sets
- Introduction
- Introduction to Spark
- Setup Spark on Windows
- Quick overview about Spark documentation
- Initializing Spark job using spark-shell
- Create Resilient Distributed Data Sets (RDD)
- Previewing data from RDD
- Reading different file formats - Brief overview using JSON
- Transformations Overview
- Manipulating Strings as part of transformations using Scala
- Row level transformations using map
- Row level transformations using flatMap
- Filtering the data
- Joining data sets - inner join
- Joining data sets - outer join
- Aggregations - Getting Started
- Aggregations - using actions (reduce and countByKey)
- Aggregations - understanding combiner
- Aggregations using groupByKey - least preferred API for aggregations
- Aggregations using reduceByKey
- Aggregations using aggregateByKey
- Sorting data using sortByKey
- Global Ranking - using sortByKey with take and takeOrdered
- By Key Ranking - Converting (K, V) pairs into (K, Iterable[V]) using groupByKey
- Get topNPrices using Scala Collections API
- Get topNPricedProducts using Scala Collections API
- Get top n products by category using groupByKey, flatMap and Scala function
- Set Operations - union, intersect, distinct and minus
- Save data in Text Input Format
- Save data in Text Input Format using Compression
- Saving data in standard file formats - Overview
- Revision of Problem Statement and Design the solution
- Solution - Get Daily Revenue per Product - Launching Spark Shell
- Solution - Get Daily Revenue per Product - Read and join orders and order_items
- Solution - Get Daily Revenue per Product - Compute daily revenue per product id
- Solution - Get Daily Revenue per Product - Read products data and create RDD
- Solution - Get Daily Revenue per Product - Sort and save to HDFS
- Solution - Add spark dependencies to sbt
- Solution - Develop as Scala based application
- Solution - Run locally using spark-submit
- Solution - Ship and run it on big data cluster
- Introduction to Setting up Environment for Practice
- Overview of ITVersity Boxes GitHub Repository
- Creating Virtual Machine
- Starting HDFS and YARN
- Gracefully Stopping Virtual Machine
- Understanding Datasets provided in Virtual Machine
- Using GitHub Content for the practice
- Using Resources for Practice
- Introduction for the module
- Starting Spark Context
- Overview of Spark read APIs
- Previewing Schema and Data
- Overview of Data Frame APIs
- Overview of Functions
- Overview of Spark Write APIs
- Introduction to Pre-defined Functions
- Creating Spark Session Object in Notebook
- Create Dummy Data Frames for Practice
- Categories of Functions
- Using Special Functions - col
- Using Special Functions - lit
- String Manipulation Functions - Case Conversion and Length
- String Manipulation - Extracting data from fixed length fields using substring
- String Manipulation - Extracting data from delimited fields using split
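
The last two string-manipulation topics can be previewed with plain Scala string methods, which behave the same way inside Spark transformations. This is a minimal sketch; the fixed-length layout and the sample records are hypothetical, not taken from the course datasets:

```scala
object StringManipulationSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical fixed-length record: 5-char id, 10-char name, 8-char date
    val fixed = "00001Jane      20140725"
    val id   = fixed.substring(0, 5)          // "00001"
    val name = fixed.substring(5, 15).trim    // "Jane"
    val date = fixed.substring(15, 23)        // "20140725"
    println(s"$id,$name,$date")

    // Hypothetical comma-delimited record in the style of an orders table
    val delimited = "1,2013-07-25 00:00:00.0,11599,CLOSED"
    val fields = delimited.split(",")
    println(fields(3))                        // status field: "CLOSED"
  }
}
```

Inside Spark, the same logic would typically run per record, e.g. `rdd.map(_.split(",")(3))`.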