Introduction
  • You, This Course and Us
Why is Big Data a Big Deal?
  • The Big Data Paradigm
  • Serial vs Distributed Computing
  • What is Hadoop?
  • HDFS or the Hadoop Distributed File System
  • MapReduce Introduced
  • YARN or Yet Another Resource Negotiator
Installing Hadoop in a Local Environment
  • Hadoop Install Modes
  • Hadoop Standalone Mode Install
  • Hadoop Pseudo-Distributed Mode Install
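The pseudo-distributed install boils down to pointing Hadoop at a local HDFS. A sketch of the two edits the Apache single-node setup guide calls for (the property names are Hadoop's own; `localhost:9000` is the guide's conventional choice):

```xml
<!-- core-site.xml: make the local HDFS daemon the default filesystem -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a single node can only hold one replica of each block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```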
The MapReduce "Hello World"
  • The basic philosophy underlying MapReduce
  • MapReduce - Visualized And Explained
  • MapReduce - Digging a little deeper into every step
  • "Hello World" in MapReduce
  • The Mapper
  • The Reducer
  • The Job
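The Mapper, Reducer and Job pieces listed above fit in a few lines of plain Python if the framework's shuffle is simulated in memory. This is a sketch only (the function names are invented for illustration; real Hadoop code subclasses the Java Mapper and Reducer classes):

```python
from collections import defaultdict

def mapper(line):
    # Emit a (word, 1) pair for every word, like WordCount's Mapper.map()
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Sum the counts for one word, like WordCount's Reducer.reduce()
    return key, sum(values)

def run_job(lines):
    # The "Job" wiring: map every line, shuffle, then reduce each group
    pairs = [kv for line in lines for kv in mapper(line)]
    return dict(reducer(k, vs) for k, vs in shuffle(pairs).items())
```

The shuffle step is what Hadoop does for you: by the time `reducer` runs, every value for a given key has been collected in one place.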
Run a MapReduce Job
  • Get comfortable with HDFS
  • Run your first MapReduce Job
Juicing your MapReduce - Combiners, Shuffle and Sort, and the Streaming API
  • Parallelize the reduce phase - use the Combiner
  • Not all Reducers are Combiners
  • How many mappers and reducers does your MapReduce have?
  • Parallelizing reduce using Shuffle And Sort
  • MapReduce is not limited to the Java language - Introducing the Streaming API
  • Python for MapReduce
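With the Streaming API, the mapper and reducer are ordinary programs that read lines on stdin and write tab-separated key/value lines on stdout, with Hadoop sorting the map output by key in between. A minimal word-count sketch of that contract (in real use each function would be its own script piping `sys.stdin` to `sys.stdout`):

```python
def map_lines(lines):
    # Streaming mapper: one "key\tvalue" line per word
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_lines(sorted_lines):
    # Streaming reducer: input arrives sorted by key, so all lines for
    # one word are adjacent and can be totalled with a running count
    current, total = None, 0
    for line in sorted_lines:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"
```

The sort between the phases is the whole trick: the reducer never needs a dictionary, only a comparison against the previous key.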
HDFS and YARN
  • HDFS - Protecting against data loss using replication
  • HDFS - Name nodes and why they're critical
  • HDFS - Checkpointing to back up name node information
  • YARN - Basic components
  • YARN - Submitting a job to YARN
  • YARN - Plug in scheduling policies
  • YARN - Configure the scheduler
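Swapping the scheduling policy comes down to one property in yarn-site.xml. A sketch of replacing the default CapacityScheduler with the FairScheduler (the property and class names are Hadoop's own):

```xml
<!-- yarn-site.xml: choose the policy the ResourceManager schedules with -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
```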
MapReduce Customizations For Finer Grained Control
  • Setting up your MapReduce to accept command line arguments
  • The Tool, ToolRunner and GenericOptionsParser
  • Configuring properties of the Job object
  • Customizing the Partitioner, Sort Comparator, and Group Comparator
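The partitioner decides which reducer each key goes to. Hadoop's default HashPartitioner computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`; a plain-Python simulation of the same idea (function names invented):

```python
def hash_partition(key, num_reducers):
    # Mirror HashPartitioner: mask off the sign bit, then take the
    # hash modulo the number of reduce tasks
    return (hash(key) & 0x7FFFFFFF) % num_reducers

def partition_pairs(pairs, num_reducers):
    # Route each (key, value) pair to the bucket its key hashes to;
    # every occurrence of a key must land on the same reducer
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[hash_partition(key, num_reducers)].append((key, value))
    return buckets
```

A custom partitioner replaces `hash_partition` with any function of the key, which is exactly what the Sort and Group Comparators then rely on within each reducer.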
The Inverted Index, Custom Data Types for Keys, Bigram Counts and Unit Tests!
  • The heart of search engines - The Inverted Index
  • Generating the inverted index using MapReduce
  • Custom data types for keys - The Writable Interface
  • Represent a Bigram using a WritableComparable
  • MapReduce to count the Bigrams in input text
  • Setting up your Hadoop project
  • Test your MapReduce job using MRUnit
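In the Java API a bigram key needs a custom WritableComparable; in a plain-Python sketch a tuple plays that role, since tuples already compare field by field the way a compareTo implementation would (names invented for illustration):

```python
from collections import Counter

def bigram_mapper(line):
    # Emit ((first, second), 1) for each adjacent word pair; the tuple
    # key stands in for a custom WritableComparable bigram type
    words = line.split()
    for first, second in zip(words, words[1:]):
        yield (first, second), 1

def count_bigrams(lines):
    # Shuffle-and-sum, as the reduce phase would do per bigram key
    counts = Counter()
    for line in lines:
        for bigram, one in bigram_mapper(line):
            counts[bigram] += one
    return counts
```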
Input and Output Formats and Customized Partitioning
  • Introducing the File Input Format
  • Text And Sequence File Formats
  • Data partitioning using a custom partitioner
  • Make the custom partitioner real in code
  • Total Order Partitioning
  • Input Sampling, Distribution and Partitioning - and how to configure them
  • Secondary Sort
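Secondary sort makes the framework order the values within each key by sorting on a composite key and then grouping on the natural key alone. A plain-Python simulation of the effect (function name invented):

```python
from itertools import groupby

def secondary_sort(pairs):
    # Sort on the composite (key, value) so values arrive at the
    # reducer already ordered, then group on the natural key only
    ordered = sorted(pairs, key=lambda kv: (kv[0], kv[1]))
    return {k: [v for _, v in group]
            for k, group in groupby(ordered, key=lambda kv: kv[0])}
```

In Hadoop the same split of duties falls to the Sort Comparator (orders the composite key) and the Group Comparator (groups by the natural key).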
Recommendation Systems using Collaborative Filtering
  • Introduction to Collaborative Filtering
  • Friend recommendations using chained MR jobs
  • Get common friends for every pair of users - the first MapReduce
  • Top 10 friend recommendations for every user - the second MapReduce
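The two chained jobs can be simulated in memory: the first counts mutual friends per pair of users, the second ranks candidates per user. A sketch assuming friendship is symmetric and `friends` maps every user to a set of friends (all names invented):

```python
from collections import defaultdict
from itertools import combinations

def mutual_friend_counts(friends):
    # First job: each user is a mutual friend of every pair in their
    # list, so emit that pair once per mutual friend and count
    counts = defaultdict(int)
    for user, user_friends in friends.items():
        for a, b in combinations(sorted(user_friends), 2):
            counts[(a, b)] += 1
    return counts

def top_recommendations(friends, n=10):
    # Second job: for each user, rank non-friends by mutual-friend count
    counts = mutual_friend_counts(friends)
    recs = defaultdict(list)
    for (a, b), mutual in counts.items():
        if b not in friends[a]:  # skip pairs who are already friends
            recs[a].append((mutual, b))
            recs[b].append((mutual, a))
    return {user: [name for _, name in sorted(pairs, reverse=True)[:n]]
            for user, pairs in recs.items()}
```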
Hadoop as a Database
  • Structured data in Hadoop
  • Running an SQL Select with MapReduce
  • Running an SQL Group By with MapReduce
  • A MapReduce Join - The Map Side
  • A MapReduce Join - The Reduce Side
  • A MapReduce Join - Sorting and Partitioning
  • A MapReduce Join - Putting it all together
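Putting the join together: the map phase tags each record with its source table and emits the join key as the MapReduce key, so matching rows from both tables meet in the same reduce call, which crosses them. An in-memory sketch of the reduce side (table and function names invented):

```python
from collections import defaultdict

def reduce_side_join(customers, orders):
    # "Map" phase: group both tagged record streams by the join key
    grouped = defaultdict(lambda: ([], []))
    for cust_id, name in customers:
        grouped[cust_id][0].append(name)
    for cust_id, amount in orders:
        grouped[cust_id][1].append(amount)
    # "Reduce" phase: cross the two tagged lists for each key
    # (inner-join semantics: keys missing from either side drop out)
    return [(cust_id, name, amount)
            for cust_id, (names, amounts) in grouped.items()
            for name in names for amount in amounts]
```

The sorting and partitioning bullets above are what guarantee this grouping on a real cluster: both tables must partition on the join key so matching rows reach the same reducer.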
K-Means Clustering
  • What is K-Means Clustering?
  • A MapReduce job for K-Means Clustering
  • K-Means Clustering - Measuring the distance between points
  • K-Means Clustering - Custom Writables for Input/Output
  • K-Means Clustering - Configuring the Job
  • K-Means Clustering - The Mapper and Reducer
  • K-Means Clustering - The Iterative MapReduce Job
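The iterative job runs one MapReduce pass per iteration: the map phase assigns each point to its nearest centroid, the reduce phase averages each cluster into a new centroid, and the driver feeds the result back in. A self-contained 2-D sketch (plain Python, names invented):

```python
def kmeans(points, centroids, iterations=10):
    # Each loop iteration stands in for one full MapReduce job
    for _ in range(iterations):
        # "Map": assign every point to its nearest centroid
        # (squared Euclidean distance, so no square root is needed)
        clusters = {i: [] for i in range(len(centroids))}
        for x, y in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: (x - centroids[i][0]) ** 2
                                        + (y - centroids[i][1]) ** 2)
            clusters[nearest].append((x, y))
        # "Reduce": average each cluster; keep an empty cluster's centroid
        centroids = [
            (sum(p[0] for p in pts) / len(pts),
             sum(p[1] for p in pts) / len(pts)) if pts else centroids[i]
            for i, pts in clusters.items()
        ]
    return centroids
```

On a cluster, the custom Writables carry the point and centroid data between these phases, and the driver re-submits the job until the centroids stop moving.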
Setting up a Hadoop Cluster
  • Manually configuring a Hadoop cluster (Linux VMs)
  • Getting started with Amazon Web Services
  • Start a Hadoop Cluster with Cloudera Manager on AWS
Appendix
  • Set up a Virtual Linux Instance (For Windows users)
  • [For Linux/Mac OS Shell Newbies] Path and other Environment Variables