Hadoop with Python Training Course

Overview

Hadoop is a popular Big Data processing framework. Python is a high-level programming language known for its clear syntax and code readability.

In this instructor-led, live training, participants will learn how to work with Hadoop, MapReduce, Pig, and Spark using Python as they step through multiple examples and use cases.

By the end of this training, participants will be able to:

Understand the basic concepts behind Hadoop, MapReduce, Pig, and Spark
Use Python with Hadoop Distributed File System (HDFS), MapReduce, Pig, and Spark
Use Snakebite to programmatically access HDFS within Python
Use mrjob to write MapReduce jobs in Python
Write Spark programs with Python
Extend the functionality of Pig using Python UDFs
Manage MapReduce jobs and Pig scripts using Luigi

Audience

Developers
IT Professionals

Format of the course

Part lecture, part discussion, exercises and heavy hands-on practice

Requirements

Experience with Python programming
Basic familiarity with Hadoop

Course Outline

Introduction

Understanding Hadoop’s Architecture and Key Concepts

Understanding the Hadoop Distributed File System (HDFS)

Overview of HDFS and its Architectural Design
Interacting with HDFS
Performing Basic File Operations on HDFS
Overview of HDFS Command Reference
Overview of Snakebite
Installing Snakebite
Using the Snakebite Client Library
Using the CLI Client

Learning the MapReduce Programming Model with Python

Overview of the MapReduce Programming Model
Understanding Data Flow in the MapReduce Framework
- Map
- Shuffle and Sort
- Reduce
Using the Hadoop Streaming Utility
- Understanding How the Hadoop Streaming Utility Works
- Demo: Implementing the WordCount Application in Python
Using the mrjob Library
- Overview of mrjob
- Installing mrjob
- Demo: Implementing the WordCount Algorithm Using mrjob
- Understanding How a MapReduce Job Written with the mrjob Library Works
- Executing a MapReduce Application with mrjob
- Hands-on: Computing Top Salaries Using mrjob
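
As a rough illustration of the Map, Shuffle and Sort, and Reduce stages listed above, the following is a minimal single-process sketch of WordCount in plain Python. It is not course material and uses neither Hadoop Streaming nor mrjob; the function names (map_phase, shuffle_and_sort, reduce_phase, word_count) are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Shuffle and sort: group values by key, then order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three stages in order over an iterable of text lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return [reduce_phase(key, values) for key, values in shuffle_and_sort(mapped)]


if __name__ == "__main__":
    print(word_count(["big data", "big jobs"]))
```

On a real cluster the framework performs the shuffle and sort between nodes; here it is simulated in memory so that the data flow can be followed end to end.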

Learning Pig with Python

Overview of Pig
Demo: Implementing the WordCount Algorithm in Pig
Configuring and Running Pig Scripts and Pig Statements
- Using the Pig Execution Modes
- Using the Pig Interactive Mode
- Using the Pig Batch Mode
Understanding the Basic Concepts of the Pig Latin Language
- Using Statements
- Loading Data
- Transforming Data
- Storing Data
Extending Pig’s Functionality with Python UDFs
- Registering a Python UDF File
- Demo: A Simple Python UDF
- Demo: String Manipulation Using Python UDF
- Hands-on: Calculating the 10 Most Recent Movies Using Python UDF

Using Spark and PySpark

Overview of Spark
Demo: Implementing the WordCount Algorithm in PySpark
Overview of PySpark
- Using an Interactive Shell
- Implementing Self-Contained Applications
Working with Resilient Distributed Datasets (RDDs)
- Creating RDDs from a Python Collection
- Creating RDDs from Files
- Implementing RDD Transformations
- Implementing RDD Actions
Hands-on: Implementing a Text Search Program for Movie Titles with PySpark

Managing Workflow with Python

Overview of Apache Oozie and Luigi
Installing Luigi
Understanding Luigi Workflow Concepts
- Tasks
- Targets
- Parameters
Demo: Examining a Workflow that Implements the WordCount Algorithm
Working with Hadoop Workflows that Control MapReduce and Pig Jobs
- Using Luigi’s Configuration Files
- Working with MapReduce in Luigi
- Working with Pig in Luigi

Summary and Conclusion
