Hadoop Training Course

Course Summary

This 3-day training program gives developers hands-on, working knowledge of Hadoop that they can apply in their organizations. Hadoop is a software framework that supports data-intensive distributed applications. It lets applications work with thousands of nodes and petabytes of data without exposing the complexity of clustering to the end user.

Duration

3 days.

Objectives

This intensive course assumes no prior knowledge of Hadoop or big data concepts. It begins with an overview of MapReduce and the Hadoop ecosystem, then works its way to hands-on exploration with datasets and live clusters. The course covers common configuration mechanisms, tools, and debugging. Lastly, we compare different Hadoop distributions, including Amazon's Elastic MapReduce, Cloudera's CDH, and Yahoo's YDH.

Our outline follows Hadoop: The Definitive Guide, which is a good reading companion for the course. We supplement the hands-on training with slideshows and videos from real-world implementations.

Audience

Software developers interested in learning distributed systems concepts and MapReduce.

Prerequisites

Minimal exposure to database or data warehouse concepts. Some familiarity with launching scripts and with SQL. Prior exposure to AWS, or an existing AWS login, is preferred but not essential.

Outline

Hadoop: Overview

  • Move computation not data.
  • Hadoop performance and data scale facts.
  • Hadoop in the context of other data stores.
  • The Apache Hadoop Project.
  • Hadoop – an inside view: MapReduce and HDFS.
  • The Hadoop Ecosystem.
  • What about NoSQL?

MapReduce

  • Map and Reduce.
  • Java MapReduce.
  • Running a Distributed MapReduce Job.
  • Hadoop Streaming: Python.
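
As a taste of the Hadoop Streaming topic above, here is a minimal word-count sketch in Python. It is an illustration only, not course material: the function names (mapper, reducer, run_locally) are hypothetical, and the shuffle phase is imitated with an in-process sort so the logic can be checked without a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts for each word.

    Assumes records arrive sorted by key, which is what the framework's
    shuffle-and-sort phase provides to a streaming reducer.
    """
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Imitate map -> sort (stand-in for shuffle) -> reduce in one process."""
    return list(reducer(sorted(mapper(lines))))
```

On a real cluster the mapper and reducer would typically be two separate scripts that read from standard input and print to standard output, passed to the Hadoop Streaming jar through its -mapper, -reducer, -input and -output options.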

The Hadoop Distributed Filesystem

  • HDFS Design & Concepts
  • Blocks, Namenodes and Datanodes
  • The Command-Line Interface: hadoop fs
  • Basic Filesystem Operations
  • Reading Data from a Hadoop URL
  • Reading Data Using the FileSystem API
  • Data Flow: Anatomy of a File Read
  • Anatomy of a File Write
  • Coherency Model

How MapReduce Works

  • Anatomy of a MapReduce Job Run
  • Job Submission, Job Initialization, Task Assignment, Task Execution
  • Progress and Status Updates
  • Job Completion, Failures
  • Job Scheduling
  • Fair Scheduler
  • Shuffle and Sort - Map Side, Reduce Side
  • Configuration Tuning
  • Task Execution, Speculative Execution, Task JVM Reuse, Skipping Bad Records
  • The Task Execution Environment
  • Distributed Cache

Hadoop Administration

  • Setting Up a Hadoop Cluster
  • Cluster Specification
  • Network Topology
  • Cluster Setup and Installation
  • SSH Configuration
  • Hadoop Configuration
  • Configuration Management
  • Environment Settings
  • Important Hadoop Daemon Properties
  • Hadoop Daemon Addresses and Ports
  • Post Install
  • Benchmarking a Hadoop Cluster: TeraByte Sort on Apache Hadoop
  • Hadoop on Amazon EC2
  • Monitoring, Logging
  • Routine Administration Procedures
  • Commissioning and Decommissioning Nodes
  • Upgrades

Pig

  • Installing and Running Pig
  • Execution Types
  • Running Pig Programs
  • User-Defined Functions

Hive

  • Basic concepts.
  • HiveQL.
  • Serdes
  • Metastore

HBase

  • Concepts: Data Model, Schema Design
  • Test Drive
  • Clients: Java, REST and Thrift
  • Metrics