Big Data Revolution Training Course

Course Summary

This is a fast-paced technical overview of the NoSQL landscape. No prior knowledge of databases or programming is assumed. This survey course is aimed at both technical and non-technical people who want to understand the emerging world of Big Data. For each sub-topic, the instructor will provide links and resource recommendations (for example, YouTube videos, books, and blog posts) for students who want to explore that area further. The course ends with an open 30-minute Q&A session in which any topic in the world of Big Data can be explored further. Students will receive a ~100-page slide deck to use as reference material after the course, along with a PDF printout of the specific examples shown in the 5 live demos, so interested students can re-run the demos in their own time.

Duration

1 day.

Objectives

  • Introduce students to the core concepts of Big Data
  • Provide a general overview of the most common NoSQL stores
  • Explain how to choose the correct NoSQL database for specific use cases
  • Dive deep into the architecture of Hadoop (HDFS/MapReduce) and Cassandra
  • Give a general overview of the architecture of MongoDB and Neo4J

Audience

Engineers, Programmers, Networking specialists, Managers, Executives

Outline

  • Session 1: The Dawn of Big Data (1.5 hours: 9am to 10:30am)
    • Reality or Hype?
    • The data deluge: the generators of Big Data
    • The limitations of SQL/RDBMS
    • Parallel Computing vs. Distributed Computing
    • The white papers that started it all: GFS, MapReduce, BigTable, Dynamo
    • CAP Theorem: Consistency, Availability, Partition Tolerance
    • NoSQL flavors: Key-Value stores (Voldemort, Dynamo), Key-Data stores (Redis), Key-Document stores (CouchDB, MongoDB, Riak), Column Family stores (BigTable, HBase, Cassandra), Graph stores (Neo4J, HyperGraphDB)
    • Use cases for various NoSQL databases  
  • Session 2: Hadoop Deep Dive (2 hours: 10:45am to noon, 1pm to 1:45pm)
    • Masters & Slaves
    • Example Hadoop clusters at Yahoo! and Facebook
    • Vendor comparison: Cloudera vs. Hortonworks
    • HDFS Architecture: NameNode + DataNodes
    • HDFS HA: What’s the real deal? SecondaryNameNode? StandbyNameNode?
    • Write pipeline
    • Read pipeline
    • Heartbeats and Rack Awareness
    • HDFS next-gen
    • 15 mins Live Demo: exploring the HDFS command line and HDFS web GUI, loading data into HDFS, pulling data from HDFS
    • MapReduce Architecture: JobTracker + TaskTrackers
    • Data Locality
    • MapReduce details: keys/values, shuffle/sort, combiner, partitioner
    • Thinking the MapReduce way
    • MapReduce next-gen: YARN, ResourceManager, NodeManager, ApplicationMaster
    • 15 mins Live Demo: Submitting a MapReduce word count job, monitoring it via the web GUI and reading its results
    • The most important Hadoop ecosystem projects: Hive, Pig, Oozie, HBase, Mahout, Sqoop, Talend
  • Session 3: Cassandra Deep Dive (1 hour: 2pm to 3pm)
    • Ring Architecture
    • Gossip protocol, anti-entropy, failure detection and recovery
    • Data partitioning: one data center vs. multiple data centers, random partitioner vs. ordered partitioner, hot spots, load balancing
    • Replica placement strategy and Snitches: SimpleStrategy, NetworkTopologyStrategy
    • Write pipeline
    • Read pipeline (Bloom filters)
    • Hadoop Integration
    • 15 mins Live Demo: exploring the Cassandra CLI, creating a keyspace and column family, inserting rows and columns, reading cells, indexing a column, deleting rows and columns, dropping a column family and keyspace
  • Session 4: MongoDB overview (30 mins: 3pm to 3:30pm)
    • Document-oriented storage: BSON specification, Indexing
    • MongoDB architecture: Replication, HA, Auto-sharding
    • Wire protocol: Communication stream
    • Inserting a document
    • Querying a Collection
    • MapReduce
    • 10 mins Live Demo: manipulating data from the MongoDB shell, creating a database connection, inserting data into a collection, accessing data via a query
  • Session 5: Neo4J overview (30 mins: 3:30pm to 4:00pm)
    • Use cases for graph databases
    • Neo4J architecture and fundamentals
    • Nodes and Relationships
    • Querying a graph with a traversal
    • Index lookups
    • 10 mins Live Demo: manipulating data using Cypher, adding a social network to the graph, querying the social network and relationships between people
  • Session 6: New Horizons (30 mins: 4pm to 4:30pm)
    • How to select a NoSQL database for your use case
    • The Big Data job market: hard facts
    • Recommendations on how to hire talent
    • Google has left the party: Colossus and Caffeine
    • The NSA has entered the room: Accumulo
    • The rise of machine learning: recommendation, clustering, classification, frequent item set mining
    • Thoughts on the future of the information age
  • Session 7: Q & A (30 mins: 4:30pm to 5pm)
    • An open classroom discussion on whatever topics students want to discuss