Advanced Big Data Applications Based on Hadoop 2.0 and YARN (Hadoop 2.0 / YARN / MapReduce / Data Mining / Hands-on Projects)
Course Category: Hadoop
Target audience: advanced learners
Course length: 81 hours
Technologies used: collaborative-filtering recommendation system, HBase-based crawler scheduling library
Related projects: bank RMB query system; HBase programming practice and case studies
Consulting QQ: 1840215592
Course Content Introduction
This course builds on the Hadoop portion of the introductory course "Big Data Solutions Based on the Greenplum + Hadoop Distributed Platform" and is divided into the following four parts:
First, the latest Hadoop 2.0 series and YARN are introduced, bringing you up to date with the cutting edge of the Hadoop framework.
Second, advanced MapReduce and HBase applications are covered through in-depth explanation and hands-on drills.
Third, Hadoop sub-projects not covered in the introductory course are presented, including Cassandra, Sqoop, Avatar, Mahout, Avro, Flume, etc.
Fourth, the course covers combining Hadoop with R, the basics of reading Hadoop source code, and a final comprehensive project.
Detailed introduction to the Hadoop 2.0 / YARN big data video tutorial: http://www.ibeifeng.com/goods-440.html
Prerequisites:
1. A working knowledge of Linux and Java
2. A basic knowledge of SQL
3. Completion of the Hadoop portion of the introductory course "Big Data Solutions Based on the Greenplum + Hadoop Distributed Platform"
Course Outline
Hadoop Advanced Application Course (81 hours)
Hadoop 2.0 (6 hours)
Background of Hadoop 2.0
Basic configuration of Hadoop 2.0
HDFS 2.0
MapReduce 2.0
Hadoop 2.0 Installation Configuration
Cluster testing
YARN Resource Management System (4 hours)
Background of YARN
YARN basic design ideas
YARN basic architecture
YARN workflow
YARN communication protocols
YARN fault tolerance
YARN resource scheduling mechanism
Computing Frameworks Supported by YARN (Storm, Tez, Spark) (11 hours)
YARN as the core of the ecosystem
Storm basic concepts
The Storm stream computing framework
YARN-based Storm architecture
Storm-on-YARN deployment
The Storm-on-YARN service
Apache Tez Introduction
Tez Features
Tez Data processing engine
DAGAppMaster implementation
Tez optimization mechanism
Tez Application Scenario
Tez deployment
What is Spark
The Spark ecosystem
Spark's core: RDDs and lineage
RDD storage, fault tolerance, internal design, and data model
The Spark scheduling framework
How to deploy Spark in a distributed manner
Mesos-based Spark mode
YARN-based Spark mode
Spark standalone-mode deployment
Spark YARN-mode deployment
MapReduce Multi-language programming (5 hours)
MapReduce Programming Interface
Java programming interface: example walkthrough
How Hadoop Streaming is implemented
Hands-on Hadoop Streaming programming (C++, PHP, Python)
Analysis of Hadoop Streaming internals
Programming examples for Hadoop Pipes
Analysis of Hadoop Pipes internals
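The Streaming items above boil down to one contract: Hadoop pipes input lines to your mapper's stdin, sorts the mapper's output by key, and pipes the sorted lines to your reducer. A minimal word-count sketch of that contract in Python (the in-memory demo and chaining are illustrative; in a real job the mapper and reducer would be two separate scripts passed via `-mapper` and `-reducer`):

```python
# Word-count sketch of the Hadoop Streaming contract: the mapper emits
# "word<TAB>1" pairs, the framework sorts them by key, and the reducer
# sums the counts for each word.
from itertools import groupby

def map_lines(lines):
    """Mapper: split each input line into (word, 1) pairs."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reduce_pairs(pairs):
    """Reducer: pairs arrive sorted by key; sum the counts per word."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # In-memory stand-in for stdin; the sort mimics the shuffle phase.
    sample = ["the quick brown fox", "the lazy dog"]
    for word, count in reduce_pairs(sorted(map_lines(sample))):
        print(f"{word}\t{count}")
```

A real run would submit the two scripts with the `hadoop-streaming` jar; the functions themselves are language-agnostic, which is why the course can cover C++, PHP, and Python variants of the same pattern.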
MapReduce Advanced Implementations (14 hours)
Complex MapReduce applications
K-means clustering, Bayesian classification, etc.
Workflow programming: examples and principles
JobControl, ChainMapper/ChainReducer
Hadoop workflow engines
Common MapReduce optimization techniques
Configuring multiple reducers
Setting the processing format for streams
Controlling input split size
Avoiding splits
Input formats: text input, multiple input types
Output control: multiple outputs, lazy output
Hands-on: data partitioning
MapReduce Advanced Features
Counters, built-in counters
Example: user-defined counters
Implementing partial sort in MapReduce
Example: MapReduce total sort
TeraSort algorithm analysis
Example: MapReduce secondary sort
Implementing joins; map-side joins
Example: reduce-side join
Join types and join strategies
Implementation of the repartition join framework
Implementation of the replicated join framework
Example: semi-join
Passing global job parameters and data files
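The reduce-side join listed above works by having the mapper tag each record with its source table, so that after the shuffle groups records by join key, the reducer can recombine them. A pure-Python simulation of the idea (the tables, fields, and values are invented for illustration):

```python
# Reduce-side join sketch: tag records by source in the "map" phase,
# group by join key, and pair users with their orders in the "reduce"
# phase. This simulates in memory what the MapReduce shuffle does.
from collections import defaultdict

def map_tagged(users, orders):
    """Map phase: emit (join_key, (tag, value)) for every record."""
    for user_id, name in users:
        yield user_id, ("user", name)
    for user_id, amount in orders:
        yield user_id, ("order", amount)

def reduce_join(pairs):
    """Reduce phase: for each key, pair every user with every order."""
    grouped = defaultdict(list)
    for key, tagged in pairs:
        grouped[key].append(tagged)
    for key, records in sorted(grouped.items()):
        names = [v for tag, v in records if tag == "user"]
        amounts = [v for tag, v in records if tag == "order"]
        for name in names:
            for amount in amounts:
                yield key, name, amount

users = [(1, "alice"), (2, "bob")]
orders = [(1, 250), (1, 80), (2, 40)]
joined = list(reduce_join(map_tagged(users, orders)))
```

The map-side and replicated joins in the outline avoid this shuffle entirely by loading the smaller table into memory on each mapper, which is the main trade-off the join-strategy lessons compare.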
HBase programming practices and case studies (10 hours)
HBase fundamentals
HBase Java Programming Instance
HBase Multi-language programming
Thrift Installation, service configuration
HBase C++ programming example
HBase Python programming example
HBase MapReduce programming basics
Hands-on: HBase MapReduce programming
HBase case study: an OpenTSDB implementation
An HBase-based crawler scheduling library
An HBase-based crawler index library
Bank RMB query system
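A design question that comes up in HBase case studies like the crawler libraries above is row-key layout, since HBase sorts rows lexicographically by key. A common pattern is to reverse the domain so all pages from one site sort adjacently; the helper below is a hypothetical sketch of that pattern, not code from any library covered in the course:

```python
# Reversed-domain row-key sketch for an HBase-backed crawler table:
# "http://www.example.com/news/1.html" -> "com.example.www/news/1.html"
# so that rows for one domain share a prefix and cluster together.
from urllib.parse import urlparse

def crawl_row_key(url: str) -> str:
    """Build an HBase row key from a URL by reversing the hostname."""
    parts = urlparse(url)
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    return f"{reversed_host}{parts.path or '/'}"

key = crawl_row_key("http://www.example.com/news/1.html")
```

Clustering by domain makes per-site scheduling scans cheap, at the cost of potentially hot-spotting a single region when one site dominates the crawl.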
Sqoop (6 hours)
Sqoop background and basics
Sqoop 1 and Sqoop 2 architecture and features
Sqoop 1 installation and configuration (version 1.4.4)
Introduction to Sqoop import
Hands-on: importing data from MySQL to HDFS
Hands-on: importing data from MySQL to Hive
Introduction to Sqoop export
Hands-on: exporting Hive data to MySQL
Using Sqoop with HBase
Sqoop job operations
Sqoop job security configuration
Sqoop 2 installation and configuration (version 1.99.3)
Comprehensive hands-on practice with Sqoop 2
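The import/export drills above typically use Sqoop 1 invocations of the following shape (the host, database, table, and directory names here are placeholders, and a real run needs a reachable cluster and MySQL instance):

```
# Import a MySQL table into HDFS (Sqoop 1 syntax; names are placeholders)
sqoop import \
  --connect jdbc:mysql://dbhost/shop \
  --username sqoop_user -P \
  --table orders \
  --target-dir /user/hadoop/orders \
  -m 1

# Export processed data back into MySQL
sqoop export \
  --connect jdbc:mysql://dbhost/shop \
  --username sqoop_user -P \
  --table orders_summary \
  --export-dir /user/hive/warehouse/orders_summary
```

Adding `--hive-import` to the import command loads the data into a Hive table instead of a plain HDFS directory, which is the variant the MySQL-to-Hive drill exercises.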
Flume Log Collection system (7 hours)
Flume Concepts and Features
Flume OG architecture, components, characteristics, and fault-tolerance design
Comprehensive comparison of log collection systems
Flume NG architecture and core concepts
Flume OG installation
Flume OG configuration (web UI, Flume shell)
Flume NG installation, configuration, and testing
Flume NG module configuration (Source, Channel, Sink)
Flume NG configuration: hands-on analysis
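A Flume NG agent of the Source/Channel/Sink shape listed above is wired together in a properties file. A minimal sketch (the agent name `a1`, log path, and HDFS path are placeholders) that tails a log file into HDFS:

```
# Minimal Flume NG agent: exec source -> memory channel -> HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
a1.sinks.k1.channel = c1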
Avro Data Serialization System (1 hour)
Avro introduction
Avro characteristics and main functions
Using Avro for RPC
Differences between Avro and other serialization systems
Mahout Data Mining Tools (10 hours)
Data mining concepts and system composition
Common data mining methods and algorithms (regression analysis, classification, clustering, etc.)
Data mining analysis tools
Algorithms supported by Mahout
Origins and characteristics of Mahout
Mahout installation, configuration, and testing
Hands-on: k-means cluster analysis with Mahout
Implementing the Canopy algorithm with Mahout
Implementing classification algorithms with Mahout
Hands-on: logistic regression classification with Mahout
Hands-on: naive Bayes classification with Mahout
Recommendation systems: concepts and classification
Collaborative filtering: concepts, classification, and applications
Hands-on: building a Mahout-based movie recommendation system
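The user-based collaborative filtering idea behind the movie-recommendation project can be shown in a few lines: score an unseen item for a user by other users' ratings, weighted by how similar those users are. A toy pure-Python sketch (the course uses Mahout; the ratings data here is made up):

```python
# User-based collaborative filtering sketch: cosine similarity between
# users' rating vectors, then a similarity-weighted rating average.
import math

def cosine(a, b):
    """Cosine similarity of two sparse rating dicts {item: rating}."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for item."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        sim = cosine(ratings[user], their)
        num += sim * their[item]
        den += sim
    return num / den if den else 0.0

ratings = {
    "u1": {"m1": 5, "m2": 3},
    "u2": {"m1": 4, "m2": 2, "m3": 4},
    "u3": {"m1": 1, "m3": 2},
}
score = predict(ratings, "u1", "m3")
```

Because u2's tastes resemble u1's much more than u3's do, the predicted score for movie m3 lands near u2's rating of 4 rather than u3's rating of 2; Mahout's Taste recommenders apply this same weighting at scale.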
Hadoop Comprehensive Project: Text Mining (7 hours)
Text mining concepts and application scenarios
Project background
Project Flow
Chinese Word segmentation technology
Using the Paoding (庖丁解牛) Chinese word segmenter
Design and implementation of a parallel MapReduce word segmentation program
Partitioning the dataset with Pig
Building a naive Bayes text classifier with Mahout
Applying the model: computing users' preferred categories
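The classifier step of the project rests on multinomial naive Bayes: pick the label maximizing the prior times the per-token likelihoods, with Laplace smoothing for unseen words. A tiny pure-Python version of that math (the course builds this with Mahout; the training documents here are invented):

```python
# Multinomial naive Bayes sketch with Laplace smoothing, computed in
# log space to avoid underflow on longer documents.
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (label, tokens). Returns (priors, counts, vocab)."""
    class_docs = Counter(label for label, _ in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def classify(model, tokens):
    """Return the label with the highest log posterior."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for tok in tokens:
            # Laplace-smoothed token likelihood
            p = (word_counts[label][tok] + 1) / (total_words + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train([
    ("sports", ["match", "goal", "team"]),
    ("sports", ["team", "score"]),
    ("finance", ["stock", "market", "price"]),
])
label = classify(model, ["team", "goal"])
```

In the project pipeline, the tokens would come from the Chinese word segmentation step, and the predicted labels feed the final per-user preference calculation.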