http://www.chinahadoop.cn/page/developer
What is a big data developer?
A big data developer is a system-level developer working on big data platforms. Such developers are familiar with the core frameworks of mainstream platforms such as Hadoop, Spark, and Storm; understand in depth how to write MapReduce jobs and manage job flows to carry out data computations; can apply the common algorithms Hadoop provides; master the key components of the Hadoop ecosystem, such as YARN, HBase, Hive, and Pig; and can implement platform monitoring and build auxiliary operations and maintenance systems.
By studying a range of big data platform development technologies such as Hadoop and Spark, developers acquire the tools and skills to design and build big data systems or platforms, along with the ability to deploy, develop, and manage distributed computing frameworks such as Hadoop and Spark cluster environments, including performance improvement, feature extension, and fault analysis.
Follow the developer path:
1. "Hadoop Big Data Platform Foundation"
- Learn to write the MapReduce programs required in a production environment
- Master the advanced APIs required for real-world data analysis
Week 1: Hadoop ecosystem overview and version evolution
Provides an overview of the Hadoop ecosystem and its version history, and gives recommendations for choosing a Hadoop version.
Week 2: HDFS 2.0 principles, features, and basic architecture
Introduces the principles and architecture of HDFS 2.0 and compares it with HDFS 1.0. Covers the new features of HDFS 2.0, including snapshots, caching, heterogeneous storage architecture, and more.
Week 3: YARN application scenarios, basic architecture, and resource scheduling
Introduces what YARN is, its basic principles and architecture, and analyzes its scheduling policies.
Week 4: MapReduce 2.0 fundamentals and architecture
Introduces the basic principles and architecture of the MapReduce computation framework.
Week 5: MapReduce 2.0 programming practice (multi-language programming)
Shows how to write MapReduce programs in Java, C++, PHP, and other languages.
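Multi-language MapReduce programming usually goes through Hadoop Streaming, which feeds each mapper lines on stdin and expects tab-separated key/value pairs on stdout, with the reducer receiving its input sorted by key. The following is a minimal pure-Python sketch of that word-count pattern; it simulates the shuffle locally rather than running inside an actual Hadoop cluster.

```python
# A Hadoop Streaming-style word count sketched in plain Python.
# The sorted() call stands in for Hadoop's shuffle/sort phase; a real
# Streaming job would read stdin and write tab-separated pairs to stdout.

def map_words(lines):
    """Mapper: emit (word, 1) for every word in the input lines."""
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reduce_counts(pairs):
    """Reducer: sum counts per key; assumes pairs arrive sorted by key,
    as the Hadoop shuffle phase guarantees."""
    current, total = None, 0
    for key, value in pairs:
        if key != current:
            if current is not None:
                yield (current, total)
            current, total = key, 0
        total += value
    if current is not None:
        yield (current, total)

lines = ["the quick brown fox", "the lazy dog"]
shuffled = sorted(map_words(lines))       # stand-in for Hadoop's shuffle/sort
counts = dict(reduce_counts(shuffled))
print(counts)                             # e.g. {'brown': 1, ..., 'the': 2}
```

The same mapper/reducer logic, wrapped in stdin/stdout handling, is what `hadoop jar hadoop-streaming.jar` would invoke for each task.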
Week 6: HBase application scenarios, principles, and basic architecture
Introduces HBase application scenarios, principles, and architecture.
Week 7: HBase programming practice (multi-language programming)
Hands-on guide to writing HBase client programs in Java, C++, Python, and other languages.
Week 8: HBase case studies
Introduces several typical HBase cases, including an internet application case and a banking application case.
Week 9: ZooKeeper deployment and typical applications
Describes what ZooKeeper is and its role in the Hadoop ecosystem.
Week 10: Hadoop data ingestion tools Flume and Sqoop
Describes how to use Flume and Sqoop to import external streaming data (such as site logs and user behavior data) and data from relational databases (such as MySQL and Oracle) into Hadoop for analysis and mining.
Week 11: Data analysis systems Hive and Pig, application and comparison
Describes how to use Hive and Pig to analyze massive amounts of data in Hadoop.
Week 12: Data mining toolkit Mahout
Describes how to use the data mining and machine learning algorithms provided by Mahout to mine massive data sets.
Week 13: Workflow engines Oozie and Azkaban
Describes how to use Oozie and Azkaban to manage and schedule MapReduce jobs, Pig/Hive jobs, and so on.
Week 14: Two comprehensive cases: a log analysis system and a machine learning platform
Introduces two typical internet application cases, giving further insight into where each system in the Hadoop ecosystem fits and how each solves practical problems.
2. "Big Data Pre-course Series: Scala"
- Learn Scala, the object-oriented programming language that Spark is built on
- Master the use of functional programming concepts within object-oriented programming
Week 1: Scala basics
- Declaration of values and variables
- Introduction to common types
- Definition and use of functions and methods
- Conditional expressions
- Loops and advanced for-loop usage
- Lazy values
- Default parameters, named parameters, and variable-length parameters
- Exception handling
- Array operations
- Map operations
Week 2: Scala object-oriented programming
- Class definition
- Class properties
- Primary constructor
- Auxiliary constructors
- Singleton objects (object)
- The apply method
- Class inheritance
- Method overriding and field overriding
- Abstract classes
- Traits
- Package definition and use
- Package object definition and use
- File access
Week 3: Scala functional programming
- Definition of higher-order functions
- Function values
- Anonymous functions
- Closures
- SAM conversions and currying
- Examples of higher-order functions
- Introduction to collections
- Sequences
- Mutable and immutable lists
- Collection operations
- Case classes
- Pattern matching
Week 4: Scala generics and implicit conversions
- Generic classes
- Generic functions
- Lower bounds and upper bounds
- View bounds
- Context bounds
- Covariance and contravariance
- Implicit conversions
- Implicit parameters
- Implicit classes
3. "Spark Big Data Platform Foundation"
- Learn memory-based batch and streaming data analysis methods
- Master how to optimize applications for speed and ease of use
Week 1: Spark ecosystem overview and programming model
- Spark ecosystem overview
- Hadoop MapReduce review
- Spark run modes
- RDD
- Introduction to the Spark runtime model
- Introduction to caching strategies
- Transformation
- Action
- Lineage
- Fault tolerance
- Wide dependencies and narrow dependencies
- Cluster configuration
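The key idea behind the transformation/action/lineage topics above is that RDD transformations are lazy: they only record lineage, and nothing executes until an action is called. The following pure-Python sketch illustrates that evaluation model; it is not the Spark API, and the `LazyDataset` class is a hypothetical stand-in for an RDD.

```python
# Pure-Python sketch of Spark's lazy evaluation model (not the Spark API).
# A dataset records its lineage of transformations; work is deferred until
# an action (collect/count) replays the lineage, mirroring RDD semantics.

class LazyDataset:
    def __init__(self, data, lineage=()):
        self._data = data
        self._lineage = lineage  # recorded transformations, like RDD lineage

    # Transformations: return a new dataset, defer the work.
    def map(self, fn):
        return LazyDataset(self._data, self._lineage + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._data, self._lineage + (("filter", pred),))

    # Actions: replay the lineage and materialize a result.
    def collect(self):
        out = self._data
        for op, fn in self._lineage:
            if op == "map":
                out = [fn(x) for x in out]
            else:  # "filter"
                out = [x for x in out if fn(x)]
        return out

    def count(self):
        return len(self.collect())

nums = LazyDataset(range(10))
evens_squared = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
# No work has happened yet; collect() triggers evaluation of the lineage.
print(evens_squared.collect())   # [0, 4, 16, 36, 64]
```

Because the lineage is kept rather than the intermediate results, a lost partition in real Spark can be recomputed by replaying exactly this kind of recorded chain, which is also how fault tolerance and narrow/wide dependency analysis work.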
Week 2: Deep dive into the Spark kernel
- Spark terminology explained
- Cluster overview
- Core components
- Data locality
- Common RDDs
- Task scheduling
- DAGScheduler
- TaskScheduler
- Task details
- Broadcast variables
- Accumulators
- Performance tuning
Week 3: Spark Streaming principles and practice
- DStream
- Data sources
- Stateless and stateful transformations
- Checkpointing
- Fault tolerance
- Performance optimization
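The stateless/stateful distinction above comes down to whether a result depends only on the current micro-batch or on state carried across batches. This pure-Python sketch shows a stateful word count in the spirit of Spark Streaming's updateStateByKey; it is a simulation, not the DStream API, and the batch data is made up for illustration.

```python
# Pure-Python sketch of a stateful streaming transformation, in the spirit
# of Spark Streaming's updateStateByKey (not the actual DStream API).
# Each micro-batch of (key, count) events is folded into a running state.

def update_state(state, batch):
    """Fold one micro-batch of (word, count) pairs into the running state."""
    new_state = dict(state)   # keep the old state immutable, as Spark does
    for word, count in batch:
        new_state[word] = new_state.get(word, 0) + count
    return new_state

batches = [
    [("error", 1), ("ok", 3)],   # micro-batch at t=0
    [("error", 2)],              # micro-batch at t=1
]

state = {}
for batch in batches:
    state = update_state(state, batch)
print(state)   # {'error': 3, 'ok': 3}
```

In real Spark Streaming the accumulated state is what checkpointing persists, so a restarted job can resume the fold instead of replaying the whole stream.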
Week 4: Shark principles and practice
- Data model
- Data types
- Shark architecture
- Shark deployment
- Cached (partitioned) tables
- SharkServer
- Combining Shark with Spark
Week 5: Machine learning on Spark
- Linear regression
- K-means
- Collaborative filtering
Week 6: Spark multi-language programming
- About Python
- PySpark API
- Writing Spark programs in Python
- Spark with Java
Week 7: Spark SQL
- Schemas and instances
- Parquet support
- DSL
- SQL on RDD
Week 8: Graph computation with GraphX
- Existing graph computation frameworks
- Table operators
- Graph operators
- GraphX design
Week 9: Spark on YARN
- Spark on YARN principles
- Spark on YARN practice
Week 10: JobServer
- Overall architecture
- API introduction
- Configuration and deployment
4. "Hadoop Advanced"
- Study MapReduce in depth, including job debugging and optimization methods
- Master HDFS in depth, including system-level operations and performance optimization methods
Part I: MapReduce
MapReduce workflow and basic architecture review
Operations and maintenance
- Parameter tuning
- Benchmarking
- JVM reuse
- Error awareness and speculative execution
- Task log analysis
- Setting the error tolerance percentage and skipping bad records
- Choosing alternative schedulers, such as the FairScheduler, to optimize performance
Development
- Data type selection
- Implementing custom Writable data types and custom keys
- Emitting different value types from a single mapper
- InputFormat/OutputFormat: principles and customization
- Using Mapper/Reducer/Combiner; the Combiner and its role in MapReduce framework optimization
- Custom Partitioners
- Sorting strategies: GroupingComparator/SortComparator
- Task scheduling principles and how to modify them (cases: shared map/reduce slots; scheduling map/reduce tasks precisely by identity)
- Streaming
- DistributedCache
- Dependencies between MapReduce jobs
- Counters
- Job child-process parameter settings
- Performance optimization
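To make the Partitioner topic above concrete: a Partitioner decides which reducer receives each intermediate key, and Hadoop's default HashPartitioner computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`. The sketch below mimics that logic in Python for illustration; Python's `hash` differs from Java's `hashCode`, so this shows the routing idea rather than reproducing Hadoop's exact bucket assignments.

```python
# Sketch of how a Hadoop Partitioner routes keys to reducers. Hadoop's
# default HashPartitioner does (hashCode & Integer.MAX_VALUE) % numReduceTasks;
# this pure-Python analogue demonstrates the idea, not the Java API.

def partition(key, num_reducers):
    """Assign a key to a reducer slot; equal keys always map to the same slot,
    which is what lets a reducer see all values for its keys."""
    return (hash(key) & 0x7FFFFFFF) % num_reducers

def group_by_partition(keys, num_reducers):
    """Show which reducer each key would land on."""
    buckets = {r: [] for r in range(num_reducers)}
    for key in keys:
        buckets[partition(key, num_reducers)].append(key)
    return buckets
```

A custom Partitioner replaces `partition` with domain logic, for example routing all keys of one customer to the same reducer, at the cost of possible skew if one bucket dominates.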
Part II: HDFS
HDFS API
FUSE (C API)
Compression
HDFS benchmarking
Adding and removing DataNodes
Multi-disk support and disk error awareness
HDFS RAID
Issues related to HDFS block size settings
File replication factor settings
Merging files in HDFS
Part III: Hadoop Tools
dfsadmin/mradmin/balancer/distcp/fsck/fs/job
Monitoring and alerting
Hadoop configuration management
Part IV: Hadoop Debugging
Logs
Debugging map/reduce tasks in local mode
Remote debugging
Part V: Problem Analysis
Introduction to Java GC, and common analysis tools for Java processes: jstat, jhat, jmap
top/iostat/netstat/lsof, etc.
jstack / kill -3
strace
nload/tcpdump
Part VI: Analysis Examples
Simple MapReduce analysis
Implementing group-by with MapReduce
Implementing an inverted index with MapReduce
Implementing a histogram with MapReduce
Implementing a join with MapReduce
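One of the examples listed above, the inverted index, can be sketched end to end with the map/shuffle/reduce pattern simulated in plain Python. This is an illustration of the algorithm, not a runnable Hadoop job, and the two sample documents are made up.

```python
# Pure-Python simulation of the inverted-index MapReduce job: the mapper
# emits (word, doc_id) pairs, a simulated shuffle groups pairs by word,
# and the reducer outputs each word's sorted list of documents.

from collections import defaultdict

def mapper(doc_id, text):
    """Emit (word, doc_id) for each word in the document."""
    for word in text.split():
        yield (word, doc_id)

def shuffle(pairs):
    """Stand-in for Hadoop's shuffle phase: group values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(word, doc_ids):
    """Deduplicate and sort the posting list for one word."""
    return (word, sorted(set(doc_ids)))

docs = {"d1": "big data platform", "d2": "data mining"}
pairs = [p for doc_id, text in docs.items() for p in mapper(doc_id, text)]
index = dict(reducer(w, ids) for w, ids in shuffle(pairs).items())
print(index)   # {'big': ['d1'], 'data': ['d1', 'd2'], ...}
```

The group-by, histogram, and join examples follow the same skeleton, differing only in what the mapper emits as the key and how the reducer folds the grouped values.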
5. "HBase Advanced"
- Learn to design a reasonable schema for massive data sets
- Master HBase performance optimization methods and usage scenarios
6. "SQL on Hadoop"
- Learn about Hive SQL parsing and performance optimization, Impala task generation, and more
- Master building an open data platform with SQL on Hadoop
7. "Hadoop/Spark Enterprise Application Practice"
- Learn how to use Hadoop and Spark in production systems
- Master solutions for integrating with existing enterprise BI platforms
The road to big data learning