http://www.chinahadoop.cn/page/developer
What is a big data developer?
A big data developer is a system-level developer working on big data platforms. Such developers are familiar with the core frameworks of mainstream platforms such as Hadoop, Spark, and Storm; understand in depth how to write MapReduce jobs and manage job flows to carry out data computations; can apply the common algorithms Hadoop provides; master the key components of the Hadoop ecosystem, such as YARN, HBase, Hive, and Pig; and can implement platform monitoring and build auxiliary operations and maintenance systems.
By studying a range of big data platform development technologies such as Hadoop and Spark, developers acquire the tools and skills to design and build big data systems or platforms, along with the ability to deploy, develop, and manage distributed computing frameworks such as Hadoop and Spark cluster environments, including performance improvement, feature extension, and fault analysis.
Follow the developer path:
1. "Hadoop Big Data Platform Foundation"
- Learn to write the MapReduce programs required in a production environment
- Master the advanced APIs required for real-world data analysis
Week 1: Hadoop ecosystem overview and version evolution
Provides an overview of the Hadoop ecosystem and its version history, and gives recommendations for choosing a Hadoop version.
Week 2: HDFS 2.0 principles, features, and basic architecture
Introduces the principles and architecture of HDFS 2.0 and compares it with HDFS 1.0. Covers the new features of HDFS 2.0, including snapshots, caching, heterogeneous storage architecture, and more.
Week 3: YARN application scenarios, basic architecture, and resource scheduling
Introduces what YARN is, its basic principles and architecture, and analyzes its scheduling policies.
Week 4: MapReduce 2.0 fundamentals and architecture
Introduces the basic principles and architecture of the MapReduce computation framework.
Week 5: MapReduce 2.0 programming practice (multi-language programming)
Shows how to write MapReduce programs in Java, C++, PHP, and other languages.
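Multi-language MapReduce programming usually goes through Hadoop Streaming, which feeds each mapper lines on stdin and expects tab-separated key/value pairs on stdout, with the reducer receiving its input sorted by key. The following is a minimal pure-Python sketch of that word-count pattern; it simulates the shuffle locally rather than running inside an actual Hadoop cluster.

```python
# A Hadoop Streaming-style word count sketched in plain Python.
# The sorted() call stands in for Hadoop's shuffle/sort phase; a real
# Streaming job would read stdin and write tab-separated pairs to stdout.

def map_words(lines):
    """Mapper: emit (word, 1) for every word in the input lines."""
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reduce_counts(pairs):
    """Reducer: sum counts per key; assumes pairs arrive sorted by key,
    as the Hadoop shuffle phase guarantees."""
    current, total = None, 0
    for key, value in pairs:
        if key != current:
            if current is not None:
                yield (current, total)
            current, total = key, 0
        total += value
    if current is not None:
        yield (current, total)

lines = ["the quick brown fox", "the lazy dog"]
shuffled = sorted(map_words(lines))       # stand-in for Hadoop's shuffle/sort
counts = dict(reduce_counts(shuffled))
print(counts)                             # e.g. {'brown': 1, ..., 'the': 2}
```

The same mapper/reducer logic, wrapped in stdin/stdout handling, is what `hadoop jar hadoop-streaming.jar` would invoke for each task.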
Week 6: HBase application scenarios, principles, and basic architecture
Introduces HBase application scenarios, principles, and architecture.
Week 7: HBase programming practice (multi-language programming)
Hands-on guide to writing HBase client programs in Java, C++, Python, and other languages.
Week 8: HBase case studies
Introduces several typical HBase cases, including an internet application case and a banking application case.
Week 9: ZooKeeper deployment and typical applications
Describes what ZooKeeper is and its role in the Hadoop ecosystem.
Week 10: Hadoop data ingestion tools Flume and Sqoop
Describes how to use Flume and Sqoop to import external streaming data (such as site logs and user behavior data) and data from relational databases (such as MySQL and Oracle) into Hadoop for analysis and mining.
Week 11: Data analysis systems Hive and Pig, application and comparison
Describes how to use Hive and Pig to analyze massive amounts of data in Hadoop.
Week 12: Data mining toolkit Mahout
Describes how to use the data mining and machine learning algorithms provided by Mahout to mine massive data sets.
Week 13: Workflow engines Oozie and Azkaban
Describes how to use Oozie and Azkaban to manage and schedule MapReduce jobs, Pig/Hive jobs, and so on.
Week 14: Two comprehensive cases: a log analysis system and a machine learning platform
Introduces two typical internet application cases, giving further insight into where each system in the Hadoop ecosystem fits and how each solves practical problems.
2. "Big Data Pre-course Series: Scala"
- Learn Scala, the object-oriented programming language that Spark is built on
- Master the use of functional programming concepts within object-oriented programming
Week 1: Scala basics
- Declaration of values and variables
- Introduction to common types
- Definition and use of functions and methods
- Conditional expressions
- Loops and advanced for-loop usage
- Lazy values
- Default parameters, named parameters, and variable-length parameters
- Exception handling
- Array operations
- Map operations
Week 2: Scala object-oriented programming
- Class definition
- Class properties
- Primary constructor
- Auxiliary constructors
- Singleton objects (object)
- The apply method
- Class inheritance
- Method overriding and field overriding
- Abstract classes
- Traits
- Package definition and use
- Package object definition and use
- File access
Week 3: Scala functional programming
- Definition of higher-order functions
- Function values
- Anonymous functions
- Closures
- SAM conversions and currying
- Examples of higher-order functions
- Introduction to collections
- Sequences
- Mutable and immutable lists
- Collection operations
- Case classes
- Pattern matching
Week 4: Scala generics and implicit conversions
- Generic classes
- Generic functions
- Lower bounds and upper bounds
- View bounds
- Context bounds
- Covariance and contravariance
- Implicit conversions
- Implicit parameters
- Implicit classes
3. "Spark Big Data Platform Foundation"
- Learn memory-based batch and streaming data analysis methods
- Master how to optimize applications for speed and ease of use
Week 1: Spark ecosystem overview and programming model
- Spark ecosystem overview
- Hadoop MapReduce review
- Spark run modes
- RDD
- Introduction to the Spark runtime model
- Introduction to caching strategies
- Transformation
- Action
- Lineage
- Fault tolerance
- Wide dependencies and narrow dependencies
- Cluster configuration
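The key idea behind the transformation/action/lineage topics above is that RDD transformations are lazy: they only record lineage, and nothing executes until an action is called. The following pure-Python sketch illustrates that evaluation model; it is not the Spark API, and the `LazyDataset` class is a hypothetical stand-in for an RDD.

```python
# Pure-Python sketch of Spark's lazy evaluation model (not the Spark API).
# A dataset records its lineage of transformations; work is deferred until
# an action (collect/count) replays the lineage, mirroring RDD semantics.

class LazyDataset:
    def __init__(self, data, lineage=()):
        self._data = data
        self._lineage = lineage  # recorded transformations, like RDD lineage

    # Transformations: return a new dataset, defer the work.
    def map(self, fn):
        return LazyDataset(self._data, self._lineage + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._data, self._lineage + (("filter", pred),))

    # Actions: replay the lineage and materialize a result.
    def collect(self):
        out = self._data
        for op, fn in self._lineage:
            if op == "map":
                out = [fn(x) for x in out]
            else:  # "filter"
                out = [x for x in out if fn(x)]
        return out

    def count(self):
        return len(self.collect())

nums = LazyDataset(range(10))
evens_squared = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
# No work has happened yet; collect() triggers evaluation of the lineage.
print(evens_squared.collect())   # [0, 4, 16, 36, 64]
```

Because the lineage is kept rather than the intermediate results, a lost partition in real Spark can be recomputed by replaying exactly this kind of recorded chain, which is also how fault tolerance and narrow/wide dependency analysis work.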
Week 2: Deep dive into the Spark kernel
- Spark terminology explained
- Cluster overview
- Core components
- Data locality
- Common RDDs
- Task scheduling
- DAGScheduler
- TaskScheduler
- Task details
- Broadcast variables
- Accumulators
- Performance tuning
Week 3: Spark Streaming principles and practice
- DStream
- Data sources
- Stateless and stateful transformations
- Checkpointing
- Fault tolerance
- Performance optimization
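The stateless/stateful distinction above comes down to whether a result depends only on the current micro-batch or on state carried across batches. This pure-Python sketch shows a stateful word count in the spirit of Spark Streaming's updateStateByKey; it is a simulation, not the DStream API, and the batch data is made up for illustration.

```python
# Pure-Python sketch of a stateful streaming transformation, in the spirit
# of Spark Streaming's updateStateByKey (not the actual DStream API).
# Each micro-batch of (key, count) events is folded into a running state.

def update_state(state, batch):
    """Fold one micro-batch of (word, count) pairs into the running state."""
    new_state = dict(state)   # keep the old state immutable, as Spark does
    for word, count in batch:
        new_state[word] = new_state.get(word, 0) + count
    return new_state

batches = [
    [("error", 1), ("ok", 3)],   # micro-batch at t=0
    [("error", 2)],              # micro-batch at t=1
]

state = {}
for batch in batches:
    state = update_state(state, batch)
print(state)   # {'error': 3, 'ok': 3}
```

In real Spark Streaming the accumulated state is what checkpointing persists, so a restarted job can resume the fold instead of replaying the whole stream.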
Week 4: Shark principles and practice
- Data model
- Data types
- Shark architecture
- Shark deployment
- Cached (partitioned) tables
- SharkServer
- Combining Shark with Spark
Week 5: Machine learning on Spark
- Linear regression
- K-means
- Collaborative filtering
Week 6: Spark multi-language programming
- About Python
- PySpark API
- Writing Spark programs in Python
- Spark with Java
Week 7: Spark SQL
- Schemas and instances
- Parquet support
- DSL
- SQL on RDD
Week 8: Graph computation with GraphX
- Existing graph computation frameworks
- Table operators
- Graph operators
- GraphX design
Week 9: Spark on YARN
- Spark on YARN principles
- Spark on YARN practice
Week 10: JobServer
- Overall architecture
- API introduction
- Configuration and deployment
4. "Hadoop Advanced"
- Study MapReduce in depth, including job debugging and optimization methods
- Master HDFS in depth, including system-level operations and performance optimization methods
Part I: MapReduce
MapReduce workflow and basic architecture review
Operations and maintenance
- Parameter tuning
- Benchmarking
- JVM reuse
- Error awareness and speculative execution
- Task log analysis
- Setting the error tolerance percentage and skipping bad records
- Choosing alternative schedulers, such as the FairScheduler, to optimize performance
Development
- Data type selection
- Implementing custom Writable data types and custom keys
- Emitting different value types from a single mapper
- InputFormat/OutputFormat: principles and customization
- Using Mapper/Reducer/Combiner; the Combiner and its role in MapReduce framework optimization
- Custom Partitioners
- Sorting strategies: GroupingComparator/SortComparator
- Task scheduling principles and how to modify them (cases: shared map/reduce slots; scheduling map/reduce tasks precisely by identity)
- Streaming
- DistributedCache
- Dependencies between MapReduce jobs
- Counters
- Job child-process parameter settings
- Performance optimization
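To make the Partitioner topic above concrete: a Partitioner decides which reducer receives each intermediate key, and Hadoop's default HashPartitioner computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`. The sketch below mimics that logic in Python for illustration; Python's `hash` differs from Java's `hashCode`, so this shows the routing idea rather than reproducing Hadoop's exact bucket assignments.

```python
# Sketch of how a Hadoop Partitioner routes keys to reducers. Hadoop's
# default HashPartitioner does (hashCode & Integer.MAX_VALUE) % numReduceTasks;
# this pure-Python analogue demonstrates the idea, not the Java API.

def partition(key, num_reducers):
    """Assign a key to a reducer slot; equal keys always map to the same slot,
    which is what lets a reducer see all values for its keys."""
    return (hash(key) & 0x7FFFFFFF) % num_reducers

def group_by_partition(keys, num_reducers):
    """Show which reducer each key would land on."""
    buckets = {r: [] for r in range(num_reducers)}
    for key in keys:
        buckets[partition(key, num_reducers)].append(key)
    return buckets
```

A custom Partitioner replaces `partition` with domain logic, for example routing all keys of one customer to the same reducer, at the cost of possible skew if one bucket dominates.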
Part II: HDFS
HDFS API
FUSE (C API)
Compression
HDFS benchmarking
Adding and removing DataNodes
Multi-disk support and disk error awareness
HDFS RAID
Issues related to HDFS block size settings
File replication factor settings
Merging files in HDFS
Part III: Hadoop Tools
dfsadmin/mradmin/balancer/distcp/fsck/fs/job
Monitoring and alerting
Hadoop configuration management
Part IV: Hadoop Debugging
Logs
Debugging map/reduce tasks in local mode
Remote debugging
Part V: Problem Analysis
Introduction to Java GC, and common analysis tools for Java processes: jstat, jhat, jmap
top/iostat/netstat/lsof, etc.
jstack / kill -3
strace
nload/tcpdump
Part VI: Analysis Examples
Simple MapReduce analysis
Implementing group-by with MapReduce
Implementing an inverted index with MapReduce
Implementing a histogram with MapReduce
Implementing a join with MapReduce
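One of the examples listed above, the inverted index, can be sketched end to end with the map/shuffle/reduce pattern simulated in plain Python. This is an illustration of the algorithm, not a runnable Hadoop job, and the two sample documents are made up.

```python
# Pure-Python simulation of the inverted-index MapReduce job: the mapper
# emits (word, doc_id) pairs, a simulated shuffle groups pairs by word,
# and the reducer outputs each word's sorted list of documents.

from collections import defaultdict

def mapper(doc_id, text):
    """Emit (word, doc_id) for each word in the document."""
    for word in text.split():
        yield (word, doc_id)

def shuffle(pairs):
    """Stand-in for Hadoop's shuffle phase: group values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(word, doc_ids):
    """Deduplicate and sort the posting list for one word."""
    return (word, sorted(set(doc_ids)))

docs = {"d1": "big data platform", "d2": "data mining"}
pairs = [p for doc_id, text in docs.items() for p in mapper(doc_id, text)]
index = dict(reducer(w, ids) for w, ids in shuffle(pairs).items())
print(index)   # {'big': ['d1'], 'data': ['d1', 'd2'], ...}
```

The group-by, histogram, and join examples follow the same skeleton, differing only in what the mapper emits as the key and how the reducer folds the grouped values.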
5. "HBase Advanced"
- Learn to design a reasonable schema for massive data sets
- Master HBase performance optimization methods and usage scenarios
6. "SQL on Hadoop"
- Learn about Hive SQL parsing and performance optimization, Impala task generation, and more
- Master building an open data platform with SQL on Hadoop
7. "Hadoop/Spark Enterprise Application Practice"
- Learn how to use Hadoop and Spark in production systems
- Master solutions for integrating with existing enterprise BI platforms
The road to big data learning