The Road to Big Data Learning


Source: http://www.chinahadoop.cn/page/developer

What is a big data developer?

A big data developer is a system-level developer working around the big data platform. They are familiar with the core frameworks of mainstream big data platforms such as Hadoop, Spark, and Storm; understand in depth how to write MapReduce jobs and manage job flows to carry out data computation; can apply the common algorithms Hadoop provides; master the important components of the Hadoop ecosystem such as YARN, HBase, Hive, and Pig; and can implement platform monitoring and auxiliary operations-and-maintenance systems.

By studying a range of big data platform development technologies such as Hadoop and Spark, developers acquire the tools and skills to design and develop big data systems or platforms, and learn to deploy, develop, and manage distributed computing frameworks such as Hadoop and Spark cluster environments, including performance tuning, feature extension, and fault analysis.

Follow the developer path:

1, "Hadoop Big Data Platform Foundation"

    • Learn to write the MapReduce programs required in a production environment
    • Master the advanced APIs required for real-world data analysis

1st Week Hadoop Ecosystem Overview and Version Evolution
Provides an overview of the Hadoop ecosystem and its version evolution history, and gives recommendations for Hadoop version selection.

2nd Week HDFS 2.0 principle, characteristics and basic architecture
Introduces the principles and architecture of HDFS 2.0 and compares it with HDFS 1.0. Covers the new features of HDFS 2.0, including snapshots, caching, heterogeneous storage architecture, and more.

3rd Week YARN Application Scenarios, Basic Architecture, and Resource Scheduling
Introduces what YARN is, its basic principles and architecture, and analyzes its scheduling strategies.

4th Week MapReduce 2.0 Fundamentals and Architecture
Introduces the basic principles and architecture of the MapReduce computation framework.

5th Week MapReduce 2.0 Programming Practice (multi-language programming)
How to write MapReduce programs in Java, C++, PHP, and other languages.
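
The course walks through Java, C++, and PHP versions; for a quick flavor, here is a minimal word-count sketch in Scala against the same Hadoop MapReduce Java API (class names and paths are illustrative, not from the course):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
    import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
    import scala.collection.JavaConverters._

    // Mapper: emit (word, 1) for every token in the input line.
    class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
      private val one  = new IntWritable(1)
      private val word = new Text()
      override def map(key: LongWritable, value: Text,
                       ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
        value.toString.split("\\s+").filter(_.nonEmpty).foreach { t =>
          word.set(t); ctx.write(word, one)
        }
    }

    // Reducer (also used as combiner): sum the counts for each word.
    class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
      override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                          ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit =
        ctx.write(key, new IntWritable(values.asScala.map(_.get).sum))
    }

    object WordCount {
      def main(args: Array[String]): Unit = {
        val job = Job.getInstance(new Configuration(), "word count")
        job.setJarByClass(getClass)
        job.setMapperClass(classOf[TokenizerMapper])
        job.setCombinerClass(classOf[SumReducer])
        job.setReducerClass(classOf[SumReducer])
        job.setOutputKeyClass(classOf[Text])
        job.setOutputValueClass(classOf[IntWritable])
        FileInputFormat.addInputPath(job, new Path(args(0)))     // input directory
        FileOutputFormat.setOutputPath(job, new Path(args(1)))   // output directory
        System.exit(if (job.waitForCompletion(true)) 0 else 1)
      }
    }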

6th Week HBase Application Scenarios, Principles, and Basic Architecture
Introduces HBase application scenarios, principles, and architecture.

7th Week HBase Programming practice (involving multi-language programming)
Hands-on practice writing HBase client programs in Java, C++, Python, and other languages.
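
For reference, a minimal HBase client sketch in Scala on top of the HBase Java client API (HBase 1.x-style calls; the table name, column family, and ZooKeeper quorum are placeholders):

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
    import org.apache.hadoop.hbase.util.Bytes

    object HBaseClientDemo {
      def main(args: Array[String]): Unit = {
        val conf = HBaseConfiguration.create()          // reads hbase-site.xml from the classpath
        conf.set("hbase.zookeeper.quorum", "localhost") // placeholder ZooKeeper quorum
        val connection = ConnectionFactory.createConnection(conf)
        try {
          val table = connection.getTable(TableName.valueOf("user_actions")) // hypothetical table
          // Write one cell: row "row1", column family "cf", qualifier "clicks".
          val put = new Put(Bytes.toBytes("row1"))
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("clicks"), Bytes.toBytes("42"))
          table.put(put)
          // Read it back.
          val result = table.get(new Get(Bytes.toBytes("row1")))
          val clicks = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("clicks")))
          println(s"clicks = $clicks")
          table.close()
        } finally {
          connection.close()
        }
      }
    }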

8th Week HBase Case analysis
Introduces several typical HBase cases, including Internet and banking application cases.

9th Week Zookeeper deployment and typical applications
Describes what ZooKeeper is and the role it plays in the Hadoop ecosystem.

10th Week Hadoop Data Ingestion Tools Flume and Sqoop
Describes how to use Flume and Sqoop to import external streaming data (such as site logs and user behavior data) and data from relational databases (such as MySQL and Oracle) into Hadoop for analysis and mining.

11th Week Data Analysis Systems Hive and Pig: Application and Comparison
Describes how to use Hive and Pig to analyze massive amounts of data stored in Hadoop.

12th Week Data Mining Toolkit Mahout
Describes how to use the data mining and machine learning algorithms provided by Mahout to mine massive data sets.

13th Week Workflow Engine Oozie and Azkaban application
Describes how to use Oozie and Azkaban to manage and schedule MapReduce jobs, Pig/Hive jobs, and more.

14th Week Two comprehensive cases: Log analysis system and machine learning platform
Introduces two typical Internet application cases, giving further insight into the application scenarios of each system in the Hadoop ecosystem and how they solve practical problems.

2, "Big Data pre-course series--scala"

    • Learn Scala, the object-oriented language that Spark is built on
    • Master the use of functional programming concepts within object-oriented programming

First week: Scala basics (a short sketch follows the list)

      • Declaration of values and variables
      • Introduction to common types
      • Definition and use of functions and methods
      • Conditional expressions
      • Loops and advanced for-loop usage
      • Lazy values
      • Default parameters, named parameters, and variable-length parameters
      • Exception handling
      • Array operations
      • Map operations
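
A minimal sketch touching several of the week-one topics above (values vs. variables, default parameters, conditional expressions, for loops, lazy values, exception handling, arrays, and maps):

    object ScalaBasics {
      def main(args: Array[String]): Unit = {
        val greeting = "hello"          // immutable value
        var counter  = 0                // mutable variable
        counter += 1

        // Function with a default parameter.
        def area(width: Double, height: Double = 1.0): Double = width * height

        // Conditional expression: if/else returns a value.
        val label = if (area(3, 2) > 5) "big" else "small"

        // Advanced for loop: two generators and a guard.
        for (i <- 1 to 3; j <- 1 to 3 if i != j) println(s"$i,$j")

        // Lazy value: evaluated on first access only.
        lazy val expensive = { println("computing"); 42 }

        // Exception handling as an expression.
        val parsed = try args(0).toInt catch { case _: Exception => 0 }

        // Arrays and maps.
        val nums  = Array(1, 2, 3).map(_ * 2)
        val ports = Map("hdfs" -> 8020, "yarn" -> 8032)
        println(s"$greeting $counter $label ${nums.sum} ${ports("hdfs")} $parsed $expensive")
      }
    }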

Second week: Scala object-oriented programming (a short sketch follows the list)

      • Class definition
      • Class properties
      • Primary constructor
      • Auxiliary constructors
      • Singleton objects (object)
      • The apply method
      • Class inheritance
      • Method and field overriding
      • Abstract classes
      • Traits
      • Package definition and use
      • Package object definition and use
      • File access
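
A short sketch of the object-oriented features listed above (primary and auxiliary constructors, a companion object with apply, inheritance, overriding, an abstract class, and a trait); the class names are made up:

    // Abstract class with an abstract member.
    abstract class Node(val host: String) {
      def role: String
    }

    // Trait mixed in for extra behavior.
    trait Monitored {
      def heartbeat(): String = s"OK at ${System.currentTimeMillis()}"
    }

    // Class with a primary constructor (host, port) and an auxiliary constructor.
    class DataNode(host: String, val port: Int) extends Node(host) with Monitored {
      def this(host: String) = this(host, 50010)   // auxiliary constructor with a default port
      override def role: String = "datanode"       // implements the abstract method
      override def toString: String = s"$role@$host:$port"
    }

    // Companion object: apply() lets callers skip `new`.
    object DataNode {
      def apply(host: String): DataNode = new DataNode(host)
    }

    object OopDemo extends App {
      val dn: Node = DataNode("worker-01")            // uses apply()
      println(dn)                                     // datanode@worker-01:50010
      println(DataNode("worker-02").heartbeat())      // trait method
    }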

Third week: functional programming and collections (a short sketch follows the list)

      • Definition of higher-order functions
      • Function values
      • Anonymous functions
      • Closures
      • SAM conversions and currying
      • Examples of higher-order functions
      • Introduction to collections
      • Sequences
      • Mutable and immutable lists
      • Collection operations
      • Case classes
      • Pattern matching
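
A short sketch of the functional topics above (higher-order functions, anonymous functions, closures, currying, case classes, and pattern matching):

    object FunctionalDemo extends App {
      // Higher-order function: takes another function as a parameter.
      def applyTwice(f: Int => Int, x: Int): Int = f(f(x))
      println(applyTwice(_ + 3, 10))                  // anonymous function, prints 16

      // Closure: `factor` is captured from the enclosing scope.
      var factor = 2
      val scale: Int => Int = _ * factor

      // Curried function, partially applied.
      def add(a: Int)(b: Int): Int = a + b
      val addFive = add(5) _
      println(addFive(scale(10)))                     // 25

      // Case class and pattern matching over a collection.
      case class Job(name: String, slots: Int)
      val jobs = List(Job("etl", 4), Job("report", 1))
      jobs.foreach {
        case Job(n, s) if s > 2 => println(s"$n is large")
        case Job(n, _)          => println(s"$n is small")
      }
    }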

Fourth week: generics and implicits (a short sketch follows the list)

      • Generic classes
      • Generic functions
      • Lower and upper bounds
      • View bounds
      • Context bounds
      • Covariance and contravariance
      • Implicit conversions
      • Implicit parameters
      • Implicit classes
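
A brief sketch of the generics and implicits topics above (a generic class, an upper bound, an implicit parameter, and an implicit class that adds an extension method):

    object GenericsAndImplicits extends App {
      // Generic class.
      class Box[A](val value: A)

      // Generic function with an upper bound: A must be Comparable.
      def biggest[A <: Comparable[A]](a: A, b: A): A =
        if (a.compareTo(b) >= 0) a else b
      println(biggest("hdfs", "yarn"))          // "yarn"

      // Implicit parameter: the compiler fills in `sep` from implicit scope.
      implicit val sep: String = ","
      def join(items: Seq[String])(implicit sep: String): String = items.mkString(sep)
      println(join(Seq("map", "reduce")))       // "map,reduce"

      // Implicit class: adds a `toPath` extension method to String.
      implicit class RichString(s: String) {
        def toPath: String = if (s.startsWith("/")) s else "/" + s
      }
      println("user/logs".toPath)               // "/user/logs"
      println(new Box("rdd").value)
    }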

3, "Spark Big Data Platform Foundation"

    • Learn about memory-based batch and streaming data analysis methods
    • Master how to optimize applications so they are fast and easy to use

First week: Spark ecosystem overview and programming model (a short RDD sketch follows the topic list)

      • Spark ecosystem overview
      • Review of Hadoop MapReduce
      • Spark run modes
      • RDD
      • Introduction to the Spark runtime model
      • Introduction to caching policies
      • Transformations
      • Actions
      • Lineage
      • Fault tolerance
      • Wide and narrow dependencies
      • Cluster configuration
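
A minimal RDD sketch covering transformations, actions, caching, and lineage, written against the Spark Scala API (the input path and app name are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    object RddDemo {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("rdd-demo").setMaster("local[2]") // local run mode
        val sc   = new SparkContext(conf)

        // Each transformation extends the lineage graph lazily.
        val lines  = sc.textFile("input.txt")        // placeholder path
        val counts = lines
          .flatMap(_.split("\\s+"))        // transformation
          .filter(_.nonEmpty)              // transformation
          .map(word => (word, 1))          // transformation
          .reduceByKey(_ + _)              // transformation (wide dependency: shuffle)
          .cache()                         // keep the result in memory for reuse

        // Actions trigger the actual computation.
        println(s"distinct words: ${counts.count()}")
        counts.take(10).foreach(println)

        sc.stop()
      }
    }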


Second week: inside the Spark kernel (a sketch of broadcast variables and accumulators follows the list)

      • Spark Terminology Explained
      • Cluster overview
      • Core components
      • Data locality
      • Common RDDs
      • Task scheduling
      • DAGScheduler
      • TaskScheduler
      • Task Details
      • Broadcast variables
      • Accumulator
      • Performance tuning


Third week: Spark Streaming principles and practice (a short sketch follows the list)

      • DStream
      • Data source
      • Stateless transformation and stateful transformation
      • Checkpoint
      • Fault tolerance
      • Performance optimization


Fourth week: Shark principles and practice

      • Data model
      • Data type
      • Shark architecture
      • Shark deployment
      • Cache (partition) table
      • SharkServer
      • Shark combined with Spark


Fifth week: machine learning on Spark (a short MLlib sketch follows the list)

      • Linear regression
      • K-means
      • Collaborative Filtering


Sixth week: Spark multi-language programming

      • About Python
      • PySpark API
      • Writing Spark programs using Python
      • Spark with Java


Seventh week: Spark SQL (a short sketch follows the list)

      • Schemas and instances
      • Parquet Support
      • DSL
      • SQL on RDD
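
A brief sketch touching schemas, Parquet support, the DataFrame DSL, and SQL over the same data. It uses the newer SparkSession entry point rather than the original SQLContext API; the ideas are the same (paths and column names are placeholders):

    import org.apache.spark.sql.SparkSession

    // Row type for the example data (hypothetical page-view records).
    case class PageView(user: String, url: String, ms: Long)

    object SparkSqlDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("sql-demo").master("local[2]").getOrCreate()
        import spark.implicits._

        val views = Seq(
          PageView("u1", "/home", 120),
          PageView("u2", "/item", 340),
          PageView("u1", "/item", 80)).toDF()

        // Parquet support: write and read back a columnar file.
        views.write.mode("overwrite").parquet("/tmp/views.parquet")
        val loaded = spark.read.parquet("/tmp/views.parquet")

        // DSL query.
        loaded.groupBy($"user").count().show()

        // SQL on the same data via a temporary view.
        loaded.createOrReplaceTempView("views")
        spark.sql("SELECT url, avg(ms) AS avg_ms FROM views GROUP BY url").show()

        spark.stop()
      }
    }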


Eighth week: graph computation with GraphX (a short sketch follows the list)

      • Existing graph computation frameworks
      • Table operators
      • Graph operators
      • GraphX design
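
A small GraphX sketch: build a property graph from vertex and edge RDDs, apply a graph operator, and run PageRank (the vertices and edges are made up):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.{Edge, Graph}

    object GraphxDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("graphx-demo").setMaster("local[2]"))

        // Vertices: (id, attribute); edges: (src, dst, attribute).
        val vertices = sc.parallelize(Seq(
          (1L, "alice"), (2L, "bob"), (3L, "carol")))
        val edges = sc.parallelize(Seq(
          Edge(1L, 2L, "follows"),
          Edge(2L, 3L, "follows"),
          Edge(3L, 1L, "follows")))

        val graph = Graph(vertices, edges)

        // Graph operator: out-degree of each vertex.
        graph.outDegrees.collect().foreach(println)

        // Built-in algorithm: PageRank with convergence tolerance 0.001.
        val ranks = graph.pageRank(0.001).vertices
        ranks.join(vertices).collect().foreach { case (_, (rank, name)) =>
          println(f"$name%-6s $rank%.3f")
        }
        sc.stop()
      }
    }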


Ninth week: Spark on YARN

      • Spark on YARN principles
      • Spark on YARN practice


Tenth week: JobServer

          • Overall architecture
          • API Introduction
          • Configuration and Deployment

4, "Hadoop advanced"

    • Study MapReduce in depth, including job debugging and optimization methods
    • Master HDFS in depth, including system-level operations and performance optimization methods

Part I. MapReduce

MapReduce Workflow and Basic architecture review

Operation and Maintenance related

      • Parameter tuning
      • Benchmark
      • Reuse JVM
      • Error awareness and speculative execution
      • Task Log Analysis
      • Setting the error-tolerance percentage and skipping bad records
      • Selecting alternative schedulers such as FairScheduler to improve performance

Development-related

      • Data type selection
      • Implementing custom Writable data types and custom keys
      • Emitting values of different types from a single Mapper
      • InputFormat/OutputFormat: principles and customization
      • Use of Mapper/Reducer/Combiner, and how a Combiner helps the MapReduce framework optimize
      • Custom Partitioner (see the sketch after this list)
      • Sorting strategy: GroupingComparator/SortComparator
      • Task scheduling principles and how to modify them (case study: shared map/reduce slots, dispatching map/reduce tasks precisely by identity)
      • Streaming
      • DistributedCache
      • Dependencies between MapReduce jobs
      • Counters
      • Child-task JVM parameter settings
      • Performance optimization
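
As an example of one item above, a custom Partitioner, here is a sketch in Scala against the Hadoop MapReduce API that routes keys to reducers by their first character (purely illustrative):

    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapreduce.Partitioner

    // Route keys to reduce tasks by the first character of the key,
    // so all words starting with the same letter land in the same reducer.
    class FirstLetterPartitioner extends Partitioner[Text, IntWritable] {
      override def getPartition(key: Text, value: IntWritable, numPartitions: Int): Int = {
        val s = key.toString
        val first = if (s.isEmpty) 0 else s.charAt(0).toInt
        (first & Integer.MAX_VALUE) % numPartitions
      }
    }

    // Hooked into a job like this (job setup elided):
    //   job.setPartitionerClass(classOf[FirstLetterPartitioner])
    //   job.setNumReduceTasks(26)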

Part II. HDFS

HDFS API
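
A small sketch of the HDFS FileSystem API called from Scala (the paths and replication factor are placeholders):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object HdfsApiDemo {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()                 // picks up core-site.xml / hdfs-site.xml
        val fs   = FileSystem.get(conf)

        // Write a file.
        val file = new Path("/tmp/hdfs-api-demo.txt")  // placeholder path
        val out  = fs.create(file, true)               // overwrite if it exists
        out.writeBytes("hello hdfs\n")
        out.close()

        // Read it back.
        val in = fs.open(file)
        scala.io.Source.fromInputStream(in).getLines().foreach(println)
        in.close()

        // List a directory and adjust the replication factor.
        fs.listStatus(new Path("/tmp")).foreach(s => println(s.getPath))
        fs.setReplication(file, 2.toShort)

        fs.close()
      }
    }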

FUSE (C API)

Compression

HDFS benchmarks

Adding and removing DataNodes

Multi-disk support and disk error awareness

HDFS RAID

HDFS block size settings and related issues

File replication factor settings

Merging files in HDFS

Part III. Hadoop Tools

dfsadmin / mradmin / balancer / distcp / fsck / fs / job

Monitoring and alerting

Hadoop Configuration Management

Part IV. Hadoop Debugging

Log

Debugging map/reduce tasks in local mode

Remote debugging

Part V. Problem Analysis

Introduction to Java GC and common Java process analysis tools: jstat, jhat, jmap

top/iostat/netstat/lsof, etc.

jstack / kill -3

strace

nload/tcpdump

Part VI. Analysis Examples

Simple MapReduce analysis

Implementing group-by with MapReduce

Implementing an inverted index with MapReduce

Implementing a histogram with MapReduce

Implementing joins with MapReduce

5, "HBase Advanced"

    • Learn to design reasonable schemas for massive data sets
    • Master HBase performance optimization methods and usage scenarios

6. "SQL on Hadoop"

    • Learn about Hive SQL parsing and performance optimization, Impala task generation, and more
    • Master how to build an open data platform using SQL on Hadoop

7, "Hadoop/spark Enterprise Application Practical"

    • Learn how to use Hadoop and Spark in production systems
    • Master solutions for integrating with existing enterprise BI platforms

