Hadoop Learning Summary

Source: Internet
Author: User

I recently studied the Hadoop framework. My understanding of it is summarized below:

1. In the classic MapReduce architecture, Hadoop runs a job by having a JobTracker dispatch tasks to a set of TaskTrackers; each TaskTracker reports the results of its tasks back to the JobTracker.

The input data for the job is divided into splits. The JobTracker creates one map task for each split and a configured number of reduce tasks, then assigns these tasks to idle TaskTrackers.

Once the JobTracker has assigned it a task, a TaskTracker fetches the job's JAR file and the input data for its split, launches a new JVM to run the task, and periodically reports the task's progress back to the JobTracker.
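The scheduling flow described above can be sketched in a few lines of Python. This is only a toy model, not Hadoop's scheduler: the function names `make_splits` and `assign_tasks`, the tracker names, and the round-robin policy are all assumptions made for illustration (real TaskTrackers request work through heartbeats).

```python
from collections import deque


def make_splits(records, split_size):
    """Divide the input into fixed-size splits; each split gets one map task."""
    return [records[i:i + split_size] for i in range(0, len(records), split_size)]


def assign_tasks(tasks, idle_trackers):
    """Hand pending tasks to idle trackers round-robin (hypothetical policy)."""
    if not idle_trackers:
        raise ValueError("no idle trackers available")
    assignment = {tracker: [] for tracker in idle_trackers}
    pending = deque(tasks)
    while pending:
        for tracker in idle_trackers:
            if not pending:
                break
            assignment[tracker].append(pending.popleft())
    return assignment


records = [f"line-{n}" for n in range(10)]
splits = make_splits(records, split_size=4)

# One map task per split; the number of reduce tasks is chosen by the job.
map_tasks = [f"map-{i}" for i in range(len(splits))]
reduce_tasks = [f"reduce-{i}" for i in range(2)]

plan = assign_tasks(map_tasks + reduce_tasks, ["tracker-a", "tracker-b"])
for tracker, tasks in plan.items():
    print(tracker, tasks)
```

Ten records with a split size of four yield three splits, hence three map tasks, which are spread over the two trackers together with the two reduce tasks.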

2. A map task extracts records from the raw input data. The extracted data then goes through the shuffle and sort phase, the sorted data becomes the input to the reduce tasks, and the reduce tasks aggregate it to produce the final result.
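A word count is the usual way to illustrate this pipeline. The sketch below imitates the three phases in plain Python on a single machine; it does not use any Hadoop API, and the function names `map_phase`, `shuffle_and_sort`, `reduce_phase`, and `run_job` are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: extract (word, 1) pairs from one line of raw input."""
    for word in line.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Shuffle and sort: group values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: aggregate all values that share a key."""
    return key, sum(values)


def run_job(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return [reduce_phase(key, values) for key, values in shuffle_and_sort(mapped)]


result = run_job(["the quick fox", "the lazy dog", "the fox"])
print(result)
```

In a real job the map and reduce functions run on different machines, and the shuffle moves each key's values across the network to the reducer responsible for that key.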

3. HDFS splits a file into data blocks and stores them on DataNodes. Each DataNode reports the blocks it holds to the NameNode. A client asks the NameNode which DataNodes hold each block of a file, then reads the entire file from those DataNodes.
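The division of labor between NameNode and DataNodes can be modeled as below. This is a simplified in-memory sketch under stated assumptions: the classes and the `write_file` / `read_file` helpers are hypothetical, blocks are placed round-robin, and each block is stored only once, whereas real HDFS replicates every block on several DataNodes.

```python
class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def block_report(self):
        """Tell the NameNode which blocks this node holds."""
        return list(self.blocks)


class NameNode:
    def __init__(self):
        self.files = {}      # file name -> ordered list of block ids
        self.locations = {}  # block id -> set of DataNode names

    def receive_report(self, datanode):
        for block_id in datanode.block_report():
            self.locations.setdefault(block_id, set()).add(datanode.name)

    def locate(self, filename):
        """Return (block id, holders) for each block, in file order."""
        return [(b, sorted(self.locations[b])) for b in self.files[filename]]


def write_file(namenode, datanodes, filename, data, block_size):
    block_ids = []
    for index, offset in enumerate(range(0, len(data), block_size)):
        block_id = f"{filename}#blk{index}"
        node = datanodes[index % len(datanodes)]  # round-robin placement (assumption)
        node.blocks[block_id] = data[offset:offset + block_size]
        block_ids.append(block_id)
    namenode.files[filename] = block_ids
    for node in datanodes:
        namenode.receive_report(node)


def read_file(namenode, datanodes_by_name, filename):
    # The NameNode supplies only metadata; the bytes come from the DataNodes.
    parts = []
    for block_id, holders in namenode.locate(filename):
        parts.append(datanodes_by_name[holders[0]].blocks[block_id])
    return b"".join(parts)


nodes = [DataNode("dn1"), DataNode("dn2")]
namenode = NameNode()
write_file(namenode, nodes, "log.txt", b"abcdefghij", block_size=4)
content = read_file(namenode, {n.name: n for n in nodes}, "log.txt")
print(namenode.locate("log.txt"), content)
```

The point of the design is that the NameNode never handles file contents, so it stays small enough to keep the whole block map in memory.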

4. The input data of a Hadoop job must be fixed before the job runs, because the data has to be split up front. Hadoop MapReduce is therefore not suited to incremental or streaming data processing.

5. MapReduce itself requires writing Java code, and optimizing it is fairly complex. Pig lets the user manipulate data with statements in a scripting language called Pig Latin, and compiles those statements into MapReduce jobs that run on Hadoop. Pig does not require the data it operates on to conform to a relational database schema.

6. Hive is a data warehouse built on top of Hadoop. It attaches table metadata to data stored in HDFS, provides SQL-style data manipulation, and translates those operations into MapReduce jobs that run on Hadoop.

The main difference between Hive and Pig is that Hive provides SQL-style operations on database tables, whereas Pig manipulates the data directly through its own Pig Latin scripts. Pig is also the more lightweight of the two.

7. HBase is a column-oriented distributed database. Its tables do not follow a relational database schema and it does not support SQL queries. It can store its data on HDFS.

The main difference between HBase and Hive is that HBase is a distributed database that is easy to scale horizontally, whereas Hive is not a database of that kind.

The second difference is that HBase stores its data by column, which makes reads and writes flexible, whereas Hive does not change the storage format of the data; it only adds table metadata so that the data can be accessed through relational-style table operations.
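To make the "no relational schema" point concrete, the following sketch models an HBase-style table as a sparse map from row key to `family:qualifier` columns. It is an illustration only and does not use the HBase client API: the `SparseTable` class and the sample rows are hypothetical, and cell timestamps and versions are left out.

```python
class SparseTable:
    """Toy model: column families are fixed, column qualifiers are free-form."""

    def __init__(self, column_families):
        self.column_families = set(column_families)
        self.rows = {}  # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        family, _, _qualifier = column.partition(":")
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column=None):
        row = self.rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)

    def scan(self, start_row, stop_row):
        """Yield rows in key order, start inclusive and stop exclusive."""
        for key in sorted(self.rows):
            if start_row <= key < stop_row:
                yield key, dict(self.rows[key])


table = SparseTable(["info", "stats"])
table.put("user#001", "info:name", "Ada")
table.put("user#001", "stats:logins", 3)
table.put("user#002", "info:email", "bob@example.com")  # a column row 001 lacks

for key, columns in table.scan("user#001", "user#003"):
    print(key, columns)
```

Each row carries only the columns that were actually written, so two rows of the same table need not share any column at all, which a fixed relational schema would not allow.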

8. ZooKeeper packages the common mechanisms that distributed services need. It achieves high availability through redundancy across multiple servers: the servers automatically elect a leader that handles users' write requests, each write is synchronized to a majority (quorum) of the servers, and both the leader and the followers can serve users' read requests.

Typical application scenarios include highly available distributed configuration storage, subscription to status-update notifications for distributed objects, distributed locks, and leader election in a distributed cluster.
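The majority rule behind ZooKeeper's write path can be sketched as follows. This is a toy model, not the ZooKeeper client API or its actual consensus protocol: the `Server` and `Ensemble` classes are hypothetical, and the election rule (highest server name wins) is a placeholder assumption.

```python
class Server:
    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.data = {}


class Ensemble:
    def __init__(self, servers):
        self.servers = servers
        self.quorum = len(servers) // 2 + 1  # strict majority

    def live(self):
        return [s for s in self.servers if s.up]

    def elect_leader(self):
        """Placeholder election: needs a quorum, then picks the highest name."""
        live = self.live()
        if len(live) < self.quorum:
            return None
        return max(live, key=lambda s: s.name)

    def write(self, key, value):
        """Writes go through the leader and succeed only with a quorum."""
        if self.elect_leader() is None:
            return False
        for server in self.live():
            server.data[key] = value
        return True

    def read(self, server, key):
        """Any live server, leader or follower, can answer a read."""
        return server.data.get(key) if server.up else None


servers = [Server(f"zk{i}") for i in range(1, 6)]
ensemble = Ensemble(servers)

servers[3].up = False
servers[4].up = False
first_write = ensemble.write("/config/mode", "primary")   # 3 of 5 live: accepted

servers[2].up = False
second_write = ensemble.write("/config/mode", "standby")  # 2 of 5 live: rejected

print(first_write, second_write, ensemble.read(servers[0], "/config/mode"))
```

With five servers the ensemble tolerates two failures; once a third server goes down, writes are refused, but the surviving servers still hold the last committed value.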

Finish
