Introduction to Distributed Computing with Hadoop

Source: Internet
Author: User

What is Hadoop: Hadoop is a software platform for developing and running large-scale data processing applications. It is an open-source framework implemented in Java by Apache, and it enables distributed computing over massive data sets on a cluster made up of a large number of computers.

The core components of the Hadoop framework are HDFS and MapReduce. HDFS provides storage for massive data, while MapReduce provides the computation over that data.

Data processing in Hadoop can be understood simply as follows: input data is fed to the Hadoop cluster, the cluster processes it, and the result is obtained.


HDFS (Hadoop Distributed File System) is Hadoop's distributed file system.

Large files are divided into 64 MB data blocks by default (the Hadoop 1.x default; Hadoop 2.x and later default to 128 MB) and stored across the machines in the cluster.

For example, the file data1 is divided into three blocks, and redundant copies (replicas) of each block are stored on different machines.
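
The block splitting and replica placement described above can be sketched in a few lines of Python. This is only an illustration, not how HDFS is implemented: the 150 MB file size, the node names, the replication factor of 3, and the round-robin placement are all assumptions made for the example (real HDFS placement also takes racks and free space into account).

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default block size quoted above
REPLICATION = 3                # assumed replication factor for this sketch


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return one (offset, length) pair per block of a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_replicas(num_blocks, machines, replication=REPLICATION):
    """Assign every block to `replication` distinct machines (simple round-robin)."""
    if replication > len(machines):
        raise ValueError("need at least as many machines as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            machines[(block + r) % len(machines)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    size = 150 * 1024 * 1024  # hypothetical size of the file data1
    blocks = split_into_blocks(size)
    placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
    for index, (offset, length) in enumerate(blocks):
        print(index, offset, length, placement[index])
```

With these assumed numbers the file yields three blocks (64 MB, 64 MB, and 22 MB), and no two replicas of the same block land on the same machine.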


MapReduce: Hadoop creates one task for each input split and calls the map function in it; within this task, the records in the split are processed in sequence. Map emits its results as key-value pairs. Hadoop is responsible for sorting the map output by key and passing it on as the reduce input. The output of the reduce tasks is the output of the entire job and is saved on HDFS.
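
The map, sort, and reduce flow can be imitated on a single machine. The sketch below is a word count in plain Python, not the Hadoop API: the function names map_words, reduce_counts, and run_job are hypothetical, and the "splits" are just in-memory lists of text lines.

```python
from itertools import groupby
from operator import itemgetter


def map_words(record):
    """Map step: emit a (word, 1) key-value pair for every word in one record."""
    for word in record.split():
        yield (word.lower(), 1)


def reduce_counts(key, values):
    """Reduce step: sum all values that share the same key."""
    return (key, sum(values))


def run_job(splits):
    """Run map over every split, sort the output by key, then reduce per key."""
    map_output = []
    for split in splits:          # one "map task" per split
        for record in split:      # records are processed in sequence
            map_output.extend(map_words(record))

    # Stand-in for the sort that Hadoop performs between map and reduce.
    map_output.sort(key=itemgetter(0))

    results = []
    for key, group in groupby(map_output, key=itemgetter(0)):
        results.append(reduce_counts(key, (value for _, value in group)))
    return results


if __name__ == "__main__":
    splits = [["hadoop stores data", "hadoop computes"], ["data data"]]
    print(run_job(splits))
```

In a real cluster the map tasks run in parallel on different machines and the final result is written to HDFS; here everything happens in one process so that the data flow is easy to follow.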


A Hadoop cluster is mainly composed of the NameNode, DataNodes, the Secondary NameNode, the JobTracker, and TaskTrackers.


The NameNode records how files are split into blocks and on which DataNodes those blocks are stored.

The NameNode also keeps the running state information of the file system.

DataNodes store the split blocks.

The Secondary NameNode helps the NameNode by periodically collecting (checkpointing) the state information of the file system.
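
The division of labor between the NameNode (metadata only) and the DataNodes (block contents) can be made concrete with a toy model. Everything in this sketch is an assumption made for illustration: the class and function names, the block identifiers of the form path#index, the tiny block size, and the replication factor of 2. It does not reflect the real HDFS protocol.

```python
class DataNode:
    """Stores the actual bytes of blocks."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Stores only metadata: which blocks make up a file and where they live."""

    def __init__(self):
        self.file_blocks = {}      # path -> ordered list of block ids
        self.block_locations = {}  # block id -> list of DataNode names

    def register(self, path, block_id, datanode_names):
        self.file_blocks.setdefault(path, []).append(block_id)
        self.block_locations[block_id] = list(datanode_names)

    def locate(self, path):
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]


def write_file(namenode, datanodes, path, data, block_size, replication=2):
    """Split data into blocks, store replicas on DataNodes, record metadata."""
    names = sorted(datanodes)
    for start in range(0, len(data), block_size):
        number = start // block_size
        block_id = f"{path}#{number}"
        targets = [names[(number + r) % len(names)] for r in range(replication)]
        for target in targets:
            datanodes[target].store(block_id, data[start:start + block_size])
        namenode.register(path, block_id, targets)


def read_file(namenode, datanodes, path):
    """Ask the NameNode where the blocks are, then read them from DataNodes."""
    parts = []
    for block_id, locations in namenode.locate(path):
        parts.append(datanodes[locations[0]].read(block_id))
    return b"".join(parts)
```

A short usage example: after write_file(nn, dns, "/data1", b"abcdefghij", block_size=4) with three DataNodes, read_file(nn, dns, "/data1") returns the original bytes, while the NameNode itself holds only the block list and locations.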

When a job is submitted to a Hadoop cluster, the JobTracker is responsible for running the job and scheduling its tasks across multiple TaskTrackers.

Each TaskTracker is responsible for running the map or reduce tasks assigned to it.
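
As a rough illustration of the JobTracker and TaskTracker relationship, the sketch below breaks a job into map and reduce tasks and hands them out to trackers. The names plan_job and schedule, the task labels, and the round-robin policy are assumptions for this example; the real JobTracker uses a more involved scheduling policy than this.

```python
from collections import defaultdict


def plan_job(num_splits, num_reducers):
    """Break a job into one map task per input split plus the reduce tasks."""
    tasks = [f"map-{i}" for i in range(num_splits)]
    tasks += [f"reduce-{i}" for i in range(num_reducers)]
    return tasks


def schedule(tasks, tasktrackers):
    """Hand tasks to the available TaskTrackers in round-robin order."""
    if not tasktrackers:
        raise ValueError("at least one TaskTracker is required")
    assignments = defaultdict(list)
    for index, task in enumerate(tasks):
        tracker = tasktrackers[index % len(tasktrackers)]
        assignments[tracker].append(task)
    return dict(assignments)


if __name__ == "__main__":
    tasks = plan_job(num_splits=3, num_reducers=1)
    print(schedule(tasks, ["tasktracker1", "tasktracker2"]))
```

The point of the sketch is only the split of responsibilities: one coordinator decides which task goes where, and each worker runs whatever it has been assigned.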

 

From: http://hechuanzhen.iteye.com/blog/1748106
