What is Hadoop?

Last Update:2015-11-24 Source: Internet

Author: User

What is a Hadoop project?

Hadoop is a distributed storage and computing platform for big data.

Hadoop was created by Doug Cutting, who also created Lucene and Nutch.

Its design was inspired by three papers published by Google.

Hadoop Core Project

HDFS: Hadoop Distributed File System, a distributed file system

MapReduce: Parallel Computing Framework
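To make the MapReduce idea concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps applied to word counting. It is plain Python rather than Hadoop's own API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical names chosen for this illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["hadoop stores data", "hadoop processes data"])
print(counts)
```

In a real cluster the map and reduce calls would run in parallel on many machines; this sketch only shows how data flows between the phases.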

Hadoop Architecture

HDFS Architecture

(1) Master-slave structure

• Master node, only one: NameNode

• Slave nodes, there are many: DataNodes

(2) NameNode is responsible for: management

• Receiving user requests to operate on the file system (there are two general ways to issue them: the command line and the Java API).

• Maintaining the directory structure of the file system (used to organize the files).

• Managing the mapping between files and blocks (which blocks a file is divided into, which file each block belongs to, and the order of the blocks, much like the segments of a movie), as well as the mapping between blocks and DataNodes.
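The two mappings the NameNode maintains can be pictured as a pair of dictionaries. The following is a toy in-memory sketch, not Hadoop's real classes: `NameNodeMetadata`, its methods, and the block and node names are all hypothetical and exist only to illustrate the bookkeeping described above.

```python
class NameNodeMetadata:
    """Toy model of NameNode bookkeeping (illustrative only, not Hadoop's API)."""

    def __init__(self):
        self.file_to_blocks = {}  # file path -> ordered list of block IDs
        self.block_to_nodes = {}  # block ID -> DataNodes holding a copy

    def add_file(self, path, placements):
        """Register a file; placements is an ordered list of (block_id, [datanodes])."""
        self.file_to_blocks[path] = [block_id for block_id, _ in placements]
        for block_id, nodes in placements:
            self.block_to_nodes[block_id] = list(nodes)

    def locate(self, path):
        """Return, in file order, each block and the DataNodes that store it."""
        return [(b, self.block_to_nodes[b]) for b in self.file_to_blocks[path]]

    def list_dir(self, directory):
        """List the files that sit directly under the given directory."""
        prefix = directory.rstrip("/") + "/"
        return sorted(
            p for p in self.file_to_blocks
            if p.startswith(prefix) and "/" not in p[len(prefix):]
        )


meta = NameNodeMetadata()
meta.add_file("/logs/a.log", [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2", "dn3"])])
meta.add_file("/logs/old/b.log", [("blk_3", ["dn1"])])
print(meta.locate("/logs/a.log"))
print(meta.list_dir("/logs"))
```

Note that the sketch stores only metadata. The block contents themselves live on the DataNodes, which is why the NameNode can answer "where is this file?" without holding any file data.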

(3) DataNode is responsible for: storage

• Storing files.

• Files are split into blocks and stored on disk. Blocks are generally 64 MB, but a block that holds less data occupies only the space its data actually needs. Dividing big data into relatively small blocks makes full use of disk space and eases management.

• To keep the data safe, each file has multiple copies (like keeping spare keys to guard against loss), and each copy of a block is stored on a separate DataNode.
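The block arithmetic and the "one copy per DataNode" rule can be sketched as follows. This is an illustration under stated assumptions: the replication factor of 3 and the round-robin placement are choices made for this example (the text above only says "multiple copies"), and the function names are hypothetical. Real HDFS uses its own placement policy.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical block size mentioned above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block for a file of file_size bytes."""
    full, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full
    if remainder:
        sizes.append(remainder)  # the last block occupies only what it needs
    return sizes


def place_replicas(block_index, datanodes, replication=3):
    """Pick `replication` distinct DataNodes for one block (simple round-robin).

    replication=3 is an assumption for this sketch, not a value from the text.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return [datanodes[(block_index + i) % len(datanodes)] for i in range(replication)]


MB = 1024 * 1024
sizes = split_into_blocks(150 * MB)  # a 150 MB file
nodes = ["dn1", "dn2", "dn3", "dn4"]
for index, size in enumerate(sizes):
    print(index, size // MB, "MB ->", place_replicas(index, nodes))
```

A 150 MB file therefore becomes two full 64 MB blocks plus one 22 MB block, and the 22 MB block uses 22 MB on disk rather than a full 64 MB.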

MapReduce Architecture

(1) Master-slave structure

• Master node, only one: JobTracker

• Slave nodes, there are many: TaskTrackers

(2) JobTracker is responsible for:

• Receiving computing tasks submitted by clients

• Assigning computing tasks to TaskTrackers for execution

• Monitoring the execution of the TaskTrackers

(3) TaskTracker is responsible for:

• Executing the computing tasks assigned by the JobTracker
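The division of labor above can be sketched as a small scheduler. This is a toy model, not Hadoop's actual JobTracker: the class, its method names, and the round-robin assignment are assumptions made for illustration, and monitoring is reduced to a single `report_failure` call.

```python
class JobTracker:
    """Toy scheduler: assigns tasks to TaskTrackers and reassigns them on failure."""

    def __init__(self, tasktrackers):
        self.tasktrackers = list(tasktrackers)
        self.assignments = {}  # task name -> TaskTracker responsible for it

    def submit(self, tasks):
        """Receive tasks from a client and assign them round-robin."""
        for i, task in enumerate(tasks):
            self.assignments[task] = self.tasktrackers[i % len(self.tasktrackers)]

    def report_failure(self, tracker):
        """Monitoring hook: a TaskTracker failed, so move its tasks elsewhere."""
        self.tasktrackers.remove(tracker)
        if not self.tasktrackers:
            raise RuntimeError("no TaskTrackers left to run the tasks")
        moved = [t for t, owner in self.assignments.items() if owner == tracker]
        for i, task in enumerate(moved):
            self.assignments[task] = self.tasktrackers[i % len(self.tasktrackers)]
        return moved


jt = JobTracker(["tt1", "tt2", "tt3"])
jt.submit(["map-0", "map-1", "map-2", "map-3"])
moved = jt.report_failure("tt1")
print(moved, jt.assignments)
```

The TaskTrackers are represented only by their names here. In a real cluster each one is a separate process that reports its status back to the JobTracker.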

Features of Hadoop

(1) Scalable: can reliably store and process petabytes (PB) of data.

(2) Economical: data can be distributed and processed across a cluster of commodity machines. Such a cluster can total up to thousands of nodes.

(3) Efficient: by distributing the data, Hadoop can process it in parallel on the nodes where the data resides, which makes processing very fast.

(4) Reliable: Hadoop automatically maintains multiple copies of the data and automatically redeploys computing tasks after a task fails.
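Feature (3), processing data on the node where it resides, can be illustrated with a small greedy assignment. This is a sketch under simplifying assumptions, not Hadoop's scheduler: each node runs at most one task, and the function name and data shapes are hypothetical.

```python
def assign_with_locality(block_locations, free_nodes):
    """Assign each block's task to a node that already stores the block when possible.

    block_locations: block ID -> list of nodes holding a replica
    free_nodes: nodes with a free task slot (one task each in this sketch)
    Returns block ID -> (node, is_local).
    """
    available = list(free_nodes)
    plan = {}
    for block, holders in block_locations.items():
        local = [n for n in holders if n in available]
        if local:
            node, is_local = local[0], True
        elif available:
            node, is_local = available[0], False  # block must be read over the network
        else:
            break  # no free slots left; the remaining tasks would have to wait
        available.remove(node)
        plan[block] = (node, is_local)
    return plan


plan = assign_with_locality(
    {"blk_1": ["dn1", "dn2"], "blk_2": ["dn1"], "blk_3": ["dn3"]},
    ["dn1", "dn2", "dn3"],
)
print(plan)
```

In this run two of the three tasks read their block from the local disk. Only `blk_2` has to be fetched over the network, because its single holder is already busy.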

The physical distribution of Hadoop clusters

Figure 1 Physical distribution of Hadoop clusters

The figure shows a cluster of two racks, with nodes drawn in two colors, green and yellow. Yellow marks the master node: the NameNode and the JobTracker each run on a dedicated server, and there is only one of each. Green marks the slave nodes, of which there are many. The JobTracker, NameNode, DataNode, and TaskTracker are all Java processes that call one another to carry out their respective functions. The master and slave nodes generally run in different Java virtual machines, so communication between them crosses virtual machines.

The cluster runs on servers, and a server is essentially physical hardware. Whether a server is a master node or a slave node depends on the role or process it runs. A server running Tomcat is a web server, and a server running a database is a database server. Likewise, a server running the NameNode or JobTracker is a master node, and a server running a DataNode or TaskTracker is a slave node.

To achieve high-speed communication, the cluster generally uses a local area network. On the intranet you can use gigabit network cards, high-speed switches, optical fiber, and so on.

Single-node physical structure of a Hadoop cluster

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
