What is Hadoop?

Source: Internet
Author: User

What is the Hadoop project?

Hadoop is a distributed storage and computing platform for big data.

It was created by Doug Cutting, who also started the Lucene and Nutch projects.

Its design was inspired by three papers published by Google.

Hadoop Core Projects

HDFS: the Hadoop Distributed File System, a distributed file system

MapReduce: a parallel computing framework
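To make the MapReduce idea more concrete, here is a minimal single-process sketch of a word count split into map, shuffle, and reduce phases. It is an illustration only and does not use Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and a real job would run many map and reduce tasks in parallel across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all counts for one word into a single total."""
    return key, sum(values)


def word_count(lines):
    # Each line could be handled by a different map task; here they run in a loop.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Calling `word_count(["Hadoop stores data", "Hadoop processes data"])` counts "hadoop" and "data" twice each and the other words once.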

Hadoop Architecture

HDFS Architecture

(1) Master-slave structure

• Master node (only one): NameNode

• Slave nodes (many): DataNodes

(2) The NameNode is responsible for management:

• Receives users' requests for file system operations and carries them out (requests generally arrive in one of two ways: the command line or the Java API).

• Maintains the directory structure of the file system, which is used to organize the files.

• Manages the mapping between files and blocks (which blocks a file is divided into, which file each block belongs to, and the order of the blocks, much like the clips that make up a movie), as well as the mapping between blocks and DataNodes.
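The bookkeeping described in this list can be sketched as two lookup tables. The class below is a toy model under stated assumptions, not Hadoop's implementation: `NameNodeMetadata`, its method names, and the block IDs such as `blk_1` are all hypothetical.

```python
class NameNodeMetadata:
    """Toy model of the NameNode's bookkeeping (hypothetical, not Hadoop's API)."""

    def __init__(self):
        self.file_blocks = {}      # file path -> ordered list of block IDs
        self.block_locations = {}  # block ID -> set of DataNode names

    def add_file(self, path, block_ids):
        # The list order records the order of the blocks within the file.
        self.file_blocks[path] = list(block_ids)

    def report_block(self, block_id, datanode):
        # Record that a DataNode holds a copy of this block.
        self.block_locations.setdefault(block_id, set()).add(datanode)

    def list_dir(self, directory):
        # Files whose path lies directly or indirectly under the directory.
        prefix = directory.rstrip("/") + "/"
        return sorted(p for p in self.file_blocks if p.startswith(prefix))

    def locate(self, path):
        # For each block of the file, in order, return where its copies live.
        return [(b, sorted(self.block_locations.get(b, ())))
                for b in self.file_blocks[path]]
```

A client that wants to read a file would, in this model, call `locate(path)` to learn which DataNodes to contact for each block, and then fetch the block contents from those DataNodes directly.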

(3) DataNodes are responsible for storage:

• Store the files.

• Files are split into blocks that are stored on disk. The block size is generally 64 MB, but a block that holds less data than that occupies only the space of the data it actually contains. Splitting big data into relatively small blocks makes full use of disk space and simplifies management.

• To keep the data safe, each file has multiple replicas (much like keeping spare keys to guard against loss), and the replicas of each block are stored on different DataNodes.
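The block arithmetic and replica placement described above can be sketched as follows. This is a simplified illustration: the function names are hypothetical, the replication factor of 3 is an assumption not stated in the text, and the rotation used here only guarantees that copies land on distinct DataNodes. It is not Hadoop's actual placement policy.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size mentioned in the text


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block of a file of file_size bytes."""
    full, rest = divmod(file_size, block_size)
    sizes = [block_size] * full
    if rest:
        sizes.append(rest)  # the last block only occupies the bytes it holds
    return sizes


def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block's copies to distinct DataNodes (simple rotation)."""
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    placement = []
    for i in range(num_blocks):
        # Consecutive positions modulo the node count never repeat a node.
        placement.append([datanodes[(i + r) % len(datanodes)]
                          for r in range(replication)])
    return placement
```

For example, a 150 MB file yields two full 64 MB blocks and one 22 MB block, and the final block takes up 22 MB on disk rather than 64 MB.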

MapReduce Architecture

(1) Master-slave structure

• Master node (only one): JobTracker

• Slave nodes (many): TaskTrackers

(2) The JobTracker is responsible for:

• Receiving computing jobs submitted by clients

• Assigning computing tasks to TaskTrackers for execution

• Monitoring the execution of tasks on the TaskTrackers

(3) TaskTrackers are responsible for:

• Executing the computing tasks assigned by the JobTracker
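The division of labor above can be sketched with a toy scheduler. Everything in this block is hypothetical: the class and method names are invented for illustration, and round-robin assignment is a stand-in for whatever policy a real JobTracker applies.

```python
class JobTracker:
    """Toy JobTracker: hands out tasks and records their status (hypothetical)."""

    def __init__(self, tasktrackers):
        self.tasktrackers = list(tasktrackers)
        self.assignments = {}  # task ID -> TaskTracker name
        self.status = {}       # task ID -> "running" or "done"

    def submit(self, task_ids):
        # Spread the submitted tasks over the TaskTrackers in round-robin order.
        for i, task in enumerate(task_ids):
            self.assignments[task] = self.tasktrackers[i % len(self.tasktrackers)]
            self.status[task] = "running"

    def report_done(self, task):
        # Called when a TaskTracker reports that it finished a task.
        self.status[task] = "done"

    def pending(self):
        # Tasks that the JobTracker is still monitoring.
        return sorted(t for t, s in self.status.items() if s == "running")
```

In this model the client only ever talks to the JobTracker (`submit`), while the TaskTrackers do the work and report back (`report_done`).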

Features of Hadoop

(1) Scalable: Hadoop can reliably store and process petabytes (PB) of data.

(2) Economical: Data is distributed and processed across server clusters built from ordinary commodity machines. Such clusters can contain up to thousands of nodes.

(3) Efficient: By distributing the data, Hadoop can process it in parallel on the nodes where the data resides, which makes processing very fast.

(4) Reliable: Hadoop automatically maintains multiple copies of the data and automatically redeploys computing tasks after a task fails.
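Features (3) and (4) can be illustrated together: run a task where its data already lives, and move it elsewhere if that node fails. The sketch below is a simplified model with hypothetical function names. The `failing_nodes` argument simulates failures, and the limit of three attempts is an arbitrary assumption rather than a value taken from Hadoop.

```python
def choose_node(block_holders, live_nodes):
    """Prefer a live node that already stores the block; else any live node."""
    for node in block_holders:
        if node in live_nodes:
            return node, True   # data-local: no network transfer needed
    return sorted(live_nodes)[0], False  # falls back to a remote read


def run_with_retry(block_holders, live_nodes, failing_nodes, max_attempts=3):
    """Re-schedule a task on another node when the chosen node fails."""
    live = set(live_nodes)
    for attempt in range(1, max_attempts + 1):
        if not live:
            break
        node, local = choose_node(block_holders, live)
        if node not in failing_nodes:
            return node, local, attempt
        live.discard(node)  # treat the failed node as unavailable and retry
    raise RuntimeError("task failed on every attempt")
```

Because each block has copies on several nodes, a retry can often still run data-local on another holder of the same block, which is how replication supports both speed and reliability.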

The physical distribution of Hadoop clusters

Figure 1 Physical distribution of Hadoop clusters

Figure 1 shows a cluster made up of two racks, drawn in two colors, yellow and green. Yellow marks the master node: the NameNode and the JobTracker run on dedicated servers, and there is only one of each. Green marks the slave nodes, of which there are many. The JobTracker, NameNode, DataNode, and TaskTracker are all, in essence, Java processes that call one another to carry out their respective functions. Because the master and slave nodes generally run in different Java virtual machines, communication between them is communication across virtual machines.

The cluster runs on servers, and a server is essentially just physical hardware. Whether a server is a master node or a slave node depends on which role, that is, which process, runs on it: a server running Tomcat is a web server, and a server running a database is a database server. In the same way, a server running the NameNode or JobTracker is a master node, and a server running a DataNode or TaskTracker is a slave node.
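The point that a server's role follows from the processes it runs can be written down directly. This is a small illustrative sketch with a hypothetical function name, and the process names are simply the four daemons named in the text.

```python
MASTER_PROCESSES = {"NameNode", "JobTracker"}
SLAVE_PROCESSES = {"DataNode", "TaskTracker"}


def node_role(running_processes):
    """Classify a server by the Hadoop processes running on it."""
    procs = set(running_processes)
    if procs & MASTER_PROCESSES:
        return "master"
    if procs & SLAVE_PROCESSES:
        return "slave"
    return "not a Hadoop node"
```

The same physical machine would change roles in this model if it were restarted with a different set of processes.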

To achieve high-speed communication, the cluster is generally deployed on a local area network (LAN). Within the intranet, Gigabit network cards, high-speed switches, optical fiber, and similar equipment can be used.

Single-node physical structure of a Hadoop cluster
