What is a Hadoop project?
Hadoop is a distributed storage and computing platform for big data.
It was created by Doug Cutting, the author of Lucene and Nutch.
It was inspired by three papers from Google: GFS, MapReduce, and BigTable.
Hadoop Core Project
HDFS: Hadoop Distributed File System, a distributed filesystem
MapReduce: a parallel computing framework
Hadoop Architecture
HDFS Architecture
(1) Master-slave structure
• Master node (only one): NameNode
• Slave nodes (many): DataNodes
(2) NameNode is responsible for: management
• Receives user requests and performs file system operations (generally in two ways: the command line and the Java API; see the sketch after this list)
• Maintains the directory structure of the file system (used to organize the files)
• Maintains the mapping between files and blocks (which blocks a file is split into, which file a block belongs to, and the blocks' order, like clips of a movie) and the mapping between blocks and DataNodes
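The list above mentions the Java API as one way to drive the file system. Below is a minimal sketch of that API: it asks the NameNode for the root directory listing, which is a pure metadata operation. The NameNode address hdfs://namenode:9000 and the Hadoop 1.x property name fs.default.name are assumptions for illustration, not values from this article.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsListing {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; the NameNode answers metadata requests only.
        conf.set("fs.default.name", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);
        // Listing a directory touches only the NameNode's directory structure,
        // not the DataNodes that hold the actual blocks.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}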
(3) DataNode is responsible for: storage
• Stores files
• Files are split into blocks (generally 64 MB each, though a block occupies only as much disk space as the data it actually holds) and stored on disk; dividing big data into relatively small blocks makes full use of disk space and simplifies management
• To keep the data safe, each file has multiple replicas (like spare keys, to guard against loss), each copied to a separate DataNode (a sketch of setting block size and replication follows this list)
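As a sketch of the block and replica behavior described above, the HDFS Java API lets a client choose the block size and replication factor per file. The file path and NameNode address below are hypothetical; 64 MB mirrors the block size mentioned above, while 3 replicas is HDFS's usual default rather than a value from this article.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:9000"); // assumed address
        FileSystem fs = FileSystem.get(conf);

        long blockSize = 64L * 1024 * 1024; // 64 MB blocks, as described above
        short replication = 3;              // copies placed on separate DataNodes

        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(
                new Path("/demo/data.txt"), true, 4096, replication, blockSize);
        out.writeUTF("hello hdfs");
        out.close();
        fs.close();
    }
}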
MapReduce Architecture
(1) Master-slave structure
• Master node (only one): JobTracker
• Slave nodes (many): TaskTrackers
(2) JobTracker is responsible for:
• Receives computing jobs submitted by clients
• Assigns computation tasks to TaskTrackers for execution
• Monitors the TaskTrackers' execution
(3) TaskTracker is responsible for:
• Executes the computation tasks assigned by the JobTracker (a minimal WordCount job follows this list)
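To make this division of labor concrete, here is the canonical WordCount job: the client submits it, the JobTracker accepts it and hands map and reduce tasks to TaskTrackers. This is a minimal sketch against the org.apache.hadoop.mapreduce API of the Hadoop 1.x era this article describes; the input and output paths /demo/in and /demo/out are assumptions for illustration.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map tasks run on TaskTrackers, ideally on the node holding the data block.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // emit (word, 1) per occurrence
                }
            }
        }
    }

    // Reduce tasks sum the counts for each word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count"); // submitted to the JobTracker
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/demo/in"));    // assumed path
        FileOutputFormat.setOutputPath(job, new Path("/demo/out")); // assumed path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}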
Features of Hadoop
(1) Scalable: can reliably store and process petabytes (PB) of data.
(2) Economical: data can be distributed and processed across server farms built from ordinary machines; these farms can total up to thousands of nodes.
(3) Efficient: by distributing the data, Hadoop processes it in parallel on the nodes where the data resides, which makes processing very fast.
(4) Reliable: Hadoop automatically maintains multiple copies of the data and automatically redeploys computation tasks after a failure.
The physical distribution of Hadoop clusters
Figure 1 Physical distribution of Hadoop clusters
Here is a cluster of two racks, with nodes in two colors, green and yellow. It is not difficult to see that yellow marks the master nodes (Master): the NameNode and JobTracker each have a server to themselves, and there is only one of each. Green marks the slave nodes (Slave), of which there are many. The JobTracker, NameNode, DataNode, and TaskTracker are all, in essence, Java processes that invoke one another to carry out their respective functions. Master and slave nodes generally run in different Java virtual machines, so the communication between them is cross-JVM communication.
These cluster nodes are hosted on servers, and a server is essentially just physical hardware. Whether a server is a master node or a slave node depends on which role or process it runs: if it runs Tomcat it is a web server, and if it runs a database it is a database server. Likewise, a server running the NameNode or JobTracker is a master node, and one running a DataNode or TaskTracker is a slave node.
To achieve high-speed communication, a LAN is generally used; within the intranet you can use gigabit network cards, high-speed switches, optical fiber, and so on.
Hadoop Cluster Single-Node Physical Structure