Hadoop Learning Notes: A First Look at Hadoop

Source: Internet
Author: User

Hadoop is a distributed storage and computing platform for big data. Its distributed storage layer is HDFS (Hadoop Distributed File System), and its computing layer is MapReduce. Because Hadoop stores data across many machines, data has to travel over the network, and bandwidth is limited; at small data scales Hadoop is therefore often less efficient than existing single-machine solutions, so it is best suited to big-data-scale workloads.

There are various Hadoop distributions today: Apache, Cloudera, Yahoo, and so on. In the notes that follow, the Apache distribution, version 1.1.2, is used.

Hadoop has two core components: HDFS and MapReduce. HDFS is a distributed file system used to store files; MapReduce is a computing framework that gains efficiency through parallel computation. A MapReduce job reads its input data from HDFS, performs the computation, and then saves the results back to HDFS.

HDFS has a master-slave architecture. There is only one master node, the NameNode, while there can be many slave nodes, the DataNodes. The NameNode is responsible for receiving the user's operation requests: since HDFS is a distributed file system, users can create folders, move and delete files, upload files, and so on, and all of these requests are received by the NameNode. The NameNode also maintains the directory structure of the file system and manages the mapping between files and blocks, and between blocks and DataNodes. Files in HDFS are stored as blocks, so if a file is split into several blocks, how do we know their order when we need to read the file back? It is the NameNode that records the relationship between the blocks and the files. The DataNodes are responsible for storing the file contents: a large file is divided into blocks, and these blocks are stored on the DataNodes' disks. To keep the data safe, each block usually has several replicas stored on different DataNodes, which prevents files from being lost when a machine fails. This is the master-slave architecture of HDFS.
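To make these client operations concrete, here is a minimal sketch (not part of the original notes) that talks to HDFS through Hadoop's Java FileSystem API. The local and HDFS paths are hypothetical, and the NameNode address is assumed to be configured in core-site.xml (fs.default.name in Hadoop 1.x):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
  public static void main(String[] args) throws Exception {
    // Reads fs.default.name (e.g. hdfs://namenode:9000) from core-site.xml.
    // The metadata requests below go to the NameNode; the file bytes themselves
    // are exchanged directly between this client and the DataNodes.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Create a folder.
    fs.mkdirs(new Path("/user/demo/input"));

    // Upload a local file; HDFS splits it into blocks stored across DataNodes.
    fs.copyFromLocalFile(new Path("/tmp/words.txt"),              // hypothetical local file
                         new Path("/user/demo/input/words.txt"));

    // List the directory, then rename and delete the uploaded file.
    for (FileStatus status : fs.listStatus(new Path("/user/demo/input"))) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }
    fs.rename(new Path("/user/demo/input/words.txt"),
              new Path("/user/demo/input/words-renamed.txt"));
    fs.delete(new Path("/user/demo/input/words-renamed.txt"), false);

    fs.close();
  }
}
```

Every operation above is first handled by the NameNode as an operation request; only the actual block data of the upload is streamed to the DataNodes.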

MapReduce, like HDFS, has a master-slave architecture: there is only one master node, the JobTracker, and a number of slave nodes, the TaskTrackers. Similar to HDFS, the master node (JobTracker) receives the computing jobs submitted by users, assigns the resulting tasks to TaskTrackers to execute, and monitors how the TaskTrackers are doing. The TaskTrackers are responsible for executing the tasks assigned by the JobTracker. The JobTracker and TaskTrackers are much like the project manager and the programmers in a software development process: the project manager (JobTracker) negotiates requirements with the user, collects what the user needs, and assigns the work to the programmers; the programmers (TaskTrackers) complete the tasks the JobTracker hands out, while the JobTracker monitors their progress so that when a TaskTracker finishes a task it can collect the result promptly and assign new work.
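As a concrete illustration of such a computing job, below is the classic word-count program as a minimal sketch against the Hadoop 1.x Java MapReduce API (this example is not part of the original notes; the input and output directories are whatever HDFS paths are passed on the command line). The client submits the job, which is broken into map and reduce tasks executed on the slave nodes; both the input and the results live on HDFS:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel as tasks on the slave nodes, one split of the input per task.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // emit (word, 1) for each word in the line
      }
    }
  }

  // Reduce phase: sums the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");        // Hadoop 1.x style job construction
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // results written back to HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit the job and wait
  }
}
```

Packaged into a jar, it would be launched with something like `hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output`, at which point the JobTracker takes over scheduling the map and reduce tasks.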

Here are the main features of Hadoop: 1. Strong scalability: it can reliably store and process petabytes of data. 2. Low cost: Hadoop processes data on distributed clusters built from ordinary commodity machines rather than expensive high-end servers, and such clusters can grow to thousands of nodes, so the cost is low. 3. High efficiency: efficiency comes from distributed computing; work is carried out in parallel on many nodes, which makes processing very fast. 4. Reliability: Hadoop creates multiple replicas of data while storing it, which keeps the data safe, and it automatically maintains these replicas and automatically re-schedules computing tasks after a failure.
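As an aside on point 4, the replica count is a per-file setting that clients can read and change. A minimal sketch using the Java FileSystem API (not part of the original notes, with a hypothetical file path) might look like this:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Replication factor applied to files created by this client (the usual default is 3).
    conf.set("dfs.replication", "3");

    FileSystem fs = FileSystem.get(conf);

    // Raise the replication factor of an existing file; the NameNode then schedules
    // the extra copies onto other DataNodes in the background.
    Path file = new Path("/data/important.log");   // hypothetical path
    fs.setReplication(file, (short) 5);

    // Inspect how many replicas the file currently requests.
    FileStatus status = fs.getFileStatus(file);
    System.out.println(file + " replication = " + status.getReplication());

    fs.close();
  }
}
```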

In both HDFS and MapReduce, there is no physical difference between the machine that serves as the master node and the machines that serve as slave nodes. Master and slave are distinguished by the processes running on a machine: if a machine runs the NameNode and JobTracker processes, it is the master node; conversely, if it runs the DataNode and TaskTracker processes, it is a slave node. Each of these Hadoop daemons runs as a Java process.
