A Brief Introduction to Hadoop

Source: Internet
Author: User

Most of this article comes from the official Hadoop website, in particular a PDF document introducing HDFS that serves as a comprehensive introduction to Hadoop. This series of Hadoop learning notes follows that material step by step, supplemented by many articles found on the web, and summarizes the problems I encountered while learning Hadoop.
Let's start with the origins of Hadoop. Any discussion of Hadoop has to mention Lucene and Nutch. Lucene is not an application but a pure-Java, high-performance full-text indexing engine toolkit that can easily be embedded in all kinds of applications to provide full-text search and indexing. Nutch, by contrast, is an application: a search engine built on top of Lucene, which supplies Nutch with its text search and indexing APIs. Nutch provides not only search but also data crawling. Before version 0.8.0, Hadoop was part of Nutch; starting with Nutch 0.8.0, the NDFS and MapReduce implementations inside it were split out into a new open-source project, and that project became Hadoop. The fundamental architectural change in Nutch 0.8.0 compared with earlier versions is that it is built entirely on top of Hadoop. Hadoop implements Google's GFS and MapReduce designs, making it a distributed computing platform.
In fact, Hadoop is not just a distributed file system for storage; it is a framework designed to run distributed applications on large clusters of commodity hardware.

Hadoop contains two parts:

1, HDFS

Hadoop Distributed File System (HDFS)
HDFS is highly fault tolerant and can be deployed on inexpensive hardware. It is well suited to applications with large data sets and provides high-throughput reads and writes. HDFS has a master/slave structure: in a typical deployment, a single NameNode runs on the master and one DataNode runs on each slave.
HDFS supports a traditional hierarchical file organization, similar to existing file systems: you can create and delete files, move a file from one directory to another, rename files, and so on. The NameNode manages the entire distributed file system, and file system operations (such as creating or deleting files and folders) are controlled by the NameNode.
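The division of labor described above — the NameNode alone tracks the directory tree, while DataNodes only hold raw blocks — can be illustrated with a toy model. The class and method names here are hypothetical, chosen for this sketch; real HDFS keeps this metadata in the NameNode's memory and edit log:

```python
class NameNode:
    """Toy model of the HDFS namespace: only the NameNode knows the
    directory hierarchy; data nodes never see file or folder names."""

    def __init__(self):
        # directory path -> set of child paths (files or directories)
        self.namespace = {"/": set()}

    def _parent(self, path):
        return path.rsplit("/", 1)[0] or "/"

    def mkdir(self, path):
        self.namespace[path] = set()
        self.namespace[self._parent(path)].add(path)

    def create_file(self, path):
        self.namespace[self._parent(path)].add(path)

    def rename(self, old, new):
        parent = self._parent(old)
        self.namespace[parent].discard(old)
        self.namespace[parent].add(new)

nn = NameNode()
nn.mkdir("/user")
nn.create_file("/user/data.txt")
nn.rename("/user/data.txt", "/user/data.old")
print(sorted(nn.namespace["/user"]))  # ['/user/data.old']
```

The point of the sketch is that create, delete, move, and rename are pure metadata operations: they touch only the NameNode's namespace, never the stored data itself.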
The following is the structure of HDFS:

(Figure: HDFS architecture, showing the NameNode, DataNodes, and clients.)
As the diagram above shows, communication among the NameNode, DataNodes, and clients runs over TCP/IP. When a client performs a write, the data is not sent to the NameNode immediately. The client first caches the data in a temporary folder on its own machine. When the cached data reaches the configured block size (64 MB by default), the client notifies the NameNode. The NameNode responds to the client's RPC request by inserting the file name into the file system hierarchy and finding DataNodes to hold the data blocks. It then tells the client which DataNodes and blocks to use, and the client writes the blocks from its local temporary folder to the designated DataNodes.
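The client-side buffering behavior described above can be sketched in a few lines. This is a simplified model, not the real HDFS protocol: the names are hypothetical, the block size is shrunk to a few bytes for the demo, and the "NameNode RPC" is just a callback that picks a DataNode:

```python
BLOCK_SIZE = 16  # bytes here, for illustration; the HDFS default was 64 MB

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = []

    def store(self, block):
        self.blocks.append(block)

class WriteClient:
    """Toy sketch of the write path: the client buffers data locally
    (the "temporary folder") and contacts the NameNode only once a
    full block has accumulated."""

    def __init__(self, allocate_block):
        self.allocate_block = allocate_block  # stands in for the NameNode RPC
        self.buffer = bytearray()

    def write(self, data):
        self.buffer.extend(data)
        # flush every complete block to a DataNode chosen by the NameNode
        while len(self.buffer) >= BLOCK_SIZE:
            block = bytes(self.buffer[:BLOCK_SIZE])
            del self.buffer[:BLOCK_SIZE]
            self.allocate_block().store(block)

datanodes = [DataNode("dn1"), DataNode("dn2")]
# allocation policy here is just "least-loaded node", purely for the demo
client = WriteClient(lambda: min(datanodes, key=lambda d: len(d.blocks)))
client.write(b"x" * 40)  # two full 16-byte blocks flushed, 8 bytes buffered
print(sum(len(d.blocks) for d in datanodes))  # 2
print(len(client.buffer))                     # 8
```

Note how the NameNode is involved only at block boundaries; the bulk data flows directly from the client to the DataNodes, which is what makes this design scale.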
HDFS uses a replica strategy to improve the reliability and availability of the system. The replica placement policy is three copies per block: one on the writing node, one on another node in the same rack, and one on a node in a different rack. As of Hadoop 0.12.0 this policy had not yet been implemented, but work was in progress, and I believe it will be available soon.
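The three-replica placement rule can be expressed directly as code. This is a minimal sketch under the assumption that cluster topology is given as a flat node-to-rack mapping (a hypothetical structure; real HDFS has a pluggable rack-awareness mechanism):

```python
def place_replicas(local, nodes):
    """Toy version of the three-replica policy: one copy on the writing
    node, one on a different node in the same rack, and one on a node
    in a different rack.  `nodes` maps node name -> rack name."""
    local_rack = nodes[local]
    same_rack = next(n for n, r in nodes.items()
                     if r == local_rack and n != local)
    other_rack = next(n for n, r in nodes.items() if r != local_rack)
    return [local, same_rack, other_rack]

nodes = {"n1": "rackA", "n2": "rackA", "n3": "rackB", "n4": "rackB"}
print(place_replicas("n1", nodes))  # ['n1', 'n2', 'n3']
```

The placement trades off write cost against fault tolerance: the same-rack copy is cheap to write, while the off-rack copy survives the failure of an entire rack.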

2, MapReduce

MapReduce, one of Google's key technologies, is a programming model for computing over very large data sets. The usual approach to such computations is parallel computing, but for many developers, at least at this stage, parallel computing remains a distant subject. MapReduce is a programming model that simplifies parallel computing, allowing developers with little parallel-computing experience to build parallel applications.
The name MapReduce comes from the model's two core operations: map and reduce. Anyone familiar with functional programming will find these two words familiar. In short, map applies a one-to-one transformation to a set of data, producing another set of data, with the mapping rule given by a function: for example, mapping [1, 2, 3, 4] through "multiply by 2" yields [2, 4, 6, 8]. Reduce folds a set of data down to a single value, again according to a supplied function: for example, reducing [1, 2, 3, 4] with addition gives 10, and with multiplication gives 24.
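The examples in the paragraph above translate directly into the `map` and `reduce` of ordinary functional programming, here shown with Python's built-ins (this illustrates the functional idea only, not Hadoop's distributed MapReduce API):

```python
from functools import reduce
from operator import mul

data = [1, 2, 3, 4]

# map: one-to-one transformation, with the rule given by a function
doubled = list(map(lambda x: x * 2, data))
print(doubled)  # [2, 4, 6, 8]

# reduce: fold a collection down to a single value
print(reduce(lambda a, b: a + b, data))  # 10  (sum)
print(reduce(mul, data))                 # 24  (product)
```

What MapReduce adds on top of these two functions is the machinery to run the map step in parallel across many machines and to collect the results for the reduce step.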
For more on MapReduce, I suggest the article "MapReduce: The Free Lunch Is Not Over!".

That is all for the first part of this series. I am only just beginning to work with Hadoop myself; next I will cover deploying Hadoop and the problems I ran into along the way, which may serve as a reference and save you some detours.
