Hadoop Learning Notes: A Brief Introduction



Here's a rough introduction to Hadoop.
    This article draws mostly on the official Hadoop website, in particular a PDF document introducing HDFS that gives a fairly comprehensive overview of Hadoop. This series of Hadoop learning notes follows that material step by step, refers to many articles on the web along the way, and summarizes the problems I ran into while learning Hadoop.
    Let's start with where Hadoop came from. When it comes to Hadoop, you have to mention Lucene and Nutch. Lucene is not an application but a pure-Java, high-performance full-text indexing engine toolkit that can easily be embedded in all kinds of applications to add full-text search and indexing. Nutch is an application: a search engine built on top of Lucene. Lucene provides Nutch with its text search and indexing API; Nutch adds not only search but also crawling to gather the data. Before version 0.8.0, the code that became Hadoop was part of Nutch; starting with Nutch 0.8.0, the NDFS and MapReduce implementations were split out into a new open-source project, which is Hadoop. Nutch 0.8.0 is also a fundamental architectural change from earlier versions, since it is built entirely on top of Hadoop. Hadoop implements Google's GFS and MapReduce designs, which makes Hadoop a distributed computing platform.
    In fact, Hadoop is not just a distributed file system for storage; it is a framework designed to run distributed applications on large clusters of general-purpose computing devices.

   Hadoop consists of two parts:

   1. HDFS

        The Hadoop Distributed File System.
      HDFS is highly fault tolerant and can be deployed on inexpensive hardware. It is well suited to applications with large datasets and provides high throughput for reading and writing data. HDFS has a master/slave structure: in a typical deployment, a single NameNode runs on the master, and one DataNode runs on each slave.
      HDFS supports a traditional hierarchical file organization, similar to existing file systems: you can create and delete a file, move a file from one directory to another, rename it, and so on. The NameNode manages the entire distributed file system, and operations on it (such as creating and deleting files and folders) are controlled by the NameNode.
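
      As a concrete illustration, here is a minimal sketch of those file operations using Hadoop's Java FileSystem API. This is a later API than the Hadoop 0.12 era this article describes, and the NameNode address and paths are made-up examples:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsBasicOps {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; adjust to your cluster.
            conf.set("fs.defaultFS", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);

            // Create a directory and a file inside it.
            fs.mkdirs(new Path("/user/demo"));
            try (FSDataOutputStream out = fs.create(new Path("/user/demo/a.txt"))) {
                out.writeUTF("hello hdfs");
            }

            // Move (rename) the file to another directory, then delete it.
            fs.mkdirs(new Path("/user/archive"));
            fs.rename(new Path("/user/demo/a.txt"), new Path("/user/archive/a.txt"));
            fs.delete(new Path("/user/archive/a.txt"), false); // non-recursive

            fs.close();
        }
    }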
      Below is the structure of HDFS:

      [Figure: HDFS architecture, showing the NameNode, DataNodes, and clients communicating over TCP/IP]
      As the diagram above shows, communication among the NameNode, the DataNodes, and clients is based on TCP/IP. When a client wants to perform a write, the request is not sent to the NameNode immediately: the client first caches the data in a temporary folder on the local machine. When the data in that folder reaches the configured block size (64 MB by default), the client notifies the NameNode. The NameNode answers the client's RPC request, inserts the file name into the file system hierarchy, allocates a block on a DataNode, and tells the client which DataNode and which data block to use. The client then writes the data blocks from its local temporary folder to the specified DataNode.
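
      From the application's point of view, all of this buffering and NameNode negotiation is hidden behind an ordinary output stream. A hedged sketch follows; the path is made up, "dfs.blocksize" is the setting's name in current releases (older ones used "dfs.block.size"), and today's default block size is larger than 64 MB:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setLong("dfs.blocksize", 64L * 1024 * 1024); // 64 MB, the old default
            try (FileSystem fs = FileSystem.get(conf);
                 FSDataOutputStream out = fs.create(new Path("/user/demo/big.dat"))) {
                byte[] chunk = new byte[8192];
                for (int i = 0; i < 10000; i++) {
                    out.write(chunk); // blocks are formed and shipped transparently
                }
            }
        }
    }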
      HDFS uses a replica strategy to improve the reliability and availability of the system. Its replica placement policy is three replicas: one on the local node, one on another node in the same rack, and one on a node in a different rack. As of Hadoop 0.12.0 this policy had not yet been fully implemented, but it was in progress, and I believe it will be out soon.
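
      The replica count itself is just a per-file setting. A small sketch: "dfs.replication" and FileSystem.setReplication are the real configuration key and call, while the path is made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setInt("dfs.replication", 3); // default replica count for new files
            FileSystem fs = FileSystem.get(conf);

            // Change the replication factor of an existing file...
            fs.setReplication(new Path("/user/demo/a.txt"), (short) 2);

            // ...and read back what the NameNode has recorded.
            short r = fs.getFileStatus(new Path("/user/demo/a.txt")).getReplication();
            System.out.println("replication = " + r);
            fs.close();
        }
    }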

   2. The MapReduce implementation

      MapReduce is an important technology at Google: a programming model for computing over large amounts of data. The usual approach to large-scale computation is parallel computing, but for many developers, at least at this stage, parallel computing is still a distant subject. MapReduce is a programming model that simplifies parallel computing, letting developers with little parallel-computing experience write parallel applications.
      The name MapReduce comes from the two core operations in the model: Map and Reduce. Readers familiar with functional programming will find these two words familiar. In short, Map maps one set of data one-to-one onto another set, with the mapping rule given by a function; for example, mapping [1, 2, 3, 4] with "multiply by 2" yields [2, 4, 6, 8]. Reduce folds a set of data down to a single value, with the rule again given by a function; for example, reducing [1, 2, 3, 4] by summation gives 10, and by product gives 24.
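
       The [1, 2, 3, 4] examples above translate directly into code. Here is a small sketch in plain Java streams (deliberately not the Hadoop MapReduce API) just to show the two operations:

    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collectors;

    public class MapReduceIdea {
        public static void main(String[] args) {
            List<Integer> input = Arrays.asList(1, 2, 3, 4);

            // Map: apply "multiply by 2" to every element, one-to-one.
            List<Integer> doubled = input.stream()
                    .map(x -> x * 2)
                    .collect(Collectors.toList());
            System.out.println(doubled); // [2, 4, 6, 8]

            // Reduce: fold the set down to a single value.
            int sum = input.stream().reduce(0, Integer::sum);        // 10
            int product = input.stream().reduce(1, (a, b) -> a * b); // 24
            System.out.println(sum + " " + product);
        }
    }

       In Hadoop itself the same two ideas appear as the Mapper and Reducer classes, with the framework handling distribution of the work across the cluster.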
       For more on MapReduce, I recommend the article "MapReduce: The Free Lunch Is Not Over!" by Meng Yan.

    Well, that is about all for the first article in this series. I am also just getting started with Hadoop; next I will talk about deploying it and the problems I encountered along the way, so you can use my experience as a reference and avoid some detours.

