Lucene-Hadoop: a simple implementation of Map/Reduce and GFS

Hadoop is a framework for building distributed applications. It provides applications with a set of stable and reliable interfaces while keeping the distribution itself transparent. The technology implements the Map/Reduce programming paradigm, in which an application is divided into many small blocks of work, each of which can be executed or re-executed on any node in the cluster. In addition, Hadoop provides a distributed file system that stores data across the cluster's computers, which are connected to one another by high-bandwidth links. Both Map/Reduce and the distributed file system are designed to be fault tolerant: when a node in the cluster fails, the file system as a whole and any running Map/Reduce operation can continue to function.

Hadoop was initially developed as infrastructure for the Nutch project; more information about Nutch can be found on the Apache website for the Nutch open-source project. Both Hadoop and Nutch are sub-projects of the Apache Lucene project.

Hadoop's Map/Reduce design model and execution framework:


Map/Reduce is a programming paradigm first realized in the Lisp programming language. It views a large distributed computation as a sequence of distributed operations over data sets of key/value pairs. The Hadoop Map/Reduce framework harnesses the computers of a cluster to execute user-defined Map/Reduce jobs. A Map/Reduce job is divided into two phases: the map phase and the reduce phase. The user supplies the system with a data set of key/value pairs as the input of the computation.

In the map phase, the framework splits the user's input data into many fragments (splits) and assigns each fragment to a map task. These many map tasks are distributed across the computers of the cluster running the framework. Each map task consumes key/value pairs from its fragment and produces a set of intermediate key/value pairs: for each input pair (K, V), the map task invokes the user-defined map function to transform (K, V) into one or more different pairs (K', V'). The framework then sorts this intermediate data by key and produces tuples of the form (K', V'*), so that all values belonging to the same key are grouped together. It also partitions these tuples into a number of fragments equal to the number of reduce tasks.

In the reduce phase, each reduce task consumes the fragments of (K', V'*) tuples assigned to it. For each such tuple, it invokes the user-defined reduce function to transform the fragment into the key/value pairs (K, V) the user wants as output. As with map tasks, the framework distributes these many reduce tasks across the cluster's computers and is responsible for moving the intermediate data produced in the map phase to the appropriate reduce tasks.

Tasks in both phases execute in a fault-tolerant style: if a node fails while a task is executing, the task it was running is reassigned to another node. Running many map and reduce tasks concurrently gives good load balancing, and it also allows the tasks of a failed machine to be reassigned quickly once it restarts.
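As an illustration of the two phases, the sketch below shows the map and reduce sides of the canonical word-count example, written against the org.apache.hadoop.mapreduce Java API (a later API than the one this article's era used, but the structure is the same): the map function turns each input line into intermediate (word, 1) pairs, and the reduce function sums the grouped counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: consume (offset, line) input pairs, emit intermediate (word, 1) pairs.
class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE); // emit (K', V') = (word, 1)
    }
  }
}

// Reduce phase: consume grouped (word, [1, 1, ...]) tuples, emit (word, total).
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result); // emit final (K, V) = (word, count)
  }
}
```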

Hadoop Map/Reduce architecture:

The Hadoop Map/Reduce framework has a master/slave architecture. It consists of a single master server (the JobTracker) and several slave servers (the TaskTrackers). The master server is the user's point of contact with the system: users submit their Map/Reduce jobs to the master server, which places them in a job queue and executes them on a first-come, first-served basis. The master server assigns the map and reduce tasks to the slave servers. The slaves execute the tasks under the master's direction, and they also handle the movement of data between the map and reduce phases.
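A driver program ties the mapper and reducer sketched earlier into a job and submits it to the master. The sketch below is the usual shape of such a driver, again using the org.apache.hadoop.mapreduce API rather than the JobTracker-era mapred API; the input and output paths are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Hypothetical input and output locations in the distributed file system.
    FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
    FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));
    // Submit the job to the master and wait for the cluster to finish its tasks.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, such a driver is typically launched with a command like hadoop jar wordcount.jar WordCount; from there the master queues the job and hands its map and reduce tasks to the slave servers.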


Hadoop DFS:

Hadoop's Distributed File System (HDFS) is designed to store very large data files across the computers of a cluster. The design is derived from the Google File System (GFS). HDFS stores each file as a sequence of blocks, and all blocks in a file except the last one are the same size. As part of fault tolerance, each block is replicated several times. The block size of each file and the number of replicas can be configured by the administrator. In addition, it is worth noting that files in HDFS are written only once, and at any point in time strictly one writer is allowed to be writing a file.

HDFS architecture:

Like Hadoop's Map/Reduce, HDFS follows a master/slave architecture. An HDFS installation includes a single control server that manages the file system namespace and regulates clients' access to files; we call it the named node (NameNode). It also includes a set of data nodes (DataNodes), one per computer in the cluster, each of which manages the storage on which HDFS keeps its blocks. Over an RPC interface, the named node executes file system namespace operations such as opening, closing, and renaming files and directories. The named node is also responsible for deciding how blocks are mapped to data nodes. The data nodes serve read and write requests from the file system's clients, and a data node can also create, delete, or replicate blocks under instruction from the named node.
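To make the client's view of this architecture concrete, here is a minimal sketch against the org.apache.hadoop.fs.FileSystem client API; the NameNode address and the file path are hypothetical. The client obtains metadata from the named node, while the bytes it writes and reads flow to and from the data nodes; note the write-once pattern, with a single writer creating the file before anyone reads it.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical cluster address; metadata requests go to the named node here.
    conf.set("fs.defaultFS", "hdfs://namenode:9000");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/hello.txt"); // hypothetical path

    // Write once: HDFS permits a single writer per file at any point in time.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello, hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read back: the named node maps the file to blocks, and the bytes
    // stream from the data nodes that hold the replicas.
    try (FSDataInputStream in = fs.open(file);
         BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }
    fs.close();
  }
}
```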