A simple implementation of MapReduce in Lucene-Hadoop and GFS

Hadoop is a framework for building distributed application programs. It provides applications with a set of stable and reliable interfaces while keeping the details of distribution transparent. The technology is an implementation of the MapReduce programming paradigm, in which an application is divided into many small blocks of work, and each block can be executed, or re-executed, on any node in the cluster. In addition, Hadoop provides a distributed file system that stores data on the cluster's machines and delivers high aggregate bandwidth across the cluster. Both MapReduce and the distributed file system are designed to be fault tolerant, so that if a node in the cluster fails, the file system and any running MapReduce computation can continue to work.

 

Hadoop was initially developed as the infrastructure of the Nutch project. For more detailed information about Nutch, see the Apache website for the open-source Nutch project. Both Hadoop and Nutch are sub-projects of the Apache Lucene project.

Hadoop MapReduce design model and execution framework:
MapReduce is a programming paradigm whose map and reduce primitives were first implemented in the LISP programming language. The paradigm treats a large distributed computation as a series of distributed operations over datasets of key/value pairs. The Hadoop MapReduce framework uses the computers in a cluster to execute user-defined MapReduce jobs, and each job is divided into two phases: map and reduce.

The user typically supplies a dataset of key/value pairs as the input to the computation. In the map phase, the framework splits the input data into fragments and assigns one map task to each fragment. The framework usually spreads these map tasks over many machines in the cluster. Each map task consumes the key/value pairs of its fragment and produces a set of intermediate key/value pairs: for each input pair (K, V), the map task calls the user-defined map function to transform (K, V) into pairs of a different type, (K', V'). The framework then sorts the intermediate data by key and forms tuples (K', V'*), so that all values belonging to the same key are grouped together. At the same time, it partitions these tuples into a number of fragments equal to the number of reduce tasks.

In the reduce phase, each reduce task consumes the (K', V'*) fragment assigned to it. For each such tuple, it calls the user-defined reduce function to transform the tuple into the output key/value pairs (K, V) that the user requires. As in the map phase, the framework distributes the many reduce tasks over the cluster's machines and delivers to each reduce task the intermediate data produced for it during the map phase.

Every phase is executed in a fault-tolerant style: if a node fails in the middle of the computation, the tasks it was running are reassigned to other nodes. Running many map and reduce tasks in parallel gives good load balance, and it also ensures that when a failed machine restarts, its tasks can be taken over again quickly.
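As a concrete illustration of the (K, V) to (K', V') map step and the (K', V'*) grouping consumed by reduce, below is a minimal word-count sketch in Java against the org.apache.hadoop.mapreduce API (the newer of Hadoop's two MapReduce APIs). The class and field names are illustrative, and exact API details can vary between Hadoop versions.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: each input pair (K, V) -- here (byte offset, line of text) --
    // is converted into intermediate pairs (K', V') of the form (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit an intermediate (K', V') pair
            }
        }
    }

    // Reduce phase: the framework sorts and groups the intermediate pairs into
    // (K', V'*) tuples; each reduce call folds one tuple into a final output pair.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);     // emit the final (K, V) output pair
        }
    }
}
```

Here the intermediate key K' is a word and V' is the count 1; the framework's sort-and-group step is what turns the stream of (word, 1) pairs into (word, [1, 1, ...]) tuples before the reduce function runs.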
Hadoop MapReduce architecture:
The Hadoop MapReduce framework has a master/slave architecture. It consists of a single master server (the jobtracker) and several slave servers (the tasktrackers). The master server is the user's point of contact with the system: users submit MapReduce jobs to the master server, which places them in a job queue and processes them on a first-come, first-served basis. The master server assigns the map and reduce tasks to the slave servers, and the slave servers execute these tasks under the master's control. The slave servers also handle the movement of data between the map and reduce phases.
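The submission path just described (client to the master's job queue, then map and reduce tasks on the slaves) is driven by a small client program. A minimal driver sketch, reusing the word-count classes from the earlier example, might look like the following; the input and output paths are hypothetical, and Job.getInstance is the Hadoop 2.x style of job construction.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);   // map tasks
        job.setReducerClass(WordCount.IntSumReducer.class);    // reduce tasks
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Hypothetical HDFS paths for the job's input fragments and output files.
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        // Submits the job to the master server's queue and waits while the
        // framework schedules the map and reduce tasks on the slave servers.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```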
Hadoop DFS:
The Hadoop Distributed File System (HDFS) is designed to store very large data files across the machines of a cluster. The design derives from the Google File System (GFS). HDFS stores each file as a sequence of blocks; all blocks in a file except the last one are the same size. For fault tolerance, each block is replicated into several copies. The block size and the number of replicas of each file can be configured by the administrator. It is also worth noting that HDFS files are written only once, and at any point in time only one writer is allowed.

HDFS architecture: Like Hadoop's MapReduce layer, HDFS follows a master/slave architecture. An HDFS deployment includes a single control server that manages the file system namespace and regulates clients' access to files; this node is called the namenode. The deployment also includes a set of datanodes, one per cluster machine running HDFS storage, each of which manages the storage attached to the machine it runs on. The namenode exposes file system namespace operations, such as opening, closing, and renaming files and directories, through an RPC interface, and it determines the mapping of blocks to datanodes. The datanodes serve read and write requests from the file system's clients, and they also create, delete, and replicate blocks under instruction from the namenode.

Reprinted from http://www.chinacloud.cn/show.aspx?id=1063&cid=12
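To illustrate the client-side view of the HDFS behaviour described above (per-file block size and replication, namenode-mediated access, write-once files), here is a small sketch using the org.apache.hadoop.fs.FileSystem API. The path and configuration values are illustrative assumptions, and the configuration key names shown are the Hadoop 2.x forms.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnce {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Block size and replication factor are per-file, administrator-tunable
        // settings; 64 MB blocks and 3 replicas are illustrative values.
        conf.setLong("dfs.blocksize", 64L * 1024 * 1024);
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);               // talks to the namenode
        Path file = new Path("/user/demo/write-once.txt");  // hypothetical path

        // HDFS files are written once, by a single writer; the namenode chooses
        // the datanodes that will hold each block's replicas.
        try (FSDataOutputStream out = fs.create(file, false /* do not overwrite */)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Reads ask the namenode for block locations, then stream the block
        // data directly from the datanodes holding the replicas.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buf = new byte[32];
            int n = in.read(buf);
            System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
        }
    }
}
```

The same write-once and read operations are also available from the command line through the hadoop fs shell, for example with -put and -cat.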
