Research and implementation of cloud computing and storage platform based on Hadoop
Keywords: cloud computing, Hadoop, HDFS, NameNode
With the development of Internet technology, the amount of information on the network is growing exponentially. According to the Digital Universe report published by the Internet Data Center (IDC), the data generated over the next eight years will reach the zettabyte (ZB) scale, equivalent to roughly 5,200 GB of data per person. Efficiently computing and storing such massive amounts of data is a major challenge for Internet companies. Traditional large-scale data processing mostly relies on parallel computing, grid computing, distributed high-performance computing, and similar approaches; these consume expensive storage and computing resources and require complex programming to allocate large-scale computing tasks effectively and partition the data reasonably. The Hadoop distributed cloud platform has emerged as a good way to solve such problems. Starting from an overview of Hadoop's core technologies, HDFS and MapReduce, this paper uses VMware virtual machines to build an efficient, easily extensible cloud data computing and storage platform based on Hadoop's distributed technology, and verifies the advantages of distributed computing and storage through experiments.
1. Hadoop and its related technologies
Hadoop is a product of the development of parallel computing, distributed computing, and grid computing, and provides a model structure designed for large-scale data computation and storage. Hadoop is a distributed computing and storage framework from Apache that can efficiently store large amounts of data and on which distributed applications can be written to analyze and compute massive data sets. Hadoop runs programs on clusters of inexpensive hardware and provides reliable, stable interfaces that allow applications to build distributed systems with high scalability and high reliability. Hadoop has the advantages of low cost, high reliability, high fault tolerance, strong scalability, high efficiency, portability, and free open-source availability.
A Hadoop cluster uses a typical master/slave architecture. The cloud computing and storage architecture model based on Hadoop is shown in Figure 1.
Figure 1 The cloud computing and storage architecture model based on Hadoop
1.1 Hadoop Distributed File System (HDFS)
HDFS is a distributed file system that runs on a large number of inexpensive machines and serves as the underlying file storage system of the Hadoop platform. It is mainly responsible for data management and storage and performs well for access to large files. HDFS is similar to traditional distributed file systems, but it also has its own characteristics, such as tolerance of hardware failures, support for large data sets, a simple consistency model, streaming data access, and convenient support for moving computation to the data. The HDFS workflow and architecture are shown in Figure 2.
Figure 2 HDFS workflow and architecture
An HDFS cluster has one NameNode and multiple DataNodes. As shown in Figure 2, the NameNode is the hub server that manages the file system's metadata as well as clients' read and write access to files, maintaining the file system tree with all of its files, directories, and child nodes. This information is saved on disk as an edit log file (EditLog) and a namespace image file (FsImage). The NameNode also temporarily records which DataNodes hold each block. Its functions include managing metadata and file blocks, handling metadata updates, and listening for and processing client requests.
A DataNode is an ordinary node in the cluster used to store and retrieve data blocks. It responds to the NameNode's commands to create, copy, and delete blocks, and it periodically sends a "heartbeat" to the NameNode; through the heartbeat it reports its load to the NameNode and receives the NameNode's instructions. The NameNode uses these heartbeats to determine whether a DataNode has failed: it checks each DataNode periodically, and if no feedback is received from a DataNode within the specified time, that node is considered failed and the load of the whole system is rebalanced. In HDFS, each file is divided into one or more blocks distributed across different DataNodes, and the blocks are replicated between DataNodes to form multiple backups.
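To make the NameNode/DataNode division of labor concrete, the following is a minimal illustrative Java sketch that writes a file to HDFS and reads it back through Hadoop's FileSystem API. It is not code from this paper; the NameNode address hdfs://namenode:9000 and the path /demo/hello.txt are assumed placeholders that would come from the cluster's own configuration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class HdfsReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; on a real cluster this is set by fs.defaultFS in core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/demo/hello.txt");

        // Writing: the client asks the NameNode where to place blocks, then streams the data to DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Reading: the NameNode returns block locations, and the client reads the blocks directly from DataNodes.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }

        fs.close();
    }
}

In both operations the NameNode supplies only metadata (block locations); the file contents themselves flow between the client and the DataNodes.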
1.2 MapReduce Programming Framework
MapReduce is the programming framework Hadoop uses to process massive amounts of data in cloud computing; with it, programmers can write programs that handle massive data sets without needing to understand the underlying implementation details. MapReduce can run on thousands of servers at the same time to carry out tasks such as advertising services and web search, and can easily handle TB-, PB-, and even EB-scale data.
The MapReduce framework consists of a JobTracker and TaskTrackers. There is only one JobTracker; it is the master node, responsible for assigning and scheduling tasks and for managing the TaskTrackers. A TaskTracker runs on each worker node and accepts and executes the tasks the JobTracker sends to it.
MapReduce performs distributed computation over a large data set in a cluster. The whole framework is built around a map function and a reduce function: when processing data, map is executed first and then reduce. The specific execution process is shown in Figure 3. The input data is split into fragments before the map function is executed, and the different fragments are assigned to different map tasks; each map function processes its fragment and emits intermediate (key, value) pairs. Before the reduce phase, these intermediate pairs are grouped by key, and all pairs with the same key are sent to the same reducer; finally, the reduce function merges the values for each key and writes the result to disk.
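As an illustration of this map/shuffle/reduce flow, the following is a minimal sketch modeled on the classic WordCount example rather than code from this paper: the map function emits (word, 1) pairs, the framework groups them by key, and the reduce function sums the counts for each word. Class names and input/output paths are assumptions.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each input split is processed line by line; emit a (word, 1) pair per word.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: the framework groups intermediate pairs by key, so each call receives
    // one word together with all of its counts; sum them and write the final result.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a program would typically be packaged as a jar and submitted to the cluster with the hadoop jar command, with the input and output directories given as arguments; the framework then handles splitting the input, scheduling map and reduce tasks, and performing the shuffle between them.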