Analysis of Big Data Processing: Hadoop (Part 1)


Overview

This era is known as the age of Big Data: data production across all industries has grown explosively, and organizations want to perform deep mining, analysis, and processing on these rapidly growing data sets. In such an age, many of the ways we do things are changing. For example, big data analysis can support disease prediction and control, traffic flow forecasting and control, fault diagnosis and prediction for large-scale systems, and customer purchase recommendations. Big data can solve many problems that were previously very difficult to solve; it can be said that in such an era, big data can make our lives better.

The sudden arrival of the Big Data era has had a huge impact on the technology world. The biggest challenges are how to store such huge volumes of data and how to process them. It was against this backdrop, and modeled on Google's proprietary systems, that the open source big data processing system Hadoop emerged.

Google published the first of its core cloud-computing papers in 2003, describing GFS. The developers of the Apache Nutch project realized that the GFS architecture was a good solution to the problem of storing the massive numbers of files generated while building a web search engine, so they took the GFS design as a reference and implemented an open source distributed file system. That file system later evolved into HDFS, the core Hadoop subproject. In 2004, Google published another core cloud-computing paper describing MapReduce, which solves the programming model problem for large-scale distributed computation. MapReduce's ideas were subsequently applied to Hadoop's predecessor project within Nutch and open sourced. In 2006, Yahoo made the Hadoop project independent of the Nutch search engine project, turning it into a separate Apache subproject. Since then, the Hadoop project has flourished.

As of October 2013, Hadoop 2.2.0 had been released, and Facebook, Alibaba, Baidu, and Tencent were all using Hadoop to deploy big data processing platforms. The following is an analysis of the key systems in the Hadoop project.

Distributed File System

In big data applications, file storage has the following characteristics:

1. Massive data storage. In a big data environment, both the number of files and the total volume of stored data are huge, so a traditional storage approach would require building an extremely large system up front. Traditional storage is mostly oriented toward single-machine high performance and high capacity, and it scales poorly. In the big data age, scalability is critical: data volume keeps increasing, and storage capacity must be able to grow with it at any time, so scale-out storage is the inevitable choice for massive data. Big data storage must be oriented toward low-cost scalability to cope with ever-increasing demand, which is why traditional storage hits a bottleneck in big data environments: customer needs have changed.

2. Large file storage. In a big data environment, stored files are predominantly large files; this is a very important workload characteristic. Traditional storage designed for the primary storage domain mainly optimizes reads and writes of small files, so in a big data environment it is wasteful: the aspects it focuses on are not well exploited, while the aspects users actually need have not been given enough attention.

3. Read-mostly file access. In a big data environment, read requests far outnumber write requests. Especially in the Internet domain, writes are relatively few while reads are very numerous, so how to optimize the storage design for read-mostly workloads must be considered.

4. Concurrent access. In a big data environment, the number of application clients is very large, so avoiding data access bottlenecks in the file system and increasing its capacity for concurrent access by many clients are important issues in distributed file system design.

To meet these demands, Google proposed the GFS distributed file system architecture, which the Hadoop Distributed File System (HDFS) adopts. The structure of this distributed file system is described below.

Structurally, this distributed file system is relatively simple. It is divided into two parts: the first is a controller that manages the file directory structure and file metadata, called the NameNode; the second consists of the DataNodes, which store the data. When a client needs to access a file, it first contacts the NameNode to obtain the file's metadata and the locations of its data. Once the client has this information from the NameNode, subsequent data access no longer goes through the NameNode: the client communicates directly with the DataNodes. Data access in this file system is thus "out of band", which allows very high concurrency, since different clients can access different DataNodes at the same time.
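A minimal sketch of this access pattern, using the HDFS Java client API; the NameNode address and file path below are placeholders, not values from the article:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The client contacts the NameNode only for metadata and block
        // locations; the file bytes themselves stream from DataNodes.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder host
        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/data/sample.txt"))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) > 0) {
                // Each read is served directly by a DataNode holding the block.
                System.out.write(buffer, 0, n);
            }
        }
    }
}
```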

One drawback of this distributed file system is its handling of small files. Because every file access must first contact the NameNode, processing large numbers of small files causes frequent NameNode accesses, and the NameNode becomes a bottleneck for the entire system. Fortunately, big data applications mainly process large files, so this distributed file system architecture can meet their requirements.

This distributed file system has strong storage scalability. To expand storage capacity, a user just adds a DataNode and registers it with the NameNode for management. DataNode expansion is transparent to clients, and adding DataNodes increases not only the storage capacity but also the overall data throughput of the system. The biggest problem with this architecture is the potential bottleneck at the NameNode; the biggest benefit of a single NameNode is that it reduces the complexity of the design and implementation.
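As a small illustration of this transparency, the HDFS client API can report the DataNodes the NameNode currently manages; a newly added DataNode appears in this listing without any client-side change. The hostname below is a placeholder:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class ClusterStatus {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);
        // DistributedFileSystem exposes cluster-level DataNode status.
        DistributedFileSystem dfs = (DistributedFileSystem) fs;
        for (DatanodeInfo dn : dfs.getDataNodeStats()) {
            System.out.printf("%s capacity=%d bytes, used=%d bytes%n",
                    dn.getHostName(), dn.getCapacity(), dn.getDfsUsed());
        }
    }
}
```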

The NameNode is the metadata server of the whole system; therefore, performance and the single point of failure are the first problems to consider in NameNode design. To improve performance, the NameNode can run on a high-performance server, and metadata processing performance can be raised further through clustering. In addition, to avoid a single point of failure, high availability (HA) mechanisms can be used to strengthen NameNode reliability. Many vendors have proposed HA solutions to ensure the NameNode fails over in the shortest possible time, improving the quality of service and reliability of the entire system. NameNode design and optimization is the focus when applying the Hadoop distributed file system.
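As a hedged sketch of what an HA deployment looks like from the client side, the following sets the standard HDFS HA client properties through the Java Configuration API; the nameservice ID "mycluster" and the hostnames are placeholders:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HaClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // A logical nameservice backed by two NameNodes (placeholder names).
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1-host:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2-host:8020");
        // The failover proxy provider lets the client retry transparently
        // against the standby NameNode when the active one fails.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf)) {
            System.out.println("Connected to: " + fs.getUri());
        }
    }
}
```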

Data reliability is another problem Hadoop's distributed file system needs to address. To reduce the cost of the whole system, DataNodes can be built from low-cost servers, and the RAID found in traditional storage can be removed from these servers to minimize per-node cost. How, then, is data reliability ensured in a distributed file system without RAID? Hadoop takes the same approach as GFS: multiple replicas. For example, ordinary files might use a 3-replica policy and important files a 6-replica policy. When a file is written, the data is written to one DataNode, and that DataNode then replicates the data to other DataNodes. The benefits of replication are that it is simple to implement and, most importantly, that it allows reads to be separated from writes and read requests to be served concurrently. The drawback of replication is equally obvious: large-scale data replication significantly reduces storage utilization.

To improve storage space utilization, erasure codes have been introduced into distributed storage systems. An erasure code can achieve data redundancy similar in effect to traditional RAID algorithms, but it requires the DataNodes to have some computing capability, so introducing it affects the write performance of the whole system. Moreover, because an erasure code splits the data and protects it with redundancy coding, it cannot provide the read/write separation and concurrent reads that replication gives, so read performance inevitably suffers as well. Storage space utilization and performance are in tension, and balancing these two requirements is a design consideration. Facebook has done a great deal of work on erasure codes and in 2013 published a paper on applying them to big data, "XORing Elephants: Novel Erasure Codes for Big Data".
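A minimal sketch of per-file replication control through the HDFS API; the 3x/6x split mirrors the article's example policy, and the paths are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationPolicy {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            // Ordinary file: keep the common 3-replica policy.
            fs.setReplication(new Path("/data/ordinary.log"), (short) 3);
            // Important file: raise the replication factor to 6.
            fs.setReplication(new Path("/data/critical.dat"), (short) 6);
        }
    }
}
```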

The Hadoop distributed file system is a commonly used structure in big data storage systems. Following this structure, Taobao developed TFS to store the image and video files of its e-commerce platform, optimized for Taobao's workload characteristics. The biggest optimizations include:

1. A simplified NameNode file directory structure. Files stored by Taobao do not need a complex directory tree to manage them; a flat structure meets the requirements, and each file can be identified by a 64-bit ID.

2. Taobao's images come in various sizes; small files are merged into large files, and the large files are what get stored. The idea is similar to Facebook's Haystack system, and a minimal illustration follows.
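A purely illustrative sketch (not the actual TFS or Haystack implementation) of this merging idea: many small files are appended into one large file, and a 64-bit ID maps to an (offset, length) pair for retrieval.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

public class SmallFileMerger {
    private final RandomAccessFile bigFile;                  // the merged large file
    private final Map<Long, long[]> index = new HashMap<>(); // id -> {offset, length}
    private long nextId = 1;

    public SmallFileMerger(String path) throws IOException {
        this.bigFile = new RandomAccessFile(path, "rw");
    }

    // Append one small file's bytes and return its 64-bit ID.
    public synchronized long put(byte[] data) throws IOException {
        long offset = bigFile.length();
        bigFile.seek(offset);
        bigFile.write(data);
        long id = nextId++;
        index.put(id, new long[]{offset, data.length});
        return id;
    }

    // Read a small file back by ID via the in-memory index.
    public synchronized byte[] get(long id) throws IOException {
        long[] entry = index.get(id);
        byte[] out = new byte[(int) entry[1]];
        bigFile.seek(entry[0]);
        bigFile.readFully(out);
        return out;
    }
}
```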

The Hadoop distributed file system plays a cornerstone role in big data processing; the distributed data processing frameworks and distributed databases discussed later can be built on top of it.

(To be continued)

This article is from the "Storage Path" blog; please keep this source: http://alanwu.blog.51cto.com/3652632/1416743
