Analysis of the relationship between big data and Hadoop

Henry and I are working on an examination of big data and its true meaning. Big data is a popular buzzword, and like many buzzwords, the term is a bit overused, yet it encompasses some real usefulness and real technology. We decided to analyze the topic, try to sort out what is genuine about big data, and figure out what it means for storage solutions.

Henry started this series with a good introduction, and his definition of big data is the best I have seen, so I will repeat it here:

Big data is the process of turning data into information and then into knowledge.

This definition is apt, because the adjective "big" can carry many meanings. Some people interpret "big" in terms of their own area of focus; we focus instead on what you can do with the data and why.

Henry and I decided to approach this discussion from two directions. Henry starts with the hardware itself and works his way up the stack; more precisely, he wants to identify which aspects of the hardware, and which technologies, matter for big data. I start at the top of the big data stack, at the application level, and work my way down. We will meet somewhere in the middle and then combine our thoughts and comments into a final article.

Starting from the top has not been easy, and my original article grew very long, so we decided to split it into three parts. The first part opens with some basic questions from the top of the stack, including how data gets into the storage system for use in the first place (which matters more than most people realize), and discusses the most common tool for big data, the NoSQL database. The second part analyzes 8 NoSQL database types used for big data and how they affect storage. This last part on the top of the stack discusses the role of Hadoop in big data and how all of this relates to analysis tools such as R.

The connection to Hadoop

All of the databases mentioned in the previous articles need somewhere to store their data, and performance is an important part of that. Some of the tools we have mentioned use Hadoop as their storage platform. Hadoop is not really a file system; in fact, it is a software framework that supports data-intensive distributed applications, such as the applications discussed here as well as those covered in earlier articles. Combined with MapReduce, Hadoop can be a very effective solution for data-intensive applications.

The Hadoop Distributed File System (HDFS) is an open source file system derived from the Google File System (GFS); GFS itself, however, is proprietary to Google. HDFS is written in Java and is a distributed file system, and really a meta file system, in other words a file system that sits on top of underlying file systems. It is designed to be fault tolerant: copies of the data are stored in different locations within the file system, so recovering data from a corrupted replica or a downed server is fairly easy. These replicas can also be used to improve performance.

The basic building block of Hadoop is called the DataNode, a combination of a server with some storage and networking. The storage is typically storage inside the server or direct-attached storage (DAS). Each DataNode serves data over the network (Ethernet) using a block protocol specific to HDFS. A number of DataNodes are distributed across multiple racks, and each DataNode is partly identified by the rack it sits in. Hadoop also has a metadata server, called the NameNode, which acts as the management node for HDFS. In addition, HDFS has a secondary NameNode, but it is not a failover metadata server; it handles other file system tasks, such as snapshotting the directory information of the primary NameNode to help reduce downtime if the NameNode fails. Because there is only one NameNode, it can become a potential bottleneck and a single point of failure for HDFS.
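
To make the DataNode/NameNode split more concrete, here is a minimal sketch, using the standard Hadoop Java API, of asking the NameNode where the blocks of a file live and which hosts and racks hold them. The NameNode address and file path are hypothetical placeholders, not something from this article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/data/example.log");   // hypothetical file
            FileStatus status = fs.getFileStatus(file);

            // The NameNode (the metadata server) answers this query; the data
            // itself is later served by the DataNodes that hold each block.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());

            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                        + " hosts=" + String.join(",", block.getHosts())
                        + " racks=" + String.join(",", block.getTopologyPaths()));
            }
        }
    }
}
```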

An important feature of HDFS is that data is replicated to multiple DataNodes to improve resilience. By default, HDFS stores three copies of the data on different DataNodes: two replicas on the same rack and one on a different rack (so your data remains accessible even if an entire rack fails). You can run tasks on the DataNodes that already hold the required data (note that DataNodes can run tasks while also storing and serving data), by default on one of the three DataNodes that has a copy of it.
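
As a quick illustration of the replication factor, the following sketch (again using the standard Java API, with a hypothetical file path) reads a file's current replication factor and asks the NameNode to raise it. The dfs.replication setting, which defaults to 3, controls the value that new files receive.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/data/example.log");   // hypothetical file

            // dfs.replication (default 3) controls how many DataNodes hold a copy.
            short current = fs.getFileStatus(file).getReplication();
            System.out.println("current replication factor: " + current);

            // Ask the NameNode to keep four copies of this particular file;
            // the extra replica is created in the background.
            boolean accepted = fs.setReplication(file, (short) 4);
            System.out.println("replication change accepted: " + accepted);
        }
    }
}
```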

This is what many people mean by "move the task to the data, rather than the data to the task." It reduces data movement and eases the load on the network, because the tasks, not the data, are what move. Once a task starts, all data access is local, so there is no need for the DataNode to reach across the network or to use multiple data servers to satisfy parallel access. Hadoop's parallelism shows up in application performance: multiple copies of the same application can run concurrently and access different datasets. And because there are three copies of the data, three tasks can access the same file at the same time, which improves performance.
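
The "move the task to the data" idea is easiest to see in a MapReduce job: the code only declares what each map task does, and the framework then schedules those tasks on DataNodes that already hold the relevant input blocks. The sketch below is a simple line-counting job with hypothetical input and output paths, offered only as an illustration of the programming model.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LineCount {

    // Each map task processes one block of the input file, ideally on the
    // DataNode that stores that block, so the read is local.
    public static class LineMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final Text KEY = new Text("lines");
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(KEY, ONE);
        }
    }

    // A single reducer sums the per-block counts.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable v : values) {
                total += v.get();
            }
            context.write(key, new IntWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "line count");
        job.setJarByClass(LineCount.class);
        job.setMapperClass(LineMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));     // hypothetical
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));  // hypothetical
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```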

On the back end, DataNodes communicate with other DataNodes, using remote procedure calls (RPC) to perform a series of tasks:

Rebalancing capacity across DataNodes while complying with the data replication rules;

Comparing files with one another so that damaged copies are overwritten with correct ones;

Checking the number of copies of the data and creating additional copies if necessary.

It is important to note that HDFS is not a POSIX (Portable Operating System Interface) compliant file system, mainly because relaxing POSIX compliance allows better performance.

Accessing data in HDFS is fairly straightforward if you use the Java API (application programming interface), the Thrift API, the command-line interface, or browse it over HTTP through the HDFS web UI. Beyond that, HDFS cannot be mounted directly by the operating system; the only workaround is to mount the file system through the Linux FUSE client.
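
For example, reading a file through the Java API mentioned above takes only a few lines. The path below is hypothetical; the same read could be done from the command line with hadoop fs -cat.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/data/example.log"));  // hypothetical path
             BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```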

Remember that Hadoop is based on the Google File System (GFS), which was built to support Google's BigTable, a column-oriented database. Because of this lineage, Hadoop is particularly well suited to supporting the column-store tools mentioned earlier, and in this context many of those tools have developed Hadoop-facing interfaces so they can use Hadoop to store their data.
