Hadoop: A Stable, Efficient, and Flexible Big Data Processing Platform

If you talk to people about big data, the conversation soon turns to the yellow elephant: Hadoop (whose mascot is a yellow elephant). This open source software platform, maintained by the Apache Software Foundation, is valuable because it can process very large datasets simply and efficiently.

But what is Hadoop? Put simply, Hadoop is a software framework for the distributed processing of large amounts of data. It stores large datasets across a distributed cluster of servers, and then runs distributed data analysis applications on each of those servers.

What's so special about Hadoop? First, it is very reliable: even if one server, or a whole group of servers, goes down, your big data application keeps working. Second, it is highly efficient: it does not need to shuttle large volumes of data back and forth across the network during processing.

Apache's official description of Hadoop reads:

The Apache Hadoop software framework uses a simple programming model to distribute and process large datasets. It is designed to scale from a single server up to thousands of machines, each of which contributes local computation and storage. Rather than relying on hardware to deliver reliability, Hadoop detects and handles failures at the application layer, so it can deliver a highly available service on top of a cluster of computers, each of which may be prone to failure.

More remarkably, Hadoop is almost entirely modular, which means you can swap out nearly any of its components for a different software tool. Hadoop's architecture is thus not only reliable and efficient but also flexible.

Hadoop Distributed File System (HDFS)

If you still find Hadoop a bit abstract at this point, just remember its two most important components: a data processing framework, and a distributed file system for storing the data. Hadoop's internals are fairly complex, but with these two components in place it can basically run.

The distributed file system is the distributed storage cluster mentioned above; bluntly, it is the part of Hadoop that holds the actual data. Hadoop can use other file systems for data storage, but by default it uses its own Hadoop Distributed File System (HDFS) for the job.

Think of HDFS as a big bucket into which you pour your data. Until you call on it, the data sits obediently and safely in the bucket. Whether you analyze it inside Hadoop or pull it out for use with other tools, HDFS gives you a smooth and reliable experience either way.
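To make that concrete, here is a minimal sketch of writing a file into that "bucket" through Hadoop's Java FileSystem API. The NameNode address (hdfs://localhost:9000) and the target path are assumptions for illustration; a real cluster's core-site.xml normally supplies the address.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; adjust for your cluster.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        try (FileSystem fs = FileSystem.get(conf)) {
            Path dst = new Path("/user/demo/sample.txt"); // illustrative path
            try (FSDataOutputStream out = fs.create(dst)) {
                // HDFS splits and replicates the file's blocks behind the scenes.
                out.writeBytes("hello, hdfs\n");
            }
            System.out.println("Wrote " + fs.getFileStatus(dst).getLen() + " bytes");
        }
    }
}
```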

Data Processing Framework & MapReduce

(Translator's note: MapReduce is a programming model developed by Google for parallel processing of large datasets; Google's original implementation was written in C++.)

The data processing framework is the tool that actually works on the data. By default this is MapReduce, a Java-based system. You may be more familiar with MapReduce than with HDFS, mainly for two reasons:

1. MapReduce is the tool that actually processes the data.

2. Using it for data processing simply feels cool.

In a so-called "normal" relational database, data is located and analyzed with the standard Structured Query Language (SQL). Non-relational databases also use query languages, but they are not limited to SQL; they can pull the required data out of the data store with other query languages as well. Hence the name of the category: NoSQL, for "Not only SQL."

In fact, Hadoop is not a database. It can store your data and let you pull data back out at will, but no query language is involved (neither SQL nor anything else). Because Hadoop is more than a data storage system, it needs a system like MapReduce to actually process the data.

MapReduce carries out a series of jobs. Each job is a separate Java application that goes into the data and pulls out the different pieces of information that are needed. Replacing a traditional query language with MapReduce makes data retrieval more precise and flexible; on the other hand, it also makes data processing more complex.
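To give a feel for what such a Java job looks like, below is a sketch of the classic word count example using Hadoop's MapReduce API: the map function emits a count of 1 for every word it sees, and the reduce function sums the counts per word. The class and field names are illustrative, not from this article, and in a real project each public class would live in its own source file.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: for each line of input, emit (word, 1) for every word.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce phase: sum the counts that arrive for each word.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```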

However, there are also tools to simplify the process. For example, Hadoop includes an application called Hive (also an Apache project), which translates query statements into MapReduce jobs. That appears to simplify things, but this batch-oriented approach can only work through one job at a time, so using Hadoop this way feels more like running a data warehouse than a data analysis tool.
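As an illustration of how such a query tool can stand in for hand-written jobs, here is a sketch of submitting a SQL-like query to Hive over JDBC. The connection URL, table name (words), and empty credentials are assumptions; the sketch presumes a HiveServer2 endpoint is running and the hive-jdbc jar is on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Load the Hive JDBC driver (requires the hive-jdbc jar on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical HiveServer2 endpoint; adjust host, port, and database.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement();
             // Hive compiles this statement into one or more MapReduce jobs.
             ResultSet rs = stmt.executeQuery(
                     "SELECT word, COUNT(*) AS freq FROM words GROUP BY word")) {
            while (rs.next()) {
                System.out.println(rs.getString("word") + "\t" + rs.getLong("freq"));
            }
        }
    }
}
```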

Distributed Across a Cluster

Another distinctive feature of Hadoop is that all of the functions described above are distributed, in sharp contrast to the centralized systems of traditional databases.

In a database that spans multiple machines, the work is usually divided up: all the data lives on one or more machines, while the data processing software runs on another server or group of servers.

In a Hadoop cluster, both HDFS and the MapReduce system run on every machine. This has two advantages: first, even if one server in the cluster goes down, the system as a whole keeps working; second, keeping storage and processing together on the same machines speeds up data retrieval.

In any case, as I said: Hadoop is reliable and efficient.

When MapReduce receives a query, it immediately engages two kinds of "trackers" to handle it. One is the JobTracker, which runs on Hadoop's master node; the others are the TaskTrackers, which run on every node in the Hadoop cluster.

The entire process is linear: first, in the map phase, the JobTracker splits the job into well-defined tasks and hands them to the TaskTrackers on the cluster machines where the relevant data lives. Then, in the reduce phase, the results matching the query, together with the results from the other machines in the cluster, are gathered and sent back to the Hadoop cluster's master node.
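Tying the two phases together, here is a sketch of a driver that configures and submits the word count job from the earlier example; the input and output paths come from the command line. It assumes the hypothetical WordCountMapper and WordCountReducer classes sketched above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        // The mapper and reducer classes from the earlier sketch (assumed).
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not yet exist
        // Submitting the job hands it to the cluster, which dispatches the tasks
        // to the machines where the data lives.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```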

HDFS is distributed in much the same way. A single master node, the NameNode, tracks where data is located across the cluster's servers. The machines that actually hold the data are called DataNodes; data is stored on them in blocks, typically 128 MB each, which serve as the unit of replication, and the replicas are scattered across the nodes of the cluster.
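For a view of how a file's blocks are spread across DataNodes, the same Java FileSystem API can report each block's offset, length, and host machines. A minimal sketch, assuming the default configuration points at your cluster and the file path is passed as an argument:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfoExample {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            FileStatus status = fs.getFileStatus(new Path(args[0]));
            // One BlockLocation per block, listing the DataNodes holding replicas.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        b.getOffset(), b.getLength(),
                        String.join(",", b.getHosts()));
            }
        }
    }
}
```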

This distributed "style" gives Hadoop another big advantage: because the data and the processing system sit on the same servers in the cluster, every machine you add to the cluster raises the whole system's storage capacity and computing power in step.

Using Hadoop

As I mentioned earlier, end users do not even have to wrestle with HDFS or MapReduce themselves, because there are cloud services that run Hadoop for you. Amazon, for example, offers Hadoop on its elastic compute cloud together with its simple cloud storage service, while DataStax's open source product Brisk swaps in CassandraFS (a file system based on Apache Cassandra) to do the distributed file system's job in place of HDFS.

To work around the limits of MapReduce's "linear" approach, the Hadoop ecosystem also offers an application called Cascading, which gives developers a simpler tool and adds flexibility to how jobs are composed.

Objectively speaking, today's Hadoop is not a perfect or especially easy big data processing tool. As I mentioned, because of MapReduce's limitations in working with data, many developers are in the habit of treating Hadoop as nothing more than a simple, fast storage tool, so its data processing strengths go unrealized.

Even so, Hadoop is the best and most widely used big data processing platform available, especially when you have neither the time nor the money to store huge amounts of data in a relational database. That is why Hadoop keeps an elephant standing by in the "house" of big data, ready to process data at any moment.
