Illustrated: 5 minutes to learn about Hadoop

Source: Internet
Author: User
Keywords: massive data

Hadoop took its name from the toy elephant belonging to the child of one of its developers. It was created because the data storage and processing tools of the day proved powerless against the massive data that emerged after the internet bubble. First, Google proposed the MapReduce architecture, which could cope with the flood of data generated by its task of consolidating the world's information and made that data highly accessible. Building on this, Yahoo developed the MapReduce-based Hadoop in 2005, and it was officially released as an open-source tool under the Apache License in 2007.

Over the years, Hadoop has grown into something like a large-scale operating system for distributed parallel processing, applied to the massive data generated today. Like an ordinary operating system, Hadoop includes a complete file system, runs the programs written against it, manages those programs in a distributed fashion, and returns the results of their computations.

Hadoop supports data-intensive distributed applications that run simultaneously on large clusters built from inexpensive hardware, and it is published under the Apache v2 license. A Hadoop cluster is reliable, stable, and highly scalable, and can be used to query very large data sets. Hadoop is written in Java, which means it can run on any platform, and it is used by global distributors and big-data technology vendors that have built other software layers on top of it.

What makes Hadoop especially useful is the Hadoop Distributed File System (HDFS). HDFS is Hadoop's storage system: it decomposes data into smaller pieces called "blocks" and distributes those blocks throughout the cluster. This distribution allows the map and reduce functions to run on small subsets of the data rather than on one large dataset, which improves efficiency, saves processing time, and provides the scalability needed to handle massive amounts of data.
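The idea of chopping a file into fixed-size blocks can be sketched in a few lines of Python. This is a toy illustration of the blocking concept only, not the actual HDFS implementation, and the block size is scaled down from the 64 MB or 128 MB defaults real HDFS uses:

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Toy HDFS-style blocking: chop a byte string into fixed-size
    blocks (real HDFS defaults to 64 MB or 128 MB per block)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# A 10-byte "file" with a 4-byte block size yields blocks of 4, 4, and 2 bytes;
# in a real cluster each block would be replicated across several nodes.
blocks = split_into_blocks(b"0123456789", block_size=4)
print([len(b) for b in blocks])  # → [4, 4, 2]
```

Each block can then be processed independently on whichever node holds it, which is what lets the map phase below work on small subsets in parallel.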

MapReduce is a software framework and programming model for processing and retrieving, in parallel, the massive amounts of data stored in a Hadoop system. MapReduce libraries have been written in a variety of programming languages, so Hadoop can work with all of them. In addition, MapReduce can handle both structured and unstructured data.

MapReduce works in two steps. The first step, "map", splits the data into smaller subsets and assigns those subsets to different nodes in the cluster. A node may perform this operation again, producing a hierarchical tree structure that divides the data into ever-smaller subsets. The data is processed on these nodes, and the results are returned to the master node. In the second step, "reduce", the master collects all of the returned data and combines it into some form of reusable output. The MapReduce framework manages all of these parallel tasks and the system as a whole, and it forms the heart of Hadoop.
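The two steps above can be sketched with the classic word-count example in plain Python. This is a single-process simulation of the map, shuffle, and reduce phases for illustration, not a real Hadoop job; in a cluster, the map and reduce calls would run on different nodes and the framework would perform the shuffle:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document.
    In Hadoop, each node would run this on its local blocks."""
    pairs = []
    for doc in documents:
        for word in doc.split():
            pairs.append((word, 1))
    return pairs

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does
    between the map and reduce phases."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine each key's values into a single result."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data", "big clusters big data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # → {'big': 3, 'data': 2, 'clusters': 1}
```

Because each (word, 1) pair is independent, the map phase parallelizes trivially, and because the shuffle groups by key, each reduce call needs only one word's values, which is exactly what lets Hadoop spread the work across a cluster.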

By fusing these technologies, massive data can easily be stored, processed, and analyzed in fractions of a second. Over the past few years, Hadoop's big-data ecosystem has proven very successful and looks promising. With the release of Hadoop 2.0, it uses a new job-processing framework called YARN. YARN stands for Yet Another Resource Negotiator; it is the application-scheduling module that manages the cluster's computing resources. YARN enables multiple data-processing engines, such as interactive SQL, real-time streaming, data science, and batch processing, to work on information stored in a single platform, creating new approaches to analysis.

Hadoop is a powerful tool: as of 2012, more than 25% of companies were using Hadoop to manage their data, up over 10 percent year-on-year. Here are some reasons why businesses use it:

Low cost;

High computational power;

High scalability;

Storage flexibility;

Data protection.

Hadoop is used in almost every industry, from retail to government to finance. The following infographic, produced by Solix, provides a more in-depth description of Hadoop and some interesting predictions.


Original link: http://www.bigdata-startups.com/BigData-startup/what-is-hadoop-and-five-reasons-organisations-use-hadoop-infographic/
