The basics of Hadoop


Apache Hadoop has become the driving force behind the growth of the big data industry. Technologies such as Hive and Pig are constantly mentioned alongside it, but what do they all do, and why do they carry such strange names (Oozie, ZooKeeper, Flume)?

Hadoop brings the ability to process big data cheaply (data volumes typically of 10-100 GB or more, in a variety of types, structured and unstructured alike). But how is that different from what came before?

Today's enterprise data warehouses and relational databases are good at handling structured data and can store very large amounts of it, but at a considerable cost. Their requirement for structure limits the kinds of data they can process, and that inertia makes data warehouses far less agile when confronted with massive volumes of heterogeneous data. In practice it usually means that valuable data sources inside the organization are never mined. This is the biggest difference between Hadoop and traditional approaches to data processing.

This article walks through the components of the Hadoop system and explains what each one does.

MapReduce: the core of Hadoop

Google's web search engine owes much to its algorithms, but behind the scenes MapReduce played an enormous role. The MapReduce framework has become the most influential "engine" of today's big data processing. Besides Hadoop, you will also find MapReduce in MPP databases (such as Sybase IQ's column-oriented database) and in NoSQL systems (such as Vertica and MongoDB).

The important innovation of MapReduce is that, when processing a query over a large dataset, it decomposes the task and runs the pieces on multiple nodes in parallel. When the data volume is too large for one server to handle, the advantage of distributed computing shows itself. Combining this technique with commodity Linux servers is a highly cost-effective replacement for large-scale compute arrays. Yahoo saw the future potential of Hadoop in 2006 and invited its founder, Doug Cutting, to continue developing the technology; by 2008 Hadoop had already reached a considerable scale. As the project matured through its early stages, it absorbed a number of other components to further improve its usability and functionality.
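
To make that decomposition concrete, here is a minimal word-count job written against the Hadoop MapReduce Java API: the map step splits each input line into words on every node, and the reduce step sums the counts for each word. This is only an illustrative sketch, not code from the original article; the input and output paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: runs on each node against its local block of input,
  // emitting (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: the framework groups the pairs by word; we sum the counts.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output locations in HDFS, passed as program arguments.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}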

HDFS and MapReduce

We have discussed MapReduce's ability to distribute tasks across many servers in order to handle big data. For distributed computing, every server must be able to reach the data, and that is the role of HDFS (the Hadoop Distributed File System).

The combination of HDFS and MapReduce is powerful. While a big data job is being processed, the computation does not terminate if a server in the Hadoop cluster fails. At the same time, HDFS guarantees data redundancy across the whole cluster in the event of a fault, and when the computation completes the results are written back to a node in HDFS. HDFS makes no demands on the format of the data it stores; the data can be unstructured or of any other kind. A relational database, by contrast, must structure the data and define a schema before storing it.

It is the developer's code that is responsible for making the data meaningful. Programming at the Hadoop MapReduce level uses the Java APIs, and data files are loaded into HDFS by hand.
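
As an illustration of loading a file by hand, the sketch below uses the HDFS FileSystem Java API to copy a local file into the cluster and then list the target directory. The NameNode address and the paths are hypothetical placeholders, not values from the article.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; normally picked up from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode:8020");
    FileSystem fs = FileSystem.get(conf);

    // Copy a local data file into HDFS so MapReduce jobs can read it.
    fs.copyFromLocalFile(new Path("/tmp/access.log"),
                         new Path("/data/logs/access.log"));

    // List the directory to confirm the upload.
    for (FileStatus status : fs.listStatus(new Path("/data/logs"))) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }
    fs.close();
  }
}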

Pig and Hive

For developers, using the Java APIs directly can be tedious and error-prone, and it limits the flexibility of programming on Hadoop to Java programmers. Hadoop therefore offers two solutions that make Hadoop programming easier.

Pig is a programming language that simplifies the common tasks of working with Hadoop: loading data, expressing transformations on the data, and storing the final results. Pig's built-in operations make sense of semi-structured data such as log files, and the language can be extended with custom data types written in Java, with support for data conversions.
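
As a rough sketch (not from the original article), a Pig Latin script can also be embedded in a Java program through the PigServer class; the file path, field layout, and alias names below are assumptions made purely for illustration.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
  public static void main(String[] args) throws Exception {
    // Run the script through the MapReduce execution engine.
    PigServer pig = new PigServer(ExecType.MAPREDUCE);

    // Load a (hypothetical) tab-separated log file from HDFS,
    // keep only error lines, and count them per day.
    pig.registerQuery("logs = LOAD '/data/logs/access.log' "
        + "AS (day:chararray, level:chararray, msg:chararray);");
    pig.registerQuery("errors = FILTER logs BY level == 'ERROR';");
    pig.registerQuery("grouped = GROUP errors BY day;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(errors);");

    // Store the result back into HDFS.
    pig.store("counts", "/data/logs/error_counts");
  }
}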

Hive plays the role of the data warehouse in Hadoop. Hive superimposes structure on data that already lives in HDFS and then allows that data to be queried with an SQL-like syntax. Like Pig, Hive's core functionality is extensible.

Choosing between Pig and Hive can be puzzling. Hive is better suited to data-warehouse tasks; it is mainly used with static structures and with work that calls for frequent analysis. Hive's closeness to SQL makes it an ideal point of intersection between Hadoop and other BI tools. Pig gives developers more flexibility over large data sets and allows concise scripts that transform data streams for embedding in larger applications. Pig is relatively lightweight compared with Hive, and its main advantage is that it drastically reduces the amount of code needed compared with using the Hadoop Java APIs directly. For that reason, Pig still attracts a large number of software developers.

Improving data access: HBase, Sqoop, and Flume

At its core Hadoop is a batch system: data is loaded into HDFS, processed, and then retrieved. For computing this is something of a step backwards, because interactive, random access to data is often necessary. HBase fills this gap as a column-oriented database running on top of HDFS, with Google's BigTable as its blueprint. The goal of the project is to locate and serve the required data quickly from among billions of rows hosted in the cluster. HBase uses MapReduce to process the massive amounts of data it holds, and both Hive and Pig can be used in combination with HBase; they provide high-level language support for it, which makes statistical processing of HBase data very easy.

But to grant random access to stored data, HBase also accepts some restrictions: for example, Hive running over HBase is four to five times slower than Hive over raw HDFS, and HBase can store data on the order of petabytes, compared with HDFS's capacity limit of about 30 PB. HBase is not suited to ad-hoc analysis; it is better suited to integrating big data as part of a larger application, covering logs, computed results, and time-series data.
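
As a sketch of the row-level random access described above, the following example uses the newer HBase Java client API (HBase 1.x and later; the 2012-era API used HTable instead). The table name, column family, and row key are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    // Connection settings (the ZooKeeper quorum) come from hbase-site.xml.
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("metrics"))) {

      // Write one cell: row key, column family "d", qualifier "cpu".
      Put put = new Put(Bytes.toBytes("host1-20120220"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("cpu"), Bytes.toBytes("0.73"));
      table.put(put);

      // Random read of a single row by key, without scanning the table.
      Result result = table.get(new Get(Bytes.toBytes("host1-20120220")));
      byte[] cpu = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("cpu"));
      System.out.println("cpu = " + Bytes.toString(cpu));
    }
  }
}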

Getting data in and out

Sqoop and Flume improve the interoperability between data and the rest of the system. Sqoop's function is to import data from a relational database into Hadoop, either directly into HDFS or into Hive. Flume is designed to import streaming data or log data directly into HDFS.

Hive's SQL-friendly queries make it an ideal point of integration with a wide variety of database tools, which connect through JDBC or ODBC database drivers.
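
As a brief sketch of that connectivity, a Java program can query Hive over JDBC using the HiveServer2 driver; the host, port, and table name below are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver (the hive-jdbc jar must be on the classpath).
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hiveserver:10000/default", "", "");
         Statement stmt = conn.createStatement();
         // HiveQL looks like SQL; the query is executed as MapReduce work.
         ResultSet rs = stmt.executeQuery(
             "SELECT level, COUNT(*) FROM logs GROUP BY level")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}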

ZooKeeper and Oozie: coordinating the workflow

As more and more projects join the Hadoop family and become part of the cluster, a big data processing system needs members responsible for coordinating the work. As the number of compute nodes grows, cluster members need to synchronize with one another and to know where to reach each service and how it is configured; this is ZooKeeper's job.
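
A minimal sketch of that kind of coordination, using the ZooKeeper Java client API; the connection string, znode path, and stored value are assumptions for illustration only.

import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
  public static void main(String[] args) throws Exception {
    // Connect to the ZooKeeper ensemble (hypothetical hosts), 5 s session timeout.
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000, event -> {});

    String path = "/app-config";
    // Publish a piece of shared configuration if it does not exist yet.
    if (zk.exists(path, false) == null) {
      zk.create(path, "service-host:16000".getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Any cluster member can now look up where the service lives.
    byte[] data = zk.getData(path, false, null);
    System.out.println("service at " + new String(data, StandardCharsets.UTF_8));
    zk.close();
  }
}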

Tasks executed in Hadoop sometimes need to chain several MapReduce jobs together, perhaps with dependencies between the batches. The Oozie component provides the ability to manage such workflows and their dependencies, so that developers do not have to write custom solutions.

Ambari is the newest project to join Hadoop; it aims to bring core capabilities such as monitoring and management into the Hadoop project. Ambari helps system administrators deploy and configure Hadoop, upgrade clusters, and monitor services, and it can be integrated with other system-management tools through its API.

Apache Whirr is a set of libraries for running services (including Hadoop) in the cloud, providing a high degree of complementarity. Whirr is provider-neutral and currently supports Amazon EC2 and Rackspace.

Machine learning: Mahout

The differing needs of different organizations produce all kinds of data, and analysing that data calls for a variety of methods. Mahout provides a number of scalable implementations of classic machine-learning algorithms, designed to help developers create intelligent applications more easily and quickly. Mahout's many implementations cover clustering, classification, collaborative filtering for recommendations, and frequent itemset mining.
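
As a hedged sketch of the recommendation side, the example below uses Mahout's Taste collaborative-filtering API (Mahout 0.x); the ratings file, neighbourhood size, and user ID are hypothetical.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
  public static void main(String[] args) throws Exception {
    // ratings.csv: userID,itemID,preference -- a hypothetical input file.
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // Score user similarity and pick each user's 10 nearest neighbours.
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

    // User-based collaborative-filtering recommender.
    Recommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Recommend 3 items for user 1.
    List<RecommendedItem> items = recommender.recommend(1, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}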

Using Hadoop

Typically, Hadoop is used in a distributed environment. As happened with Linux before it, vendors integrate and test the components of the Apache Hadoop ecosystem and add their own tools and management features.

Original link: http://cloud.csdn.net/a/20120220/312061.html
