Big Data applications: Hadoop

Source: Internet
Author: User
Keywords: big data, Hadoop

Today, big data has become the theme of the times, and enterprise applications of big data keep growing deeper. With this popularity come many big data assumptions that deserve questioning. The first is the common belief that you can simply use Hadoop, and that Hadoop is easy to use.

The problem is that Hadoop is a technology, and big data is not fundamentally about technology. Big data is about business requirements. In fact, a big data stack should include Hadoop and relational databases, as well as any other technology appropriate to the task at hand.

For example, it makes sense to do a wide, exploratory analysis of a dataset in Hadoop, but a relational store is better suited to repeated, operational analysis of what has already been discovered. Hadoop is good for reaching the lowest level of detail in a dataset, while a relational database makes more sense for storing transformations and summaries of that data. So the bottom line is: use the right technology for whatever you need to do.

As for the techniques and methods Hadoop uses to combine and process big data, data expert Anoop noted in another article that, in general, producing a final result requires processing and joining multiple datasets together. There are many ways to join datasets in Hadoop. MapReduce provides map-side and reduce-side joins; these are nontrivial and can be very expensive operations. Pig and Hive offer the same capability for joining multiple datasets: Pig provides replicated joins, merge joins, and skewed joins, while Hive provides map-side joins and full outer joins for analyzing data.
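To make the reduce-side join idea concrete, here is a minimal Python sketch that simulates it in memory rather than on a cluster; the user/order datasets and their field layouts are invented for illustration, and this is not a Hadoop API:

```python
from collections import defaultdict

def reduce_side_join(users, orders):
    """Conceptual sketch of a MapReduce reduce-side join: the map
    phase tags each record with its source, the shuffle groups
    records by join key, and the reduce phase combines the sides."""
    grouped = defaultdict(lambda: {"user": [], "order": []})
    # "Map" phase: tag each record with the dataset it came from.
    for user_id, name in users:
        grouped[user_id]["user"].append(name)
    for user_id, amount in orders:
        grouped[user_id]["order"].append(amount)
    # "Reduce" phase: per key, emit the cross-product (an inner join).
    result = []
    for key, sides in grouped.items():
        for name in sides["user"]:
            for amount in sides["order"]:
                result.append((key, name, amount))
    return result

# Hypothetical sample data: keys without a match on both sides drop out.
joined = reduce_side_join(
    users=[(1, "alice"), (2, "bob")],
    orders=[(1, 100), (1, 200), (3, 50)],
)
```

The expense the text mentions comes from the shuffle step: every record of both datasets must be moved and grouped by key, which is why Pig's replicated join (broadcasting the small side to every mapper) can be much cheaper when one dataset is small.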

In the big data/Hadoop world, some problems are not conceptually complicated and their solutions are straightforward; the challenge is the sheer volume of data, which calls for different solutions. Typical analysis tasks include counting the number of distinct IDs in log files, transforming stored data within a specific date range, and ranking users. All of these tasks can be addressed with the variety of tools and techniques in Hadoop, such as MapReduce, Hive, Pig, Giraph, and Mahout, and these tools can flexibly extend their capabilities with custom routines.
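As a rough illustration of the first task, counting distinct IDs fits the Hadoop Streaming pattern of a mapper that emits each ID and a reducer that counts unique keys. The sketch below simulates that pipeline in plain Python; the log format, with the ID as the first whitespace-separated field, is an assumption made up for the example:

```python
import itertools

def mapper(log_lines):
    """Emit (id, 1) for each log line; assumes the ID is the first
    whitespace-separated field (a hypothetical log format)."""
    for line in log_lines:
        fields = line.split()
        if fields:
            yield (fields[0], 1)

def reducer(pairs):
    """After the shuffle sorts pairs by key, each distinct ID forms
    one group; counting the groups counts the distinct IDs."""
    distinct = 0
    for _key, _group in itertools.groupby(sorted(pairs), key=lambda kv: kv[0]):
        distinct += 1
    return distinct

# Hypothetical log lines: three events from two distinct users.
unique_users = reducer(mapper(["u1 /home", "u2 /cart", "u1 /checkout"]))
```

On a real cluster the sort-and-group step is done by the framework between the map and reduce phases; the point is that even this simple counting task has to be expressed in that two-phase shape.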

Hadoop is a framework, not a solution. When it comes to big data analysis, people mistakenly believe Hadoop works out of the box. For simple queries that may be true, but for hard analysis problems Hadoop quickly falls short, because it requires you to develop map/reduce code directly. In this respect, Hadoop is more like a Java EE programming environment than a business analytics solution. Being a framework means you must build customized, business-specific development and implementation on top of it, and all of that carries cost.
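To show what "develop map/reduce code directly" means in practice, even a trivial word count needs an explicit mapper and reducer. The following is a minimal sketch of the MapReduce programming model simulated in plain Python, not tied to any actual Hadoop API:

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit (word, 1) for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # The framework normally does this step: group values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big hadoop"])))
```

Each new question means writing (and maintaining) another mapper/reducer pair like this, which is exactly the framework-versus-solution distinction the text draws.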

Hadoop is an excellent tool for doing some very complex data analysis. Ironically, however, getting answers to those questions also requires a great deal of programming work. This is not unique to data analysis; it reflects the trade-offs involved in adopting any open source framework. When choosing an open source framework or codebase, consider how much it helps you, how much time and cost it saves, and how much more efficient it makes you. Also weigh the new costs it creates: engineers' learning costs, development and maintenance costs, future scalability, the need to upgrade your code and your team's skills as the framework evolves, and even security, since the flaws of open source frameworks are well known.

Comments:

In the big data age, many people think Hadoop is easy to use, so it stands at the forefront of big data applications. However, Hadoop still faces unresolved problems and is not omnipotent, so users should understand what it can and cannot do before adopting it.
