Facebook expert: Hadoop alone is not enough to handle big data

Source: Internet
Author: User
Keywords: big data

As big data sees wider development and adoption across business domains, related technologies and tools keep emerging, with the Hadoop framework drawing the most attention and use. "Don't underestimate the value of relational database technology," says Ken Rudin, Facebook's head of analytics, who recently delivered a keynote at the Strata + Hadoop World conference in New York. The Hadoop programming framework may be synonymous with the "big data" movement, he argues, but it is not the only tool an enterprise can use to extract value from unstructured information stored at massive scale.

There are many popular assumptions about big data that deserve to be questioned, first and foremost that you can simply adopt Hadoop and that Hadoop is easy to use. The problem is that Hadoop is a technology, and big data is not fundamentally about technology. Big data is about business needs. In fact, big data should encompass Hadoop and relational databases, along with any other technology suited to the task at hand.

Facebook's business model relies on processing the profile and activity data of its more than one billion users to deliver targeted advertising, Rudin says. Even so, Hadoop is not always the best tool for the job.

For example, it makes sense to run broad, exploratory analysis of a dataset in Hadoop, while relational stores are better suited to ongoing, operational analysis of what has already been discovered. Hadoop excels at getting to the lowest level of detail in a data set, whereas relational databases make more sense for storing transformations and aggregated summaries of that data. The bottom line: use the right technology for whatever the task requires.
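As a concrete illustration of that division of labor, here is a minimal sketch under assumed conditions: an upstream Hadoop job has already aggregated the raw event detail, and only the summarized result is loaded over plain JDBC into a relational table so analysts can query it interactively. The JDBC URL, credentials, and the daily_clicks table are hypothetical placeholders, not details from the article.

```java
// Minimal sketch of the hand-off described above: a Hadoop job keeps the raw
// detail, and only the aggregated result is written into a relational summary
// table for interactive queries. The JDBC URL, credentials and the daily_clicks
// table are hypothetical placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Map;

public class SummaryLoader {

    // clicksByCountry would be the output of an upstream Hadoop aggregation job.
    public static void loadDailySummary(Map<String, Long> clicksByCountry, String day)
            throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/analytics", "analyst", "secret");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO daily_clicks (day, country, clicks) VALUES (?, ?, ?)")) {
            for (Map.Entry<String, Long> e : clicksByCountry.entrySet()) {
                ps.setString(1, day);
                ps.setString(2, e.getKey());
                ps.setLong(3, e.getValue());
                ps.addBatch(); // batch the inserts into one round trip
            }
            ps.executeBatch();
        }
    }
}
```

The raw logs never leave Hadoop; only the small, query-friendly summary lands in the relational store.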

Another assumption, he says, is that simply running analysis on big data delivers value by itself. The problem, as he puts it, is that analytics only produces smarter answers to the questions you already know to ask; figuring out the right questions is still an art. Facebook has therefore focused on hiring the right people to run its analytics operation: people who not only hold a Ph.D. in statistics but are also fluent in the business.

When interviewing candidates, don't focus on "how would we calculate this"; instead, give them a business case to study and ask which metric matters most in that scenario. Enterprises should also try to cultivate a culture in which everyone participates in analysis.

According to Rudin, Facebook runs an in-house "data training camp", a two-week program that teaches employees how to do analysis. Product managers, designers, engineers and even finance staff are invited to attend. The point of having everyone participate is that they come away speaking a common data language and can discuss data issues and problems with one another.

Facebook has also changed the way statisticians and business teams are organized. If statisticians remain an independent group, they tend to sit and wait for requests from the business side and merely respond to them rather than taking the initiative. But if statisticians are placed directly inside business units, multiple groups end up solving the same problems redundantly.

Facebook has therefore adopted an "embedded" model in which analysts sit with the business teams but report up to a more senior analytics organization, which helps avoid duplicated work.

As for techniques for combining and processing big data in Hadoop, data expert Anoop has noted in another article that, in general, producing a final result requires processing and joining multiple datasets together. There are many ways to join datasets in Hadoop. MapReduce provides map-side and reduce-side joins, and these joins are nontrivial and can be very expensive operations. Pig and Hive offer the same ability to join multiple datasets: Pig provides replicated joins, merge joins and skewed joins, while Hive provides map-side joins and full outer joins for analyzing data. The important point is that, with tools such as MapReduce, Pig and Hive, data can be handled according to each tool's built-in capabilities and the actual requirements.

On analyzing large volumes of data in Hadoop, Anoop points out that in the big data/Hadoop world some problems are not complicated and the solution is straightforward; the challenge is the sheer volume of data, and that calls for different solutions. Typical analysis tasks include counting the number of distinct IDs in log files, transforming stored data within a specific date range, and ranking users. All of these tasks can be addressed with the various tools and techniques in Hadoop, such as MapReduce, Hive, Pig, Giraph and Mahout, and these tools can be flexibly extended with custom routines.
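To show why these joins count as nontrivial, expensive operations, here is a minimal sketch of a reduce-side join written directly against the Hadoop MapReduce API. The two tab-separated inputs (a users file of user_id and name, an orders file of user_id and amount) are assumptions made for the example, not datasets mentioned by Anoop.

```java
// Minimal sketch of a reduce-side join using the Hadoop MapReduce API.
// Assumed (hypothetical) inputs: users file with "user_id \t name",
// orders file with "user_id \t amount".
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReduceSideJoin {

    // Tags user records with "U" so the reducer can tell the two sources apart.
    public static class UserMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] f = value.toString().split("\t");
            ctx.write(new Text(f[0]), new Text("U\t" + f[1])); // user_id -> name
        }
    }

    // Tags order records with "O".
    public static class OrderMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] f = value.toString().split("\t");
            ctx.write(new Text(f[0]), new Text("O\t" + f[1])); // user_id -> amount
        }
    }

    // All records sharing a user_id are shuffled to the same reduce() call,
    // where the join is materialized.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            String name = null;
            List<String> amounts = new ArrayList<>();
            for (Text v : values) {
                String[] f = v.toString().split("\t");
                if ("U".equals(f[0])) {
                    name = f[1];
                } else {
                    amounts.add(f[1]);
                }
            }
            if (name == null) {
                return; // inner join: drop orders with no matching user
            }
            for (String amount : amounts) {
                ctx.write(key, new Text(name + "\t" + amount));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reduce-side join");
        job.setJarByClass(ReduceSideJoin.class);
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, UserMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, OrderMapper.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Every record from both inputs is shuffled and sorted across the network so that rows sharing a key arrive at the same reduce() call; that shuffle is exactly the inter-node sorting and communication cost raised later in this article.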

There are, in fact, several reasons why Hadoop on its own is not well suited to data analysis, according to expert Joe Brightly, who shares Rudin's opinion, including:

"Hadoop is a framework, not a solution"-he thinks that Hadoop can work immediately and efficiently in solving big data analysis, but actually "it's OK for simple queries." But for difficult analysis problems, Hadoop will quickly fail, as it requires you to develop map/reduce code directly. For this reason, Hadoop is more like the Java EE programming environment than the Business Analytics solution. "The so-called framework means that you have to do personalized and business-related development and implementation on top of it, and these all require cost."

"Hadoop's subprojects Hive and Pig are good, but they cannot escape its architectural limitations," Joe argues. Hive and Pig are excellent tools that help non-specialist engineers use Hadoop quickly and efficiently, translating analytic queries expressed in familiar SQL-like languages into Java map/reduce tasks that can be deployed in a Hadoop environment. Hive is a data warehousing tool built on Hadoop that supports data aggregation, ad hoc queries and analysis of large datasets stored in Hadoop-compatible file systems, while Pig is a high-level data-flow language and execution framework for parallel computation. But, he argues, "some of the limitations of Hadoop's map/reduce framework lead to inefficiency, particularly where communication between nodes is required, as in sorts and joins."
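For contrast, here is a sketch of the same style of join expressed through Hive rather than hand-written map/reduce, submitted from Java over Hive's JDBC interface (HiveServer2). The connection URL, credentials and the users/orders tables are hypothetical; the point is only that Hive compiles a single SQL statement into the underlying distributed jobs.

```java
// Minimal sketch: the same join expressed as one HiveQL statement and submitted
// from Java through the Hive JDBC driver (HiveServer2). The URL, credentials and
// the users/orders tables are hypothetical placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJoinExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT u.name, o.amount FROM users u JOIN orders o ON u.id = o.user_id")) {
            // Hive compiles this single statement into the distributed jobs that the
            // hand-written join above spells out explicitly.
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getString(2));
            }
        }
    }
}
```

The brevity is the appeal Joe describes; the architectural limits of the generated map/reduce jobs remain.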

"Hadoop is an excellent tool for doing some very complex data analysis," concludes Joe. Ironically, however, it also requires a lot of programming work to get answers to these questions. "This is not only in data analysis applications, it actually reflects the current use of open source framework to face the problem of selection balance." When you're choosing an open source framework or code, think about how much it can help you, how much time and cost, and how much more efficient. Also know how much of the new costs are generated by this, for example, engineers ' learning costs, development and maintenance costs, and future scalability, including the need to upgrade your and your team if the framework is used, and even security considerations, the open source framework flaw is well known.
