Hadoop is not the only solution to big data problems



Hadoop is often held up as the one solution that can fix every problem. Whenever people talk about "big data" or "data analytics," they hear the same blurted-out answer: Hadoop! In reality, Hadoop was designed and built to solve a specific set of problems. For some problems it is at best a mediocre choice; for others, choosing Hadoop can even be a mistake. For data transformation work, or more broadly extract-transform-load work (Translator's note: ETL, the classic data-warehouse process of moving data from its raw state to a usable state), a Hadoop system offers many benefits. But if your problem falls into one of the following five categories, Hadoop may be the wrong tool.


1. The craving for big data


Many people believe they have truly "big" data, but that is usually not the case. When thinking about data volumes and the widespread belief that everyone is dealing with "big data," it is worth reading the research paper "Nobody ever got fired for buying a cluster," which reports some interesting facts. Hadoop was designed to handle data on the scale of terabytes or petabytes, yet most computing jobs in the world take input data smaller than 100 GB (in that study, the median job input at Microsoft and Yahoo is under 14 GB, and 90% of Facebook's jobs process less than 100 GB). At that scale, a scale-up (vertically scaled) solution will outperform a scale-out solution.


(Translator's note: Scale-up usually means improving overall system performance by adding or replacing hardware on a single machine, such as memory, CPUs, disks, or network devices; scale-out means improving the overall performance of a cluster by adding machines to it. The cited paper compares the two experimentally using Hadoop performance benchmarks and concludes that, in some cases, scaling up a single machine is more efficient than adding machines to a Hadoop cluster. This overturns the common assumption that a Hadoop system only needs a few inexpensive machines to achieve the best overall performance.)


So you need to ask yourself:


Do I really have more than a few terabytes of data?


Do I have a steady, massive flow of input data?


How much of that data do I actually need to manipulate and process?


2. You're in the queue


When you submit a computing job to a Hadoop system, the minimum latency is about one minute. That means the system needs a minute or more to react to a customer's purchase and come back with related product recommendations, which would require extraordinarily loyal and patient customers willing to stare at the screen for more than 60 seconds waiting for results to appear. A better approach is to precompute the related items for every product in the inventory on Hadoop, and then have the website or mobile application look up the stored results, returning an answer in a second or less. Hadoop is a very good big-data engine for that kind of pre-computation. Of course, as the results that need to be returned grow more complex, precomputing everything up front becomes less efficient.
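As a rough illustration of this "precompute in batch, serve from a fast store" pattern, here is a minimal sketch. It assumes a hypothetical output file from the batch job in which each line is a product ID, a tab, and a comma-separated list of related product IDs; the file name, format, and sample SKU are all invented for illustration. The serving side only does an in-memory lookup, which easily responds well under a second.

```python
# Minimal sketch: serve precomputed recommendations produced by a batch job.
# Assumed (hypothetical) input format, one line per product:
#   product_id \t comma,separated,related,product_ids

def load_recommendations(path):
    """Load the batch job's output into an in-memory lookup table."""
    table = {}
    with open(path) as f:
        for line in f:
            product_id, related = line.rstrip("\n").split("\t")
            table[product_id] = related.split(",")
    return table

# At serving time the expensive work is already done; a lookup is O(1).
recommendations = load_recommendations("related_items.tsv")  # hypothetical file name

def recommend(product_id):
    return recommendations.get(product_id, [])

if __name__ == "__main__":
    print(recommend("sku-12345"))  # returns the precomputed related items, or []
```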


So you need to ask yourself:

What is the approximate range of response times my users expect?


Which computing tasks can be run as batch jobs?


(Translator's note: The author is presumably using the classic product-recommendation feature of e-commerce websites as a use case to describe how such a feature can be implemented with Hadoop.)


3. How soon does your problem need an answer?


For problems that require real-time responses to queries, Hadoop is not a good solution. A Hadoop job spends time in the map and reduce phases and in the shuffle phase, and none of these can be guaranteed to finish within a bounded time, so Hadoop is unsuitable for applications with real-time requirements. A practical example is computing the volume-weighted average price (VWAP) used in program-trading systems for futures or stock markets, which is usually done in real time: the trading system must return results to users within a bounded time so that they can trade on them.


(Translator's note: In Hadoop's MapReduce, the shuffle phase refers to redistributing the outputs of many map tasks to one or more reduce tasks, i.e., shuffling and assigning the data; this blog post explains it in more detail: http://langyu.iteye.com/blog/992916. The use case here is how an investment bank calculates the benchmark price of a stock or futures trade. I would expect each query in this calculation to need a response time under 100 ms; see http://baike.baidu.com/view/1280239.htm and http://baike.baidu.com/view/945603.htm. People working at investment banks would have more to say about this.)
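For reference, VWAP itself is just a running ratio: the sum of price times volume divided by the total volume traded. The sketch below is illustrative only and not tied to any real trading system; it shows that the difficulty is latency rather than computational complexity: each incoming trade updates the figure in constant time, and the hard part is delivering that update within the trading deadline, not computing it.

```python
# Illustrative only: a running volume-weighted average price (VWAP).
# VWAP = sum(price_i * volume_i) / sum(volume_i)

class RunningVWAP:
    def __init__(self):
        self.price_volume_sum = 0.0
        self.volume_sum = 0.0

    def on_trade(self, price, volume):
        """Fold one trade into the running totals; O(1) work per tick."""
        self.price_volume_sum += price * volume
        self.volume_sum += volume
        return self.vwap()

    def vwap(self):
        return self.price_volume_sum / self.volume_sum if self.volume_sum else 0.0

vwap = RunningVWAP()
for price, volume in [(10.0, 100), (10.2, 50), (9.9, 200)]:  # made-up ticks
    print(vwap.on_trade(price, volume))
```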


Data analysts, for their part, really want to use query languages such as SQL, but Hadoop does not support interactive, ad hoc access to the data stored in it. Even with Hive to translate SQL-like queries into specific MapReduce jobs, random access to data is not Hadoop's strong suit. Google's Dremel system (and its extension, the BigQuery service) is designed to return query results over massive data sets within seconds, and its SQL dialect also supports joins between tables well. Other technologies that support real-time responses include Shark from the AMPLab at the University of California, Berkeley, and the Hortonworks-led Stinger initiative.


So you need to ask yourself:


What level of interactivity and real-time access to the data do my users and analysts expect?


Do my users and analysts need access to the full terabytes of data, or only to a subset of it?


We should also recognize that Hadoop works in batch mode. This means that when new data is added, the data-processing job has to be rerun over the entire data set, so as the data grows, analysis takes longer. In practice, new data arrives in small increments, individual records change, and small updates happen continuously, and business processes often need to make decisions based on those events. But no matter how quickly data is fed into a Hadoop system, Hadoop still processes it as a batch. YARN, the new framework in Hadoop 2.0, promises to address this problem. The Storm platform used by Twitter is another viable and popular alternative; combined with a distributed messaging system such as Kafka, it can support a wide range of stream-processing and aggregation requirements. A pain point is that Storm does not currently support load balancing, which Yahoo's S4 does.
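To make the batch-versus-streaming contrast concrete, here is a minimal, library-free sketch; it does not use the real Storm or Kafka APIs, and the event data is invented. The batch version rescans the whole data set on every run, while the streaming version folds each new event into a running result the moment it arrives.

```python
# Library-free sketch of batch recomputation vs. incremental (streaming) aggregation.
# Real systems would use Storm/Kafka or similar; this only illustrates the two models.

from collections import Counter

def batch_count(all_events):
    """Batch model: every run scans the entire data set from scratch."""
    counts = Counter()
    for event in all_events:
        counts[event["type"]] += 1
    return counts

class StreamingCounter:
    """Streaming model: each event updates the running result immediately."""
    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        self.counts[event["type"]] += 1
        return self.counts

events = [{"type": "click"}, {"type": "view"}, {"type": "click"}]  # made-up events
print(batch_count(events))

stream = StreamingCounter()
for event in events:
    print(stream.on_event(event))  # result is current after every event, not once per batch run
```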




So you need to ask yourself:


How long is the lifecycle of my data?


How quickly does my business need to derive value from incoming data?

How important is it for my business to respond to data changes and updates in real time?


Real-time advertising and sensor-monitoring applications require real-time processing of streaming data, and Hadoop and the tools built on top of it are not the only options for such problems. At the recent Indy 500, the McLaren team used SAP's HANA in-memory database in its ATLAS system for data analysis, combined with MATLAB simulations, to analyze and compute on telemetry data obtained in real time during the race. Many data analysts believe that Hadoop's future depends on its ability to support real-time and interactive workloads.


(Translator's note: YARN is the new resource-management and job-scheduling framework in Hadoop 2.0. Unlike MapReduce, it supports a broader range of programming models, including real-time query and computation tasks; see http://hortonworks.com/hadoop/yarn/. Storm is a Twitter-led open-source distributed data-processing system whose main strength is demanding real-time stream processing; see http://storm-project.net. Taobao and Alibaba both use Storm. S4 (Simple Scalable Streaming System) is another distributed real-time stream-processing system, created by Yahoo; see http://incubator.apache.org/s4/. This page links to many more articles comparing Yahoo S4 and Storm: http://blog.softwareabstractions.com/the_software_abstractions/2013/06/Links-comparing-yahoo-s4-and-storm-for-continuous-stream-processing-aka-real-time-big-data.html. Kafka is an Apache open-source project: http://kafka.apache.org/. HANA is a commercial product from SAP, a scalable in-memory database solution that supports real-time big-data analysis and computation; see http://www.sap.com/hana. MATLAB is a scientific-computing product developed by MathWorks: www.mathworks.com/products/matlab. The McLaren team is the famous British Formula 1 team, very successful in F1 racing, and it has also taken part in the famous American Indy 500. For the story of how they use a big-data platform to process car telemetry and improve performance, read http://blogs.gartner.com/doug-laney/the-indy-500-big-race-bigger-data/.)


4. I just broke up with my social network


Hadoop, and the MapReduce framework in particular, is at its best when the data can be decomposed into key-value pairs without losing context or the implicit relationships between records. Structures such as graphs, however, contain many implicit relationships: edges, subtrees, parent-child relationships between nodes, weights, and so on, and not all of them can be represented on a single node. Graph algorithms therefore need to consult the complete or partial state of the graph at every iteration. Such algorithms are largely impossible to express in the MapReduce framework, and even when they can be expressed, the result is a very roundabout solution. There is also the problem of devising a strategy for partitioning the data across nodes. If the main data structure you are working with is a graph or a network, you are better off choosing a graph-oriented database such as Neo4j or DEX, or looking into newer projects such as Google's Pregel or Apache Giraph.
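The sketch below, a deliberately simplified PageRank-style iteration written as plain map and reduce functions over a toy graph, illustrates the awkwardness: each iteration needs the complete rank table produced by the previous one, so on Hadoop every iteration becomes a separate MapReduce job whose output must be written out and read back in.

```python
# Simplified PageRank-style iteration expressed as map/reduce steps (illustrative only).
# On Hadoop, each iteration would be a separate MapReduce job, because
# iteration k needs the complete rank table produced by iteration k-1.

from collections import defaultdict

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # toy adjacency list

def map_step(ranks):
    """Emit (neighbor, contribution) pairs, like a mapper would."""
    for node, neighbors in graph.items():
        share = ranks[node] / len(neighbors)
        for neighbor in neighbors:
            yield neighbor, share

def reduce_step(contributions, damping=0.85):
    """Sum contributions per node, like a reducer would."""
    totals = defaultdict(float)
    for node, share in contributions:
        totals[node] += share
    return {node: (1 - damping) + damping * total for node, total in totals.items()}

ranks = {node: 1.0 for node in graph}
for _ in range(10):                       # 10 iterations = 10 chained "jobs" on Hadoop
    ranks = reduce_step(map_step(ranks))  # output of one pass is the input of the next
print(ranks)
```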


So you need to ask yourself:


Is the underlying structure of my data as important as the data itself?


Are the insights I hope to gain from that structure as important as, or even more important than, the data itself?


(Translator's note: Neo4j is available under both commercial and GPL licenses; see http://www.neo4j.org/. DEX is a commercial product; see http://www.sparsity-technologies.com/dex. The Apache Giraph project (http://giraph.apache.org) is an open-source implementation based on Google's Pregel paper (http://dl.acm.org/citation.cfm?id=1807184, http://kowshik.github.io/jpregel/pregel_paper.pdf); it is a big-data processing platform for analyzing social networks and other data that can be modeled as graph or network structures.)


5. Death by MapReduce


Many computing tasks, jobs, and algorithms are simply not a good fit for the MapReduce framework. One class of such problems was discussed in the previous section. Another class consists of computations in which each step requires the result of the previous step; a mathematical example is the Fibonacci sequence. Some machine-learning algorithms, such as gradient descent and expectation-maximization, are also a poor fit for the MapReduce pattern. Researchers have proposed specific optimizations and strategies for these algorithms (maintaining global state, referencing shared data structures during computation), but implementing them on Hadoop still tends to be complex and hard to understand.
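A small sketch of why such algorithms fight the model, using a toy one-dimensional optimization problem of my own choosing: each gradient-descent step needs the parameter value produced by the previous step, so the computation is an inherently sequential chain rather than an embarrassingly parallel map over independent records.

```python
# Toy gradient descent on f(w) = (w - 3)^2 (illustrative only).
# Step k cannot start until step k-1 has produced w: the dependency chain is sequential,
# which a single MapReduce pass cannot express without chaining one job per iteration.

def gradient(w):
    return 2 * (w - 3)          # derivative of (w - 3)^2

w = 0.0
learning_rate = 0.1
for step in range(50):
    w = w - learning_rate * gradient(w)   # depends on the previous iterate
print(round(w, 4))              # converges toward the minimum at w = 3
```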


So you need to ask yourself:


Does my business depend heavily on specific algorithms or domain-specific processing?

Does my technical team have the capability and resources to analyze those algorithms and express them in the MapReduce framework?


(Translator's note: Gradient descent is a method commonly used in mathematical optimization; see http://zh.wikipedia.org/wiki/%E6%A2%AF%E5%BA%A6%E4%B8%8B%E9%99%8D%E6%B3%95. The expectation-maximization algorithm is commonly used in probabilistic models and the corresponding machine-learning algorithms; see http://zh.wikipedia.org/zh-cn/%E6%9C%80%E5%A4%A7%E6%9C%9F%E6%9C%9B%E7%AE%97%E6%B3%95.)


In addition, there are other situations to consider: for example, the data volume may simply not be large, or the data set, though large, may consist mainly of billions of small files that cannot be concatenated (for example, large numbers of image files that each need to be read in as a whole). As noted above, forcing computations that do not fit MapReduce's split-and-merge model onto Hadoop only makes Hadoop harder to use.


Now that we have looked at the scenarios where Hadoop is inappropriate, let's look at the circumstances in which it is the right choice.


You need to ask yourself whether your organization:


Wants to extract information from piles of text-format log files?


Wants to convert largely unstructured or semi-structured data into a useful, structured format?


Has computing jobs that run over the entire data set every night (for example, a credit card company processing all of the day's transactions overnight)?


Can rely on the conclusions from one processing run remaining valid until the next scheduled run (unlike stock-market prices, which change every day)?


If the answer to all of the above is "yes," then you should look into Hadoop in depth.


The kinds of problems listed above represent a considerable share of the business problems Hadoop is well suited to (even though many industry reports conclude that deploying such Hadoop systems to production is far from easy). Hadoop's computational model fits tasks where you need to process a huge volume of unstructured or semi-structured data, then summarize the content or convert the results into structured data and hand them to other components or systems. If the collected data can easily be turned into an ID and its corresponding content (in Hadoop terms, a key-value pair), then you can use that simple association to perform all kinds of rollup computations.
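As a concrete example of that key-value pattern, here is a minimal Hadoop Streaming style mapper and reducer in Python. It is a sketch, not a production job: the assumption that the ID of interest is the first whitespace-separated field of each log line is mine. The mapper turns each raw line into an ID/count pair, and the reducer rolls the counts up per ID; Hadoop's shuffle guarantees that all pairs for the same key arrive at the same reducer, sorted by key.

```python
# Minimal Hadoop Streaming style rollup (sketch; the field layout is assumed).
# mapper: read raw log lines from stdin, emit "key<TAB>1" per record.
# reducer: read the shuffled, key-sorted pairs from stdin and sum the counts per key.

import sys

def mapper():
    for line in sys.stdin:
        fields = line.strip().split()
        if fields:                       # assume the first field is the ID of interest
            print(f"{fields[0]}\t1")

def reducer():
    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{count}")
            current_key, count = key, 0
        count += int(value)
    if current_key is not None:
        print(f"{current_key}\t{count}")

if __name__ == "__main__":
    # With Hadoop Streaming, this script would be supplied as the -mapper and
    # -reducer commands (invocation shown here only in outline).
    mapper() if sys.argv[1:] == ["map"] else reducer()
```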


In general, the key is to recognize the resources you have and to understand the nature of the problem you want to solve. Combine the points raised in this article with your own insight and you will be able to choose the tools that suit you best. In some cases, that final solution may very well be Hadoop.