Hadoop is often presented as the one solution to every problem: mention "big data" or "data analysis" and someone will blurt out the answer, "Hadoop!" In reality, Hadoop was designed and built to solve a specific range of problems. For some problems it is at best a poor choice; for others, choosing Hadoop could even be a mistake. For data transformation operations, or more broadly, extract-transform-load operations (Translator's note: ETL, the classic data-warehouse process of moving data from its initial state to a usable state), a Hadoop system offers many benefits; but if your problem falls into one of the following five categories, Hadoop may be an inappropriate solution.
1. A craving for big data
Many people believe they have truly "big" data, but this is usually not the case. When thinking about data size and the assumption that most of us are dealing with "big data", it is worth consulting the research paper "Nobody ever got fired for buying a cluster", which reports some interesting facts. Hadoop was designed to handle data in terabytes or petabytes, yet most computing jobs in the world process less than 100 GB of input. (In the paper's statistics, the median job input at Microsoft and Yahoo is under 14 GB, and 90% of Facebook jobs process less than 100 GB.) At these sizes, a vertically scaled (scale-up) solution will outperform a scale-out solution in terms of performance.
(Translator's note: scaling up usually means improving a system's overall performance by adding or replacing hardware such as memory, CPU, disks, or network devices on a single machine; scaling out means improving a cluster's overall performance by adding machines to it. The paper compares the two approaches experimentally on Hadoop performance benchmarks and concludes that, in some cases, vertical scaling on one machine is more efficient than adding machines to a Hadoop cluster. This conclusion breaks the common assumption that a Hadoop system built from a few inexpensive machines is automatically the way to get the best overall performance.)
So you need to ask yourself:
Do I have more than a few terabytes of data?
Do I have a steady, massive inflow of data?
How much of that data will I need to operate on and process?
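To make the first question concrete, here is a back-of-envelope check. The machine sizes and thresholds are illustrative assumptions, not recommendations; the point is simply that the job sizes reported in the paper fit comfortably on one scale-up server:

```python
def fits_on_one_machine(data_gb, ram_gb=256, disk_tb=4):
    """Rough, illustrative check: could a single beefy server hold the data?

    The RAM and disk figures are hypothetical examples of a scale-up box.
    Returns (fits_in_ram, fits_on_disk).
    """
    fits_in_ram = data_gb <= ram_gb
    fits_on_disk = data_gb <= disk_tb * 1024
    return fits_in_ram, fits_on_disk

# The median job input reported for Microsoft/Yahoo (~14 GB):
print(fits_on_one_machine(14))
# A multi-terabyte working set, where scale-out starts to pay off:
print(fits_on_one_machine(5000))
```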
2. You're in the queue
When you submit a computing job to a Hadoop system, the minimum latency is about one minute. That means the system would need a minute to respond to a customer's purchase with related product recommendations, which requires very loyal and patient customers who will stare at the screen for more than 60 seconds waiting for a result. A better solution is to precompute, on Hadoop, the related items for every item in the inventory, and then serve those stored results through a web site or mobile application with response times of a second or less. Hadoop is an excellent big-data engine for this kind of up-front computation. Of course, as the results that must be returned grow more complex, fully precomputing them becomes less efficient.
So you need to ask yourself:
What response times does my system need to provide?
Which computing tasks can be run as batch jobs?
(Translator's note: the author is using the classic product-recommendation feature of an e-commerce site as a use case to describe how such a feature might be implemented with Hadoop.)
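The precompute-then-serve pattern described above can be sketched as follows. This is a minimal, illustrative stand-in with made-up purchase data: the expensive co-occurrence computation would run offline (e.g. as a nightly Hadoop job), and the serving layer does only a dictionary or key-value-store lookup:

```python
from collections import defaultdict

def batch_compute_related(purchases):
    """Offline step (a stand-in for the nightly Hadoop job).
    purchases: list of (customer, item) pairs.
    Returns {item: [other items most often bought by the same customers]}."""
    bought_by = defaultdict(set)
    for customer, item in purchases:
        bought_by[customer].add(item)
    co_counts = defaultdict(lambda: defaultdict(int))
    for items in bought_by.values():
        for a in items:
            for b in items:
                if a != b:
                    co_counts[a][b] += 1
    return {item: sorted(others, key=others.get, reverse=True)
            for item, others in co_counts.items()}

def recommend(related, item, k=3):
    """Online step: a plain lookup against the precomputed table (sub-second)."""
    return related.get(item, [])[:k]

purchases = [("u1", "book"), ("u1", "lamp"),
             ("u2", "book"), ("u2", "lamp"), ("u2", "pen")]
related = batch_compute_related(purchases)
print(recommend(related, "book"))
```

The design point is the split: all the heavy lifting happens before any user asks, so the interactive path never touches Hadoop.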
3. How long can you wait for an answer?
Hadoop is not a good solution for problems that require real-time responses to queries. A Hadoop job spends time in the map and reduce phases and in the shuffle phase between them, and none of these can be guaranteed to finish within a fixed time, so Hadoop is unsuitable for applications with real-time requirements. A practical example is the volume-weighted average price (VWAP) calculation used in program-trading systems for futures or stocks, which is usually computed in real time: the trading system must return results to the user within a bounded time so that trades can be made.
(Translator's note: in Hadoop's MapReduce, the shuffle phase distributes the output of multiple map tasks to one or more reduce tasks, shuffling and reassigning the data; this blog post explains it in more detail: http://langyu.iteye.com/blog/992916. The use case here is how an investment bank computes a benchmark price for a stock or futures trade. I believe the response time for each such query should be under 100 ms; see http://baike.baidu.com/view/1280239.htm and http://baike.baidu.com/view/945603.htm. Colleagues working at investment banks would have more to say about this.)
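For reference, VWAP itself is a simple running calculation; the hard part is the latency budget, not the math. A minimal sketch, with trade data invented for illustration:

```python
def vwap(trades):
    """Volume-weighted average price over a list of (price, volume) trades:
    sum(price * volume) / sum(volume)."""
    total_value = sum(price * volume for price, volume in trades)
    total_volume = sum(volume for _, volume in trades)
    return total_value / total_volume

# Illustrative trades: (price, volume)
trades = [(10.0, 100), (10.2, 300), (9.9, 200)]
print(round(vwap(trades), 4))
```

A real trading system would update this incrementally as each tick arrives, which is exactly the access pattern a batch MapReduce job cannot serve.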
Data analysts, for their part, really want to use query languages such as SQL, but Hadoop does not support immediate, ad-hoc access to the data it stores. Even if you use Hive to translate SQL-like queries into specific MapReduce jobs, random access to data is not Hadoop's strength. Google's Dremel system (and its extension, the BigQuery service) is designed to return results over massive amounts of data within seconds, and its SQL dialect also supports the various joins between tables well. Other technologies that support real-time responses include Shark, from the AMPLab at the University of California, Berkeley, and the Hortonworks-led Stinger initiative.
So you need to ask yourself:
What degree of interactivity and real-time data access do my users and analysts expect?
Do my users want access to terabytes of data, or only to a subset of it?
(Translator's note: Apache Hive is an open-source project in the Hadoop ecosystem whose main purpose is to provide near-ANSI SQL operations on data in a Hadoop system, so that analysts familiar with SQL can query data on Hadoop. Dremel is a real-time query system developed by Google for big data; using a carefully designed columnar storage structure and a massively parallel query mechanism, it can analyze and query 1 PB of data within 3 seconds (there is an English paper, with a Chinese translation available). BigQuery is a public SaaS service that Google built on Dremel for operating on large amounts of data. The Berkeley Data Analytics Stack (BDAS) is a Hadoop-based big-data platform from the AMPLab, comprising multiple open-source projects; see https://amplab.cs.berkeley.edu/software/. Shark is a project within BDAS, developed in Scala, that provides a SQL-like data-manipulation interface fully compatible with Hive; its main feature is that it uses Spark to translate queries into computing tasks, and Spark speeds up queries and computation by using large amounts of memory on the Hadoop cluster's nodes for data caching and in-memory computation. See http://shark.cs.berkeley.edu/. Hortonworks is one of several companies currently focused on providing Hadoop-based big-data systems and applications; Stinger is the name Hortonworks uses for a series of Hadoop-based projects and improvements intended to boost Hive query performance, mainly by optimizing Hive's file storage format and by analyzing and optimizing Hive query requests.)
We should recognize that Hadoop works in batch mode. This means that when new data arrives, the data-processing job must be rerun over the entire data set, so as the data grows, the analysis time grows with it. In practice, small increments of new data, changes to a single kind of data, and micro-updates arrive continuously in real time, and business processes typically need to make decisions based on these events. Yet no matter how quickly data is ingested into a Hadoop system, Hadoop still processes it as a batch. YARN, the new framework in Hadoop 2.0, promises to address this problem. The Storm platform used by Twitter is another viable and popular alternative; combined with a distributed messaging system such as Kafka, Storm can support many kinds of stream processing and aggregation. The pain point is that Storm currently does not support load balancing, though Yahoo's S4 does.
So you need to ask yourself:
How long is the lifecycle of my data?
How quickly does my business need to derive value from incoming data?
How important is it for my business to respond to real-time data changes and updates?
Real-time advertising and sensor-monitoring applications require real-time processing of streaming data, and Hadoop and the tools built on top of it are not the only options for such problems. At a recent Indy 500, the McLaren team used SAP's HANA in-memory database in its ATLAS system for data analysis, combined with MATLAB simulations, to analyze and compute on telemetry data obtained in real time during the race. Many data analysts believe that Hadoop's future lies in its ability to support real-time and interactive operations.
(Translator's note: YARN is the new resource-management and job-processing framework in Hadoop 2.0. Unlike MapReduce, it claims a broader programming model and also supports real-time query and computation tasks; see http://hortonworks.com/hadoop/yarn/. Storm is an open-source distributed data-processing system led by Twitter, whose main feature is support for demanding real-time data processing; see http://storm-project.net. Taobao and Alibaba both use Storm. The Simple Scalable Streaming System (S4) is another distributed real-time stream-processing system, created by Yahoo; see http://incubator.apache.org/s4/. This page collects many articles comparing Yahoo S4 and Storm: http://blog.softwareabstractions.com/the_software_abstractions/2013/06/links-comparing-yahoo-s4-and-storm-for-continuous-stream-processing-aka-real-time-big-data.html. Kafka is an Apache open-source project: http://kafka.apache.org/. HANA is a commercial product from SAP, a scalable in-memory database solution that supports real-time big-data analysis and computation; see http://www.sap.com/HANA. MATLAB is a scientific-computing product developed by MathWorks: www.mathworks.com/products/matlab. McLaren is the famous British F1 team, highly successful in Formula One, and they have also taken part in the famous American Indy 500; the story of how they use a big-data platform to process car data and improve performance is told in this article: http://blogs.gartner.com/doug-laney/the-indy-500-big-race-bigger-data/.)
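The stream-processing style described above, consuming events the moment they arrive and updating a small aggregate immediately rather than rerunning a batch over everything, can be sketched without any framework. This is a toy per-key counter over tumbling time windows; in a real deployment, Storm with Kafka (or S4) would provide the distribution, delivery, and fault tolerance that this sketch ignores:

```python
from collections import defaultdict

class TumblingWindowCounter:
    """Toy stand-in for stream aggregation: counts events per key within
    fixed-size time windows, emitting each window's counts when it closes."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.current_start = None
        self.counts = defaultdict(int)
        self.emitted = []  # list of (window_start, {key: count})

    def on_event(self, timestamp, key):
        if self.current_start is None:
            self.current_start = timestamp
        # Close and emit any windows this event has moved past.
        while timestamp >= self.current_start + self.window:
            self.emitted.append((self.current_start, dict(self.counts)))
            self.counts = defaultdict(int)
            self.current_start += self.window
        self.counts[key] += 1

# Illustrative event stream: (timestamp_seconds, event_type)
events = [(0, "click"), (1, "click"), (2, "buy"), (5, "click")]
agg = TumblingWindowCounter(window_seconds=5)
for ts, key in events:
    agg.on_event(ts, key)
print(agg.emitted)
```

Each event updates state in constant time, so results are available as soon as a window closes, instead of after a full batch rerun.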
4. I just broke up with my social network
Hadoop, and the MapReduce framework in particular, is at its best when data can be decomposed into key-value pairs without losing context or the implicit relationships between records. Graph-like data structures, however, contain many implicit relationships, such as edges, subtrees, parent-child links between nodes, and weights, and not all of them can be represented on a single node of the graph. Graph algorithms therefore need to carry complete or partial information about the current graph into each iteration of the computation. Such algorithms are largely impossible to express in the MapReduce framework, and even when they can be, the result is a very roundabout solution. Another issue is devising a strategy for partitioning the data across nodes. If the main data structure you work with is a graph or a network, you are better off choosing a graph database such as Neo4j or DEX, or studying the newer Google Pregel or Apache Giraph projects.
So you need to ask yourself:
Is the underlying structure of my data as important as the data itself?
Are the insights I hope to draw from that structure as important as, or even more important than, the data itself?
(Translator's note: Neo4j is dual-licensed, commercial and GPL; see http://www.neo4j.org/. DEX is a commercial product; see http://www.sparsity-technologies.com/dex. The Apache Giraph project, http://giraph.apache.org, is an open-source implementation based on Google's Pregel paper (http://dl.acm.org/citation.cfm?id=1807184, http://kowshik.github.io/jpregel/pregel_paper.pdf); it is a big-data processing platform used to analyze data, such as social networks, that can be abstracted into graph or network structures.)
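To see why such algorithms fight the MapReduce model, consider PageRank, the textbook iterative graph computation: every pass needs the complete rank vector from the previous pass, so on Hadoop each iteration becomes a separate job that rereads and rewrites the whole graph. A single-machine sketch of the iteration (toy three-node graph, the customary damping factor of 0.85; dangling nodes are ignored for brevity):

```python
def pagerank(graph, damping=0.85, iterations=20):
    """graph: {node: [outgoing neighbors]}. Returns {node: rank}.
    Each iteration depends on the complete result of the previous one --
    exactly the global state that MapReduce makes awkward to carry."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
        for node, neighbors in graph.items():
            if neighbors:
                share = rank[node] / len(neighbors)
                for nbr in neighbors:
                    new_rank[nbr] += damping * share
        rank = new_rank
    return rank

toy = {"a": ["b"], "b": ["c"], "c": ["a"]}
print({n: round(r, 3) for n, r in pagerank(toy).items()})
```

Pregel and Giraph keep this loop but distribute it vertex by vertex, passing messages between supersteps instead of rereading the data set each round.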
Many computational tasks, jobs, and algorithms are inherently unsuitable for the MapReduce framework. One such class of problems was covered in the previous section. Another is tasks in which each step of the computation depends on the results of the previous step; a mathematical example is computing the Fibonacci sequence. Some machine-learning algorithms, such as gradient descent and expectation-maximization, are also poor fits for the MapReduce pattern. Many researchers have suggested the specific optimizations and strategies these algorithms need (global state, referencing shared data structures during computation), but implementing particular algorithms on Hadoop can still become complex and hard to understand.
So you need to ask yourself:
Does my business depend heavily on specific algorithms or domain-specific processing?
Does my technical team have the capacity and resources to adapt those algorithms to the MapReduce framework?
(Translator's note: the gradient method is commonly used in mathematical optimization; see http://zh.wikipedia.org/wiki/%E6%A2%AF%E5%BA%A6%E4%B8%8B%E9%99%8D%E6%B3%95. The expectation-maximization algorithm is commonly used in probabilistic models and the corresponding machine-learning algorithms; see http://zh.wikipedia.org/zh-cn/%E6%9C%80%E5%A4%A7%E6%9C%9F%E6%9C%9B%E7%AE%97%E6%B3%95.)
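The Fibonacci example makes the dependence concrete: step n cannot start until steps n-1 and n-2 have finished, so there is nothing to hand out to mappers in parallel:

```python
def fibonacci(n):
    """Return the n-th Fibonacci number. Each value depends on the two
    previous results -- an inherently sequential chain of computation
    that MapReduce's parallel map step cannot break apart."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print([fibonacci(i) for i in range(8)])  # [0, 1, 1, 2, 3, 5, 8, 13]
```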
There are other situations to consider as well: for example, the data set may not be large, or it may be large but consist mainly of billions of small files that cannot be concatenated (for instance, many image files that must be fed in as inputs of different shapes). As noted earlier, forcing computing tasks that do not fit MapReduce's partition-and-merge principles onto Hadoop only makes Hadoop harder to use.
Now that we have analyzed the scenarios in which Hadoop is inappropriate, let's look at the circumstances in which it is the right choice.
You need to ask yourself whether your organization:
Wants to extract information from a pile of text-format log files?
Wants to convert mostly unstructured or semi-structured data into a useful, structured format?
Runs computing tasks over the entire data set every night (the way a credit-card company processes all of the day's transactions overnight)?
Expects the conclusions from one round of data processing to remain consistent with those from the next scheduled round (unlike stock-market prices, which change every day)?
If the answer to all of these questions is "yes", then you should delve into Hadoop.
The types of problems just listed represent a considerable portion of the business problems that Hadoop can solve (although many industry reports conclude that deploying these kinds of Hadoop systems to production is not easy). Hadoop's computational model is appropriate for tasks such as processing a huge volume of unstructured or semi-structured data, then summarizing the content or distilling the results into structured data to provide to other components or systems. If the collected data can easily be reduced to an ID and its corresponding content (a key-value pair, in Hadoop terms), then you can use that simple association to perform many kinds of rollup calculations.
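The key-value rollup pattern described here is exactly what a Hadoop Streaming mapper/reducer pair expresses, and the canonical example is word count. A minimal sketch, driven in-process for illustration; with Hadoop Streaming, the mapper and reducer would each be standalone scripts reading stdin, with the framework's shuffle/sort between them:

```python
from itertools import groupby

def mapper(line):
    """Map step: emit a (word, 1) key-value pair for every word."""
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    """Reduce step: roll up all the counts that share one key."""
    return word, sum(counts)

def word_count(lines):
    # Hadoop shuffles and sorts between map and reduce; sorted() stands in here.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(reducer(word, (c for _, c in group))
                for word, group in groupby(pairs, key=lambda kv: kv[0]))

print(word_count(["big data big plans", "data pipelines"]))
```

Because each key's reduction is independent, the work partitions cleanly across machines, which is precisely the "simple association" the paragraph describes.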
In general, the key is to recognize the resources you have and to understand the nature of the problem you want to solve. Together with the points discussed in this article and your own judgment, that will let you choose the tool that suits you best. And in some cases, the final answer may well be Hadoop.
What are your experiences and lessons in using Hadoop? Please share in the comments.
This article is an English version of an article originally published in Chinese on aliyun.com.