Five reasons not to use Hadoop for big data analysis

Source: Internet
Author: User

Joe Brightly, a huge fan of Hadoop, has admitted on countless occasions that he loves Hadoop for data processing: "You can handle petabytes of data, you can scale to thousands of nodes that handle a lot of computing work, and you can store and load data in a very flexible way ..." But when he deployed Hadoop for big data analysis, he realized that it was not omnipotent.

At Quantivo, Joe and his colleagues "explored many ways to deploy Hadoop to answer analytic queries", until it finally felt like trying to build a house with nothing but a hammer: not impossible, but it brought "unnecessary pain and ridiculous inefficiencies". Joe gives five reasons not to use Hadoop for data analysis:

1: "Hadoop is a framework, not a solution". Many people expect Hadoop to work immediately and efficiently for big data analysis, but in practice that holds only for simple queries. For difficult analysis problems, Hadoop quickly falls short, because it requires you to write map/reduce code directly. For this reason, Hadoop is more like the Java EE programming environment than a business analytics solution. A framework means that you have to do custom, business-specific development and implementation on top of it, and all of that carries a cost.
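To make the first point concrete, here is a minimal sketch in plain Python (not Hadoop's actual API) of what a single GROUP BY query turns into once it has to be written as map and reduce steps. The sample records and the names `map_sale`, `reduce_sales`, and `run_job` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical input: (product, amount) sales records.
SALES = [("widget", 10), ("gadget", 5), ("widget", 7), ("gadget", 1)]

def map_sale(record):
    """Map step: emit one (key, value) pair per input record."""
    product, amount = record
    yield product, amount

def reduce_sales(product, amounts):
    """Reduce step: combine all values that share a key."""
    return product, sum(amounts)

def run_job(records, mapper, reducer):
    """Tiny in-memory stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Sorting the keys mimics the sorted order in which reducers see input.
    return [reducer(key, groups[key]) for key in sorted(groups)]

# Roughly: SELECT product, SUM(amount) FROM sales GROUP BY product
totals = run_job(SALES, map_sale, reduce_sales)
print(totals)
```

Even for this one-line query, the analyst has to supply a mapper, a reducer, and the wiring between them. On a real cluster the grouping step is distributed across nodes and the job has to be packaged and deployed, which is where the development cost described above comes from.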

2: "Hadoop's subprojects Hive and Pig are good, but they cannot get beyond its architectural limitations." Joe describes Hive and Pig as well-designed tools that help non-specialist engineers use Hadoop quickly and efficiently by translating analytic queries, written in familiar SQL or script form, into Java map/reduce tasks that can be deployed in a Hadoop environment. Hive is a data warehousing tool built on Hadoop that supports data summarization, ad hoc queries, and analysis of large datasets stored in Hadoop-compatible file systems. Pig is a high-level data flow language and execution framework for parallel computation. But Joe argues that some limitations of Hadoop's map/reduce framework can lead to inefficiencies, particularly when nodes have to communicate with each other, as sorts and joins require.
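The cost of communication between nodes is easiest to see in a join. The following is a simplified, single-process sketch of a reduce-side join, assuming two small hypothetical datasets; the names `map_tagged`, `shuffle`, and `reduce_join` are illustrative and not part of Hive, Pig, or Hadoop.

```python
from collections import defaultdict

# Hypothetical inputs: two datasets that share a customer id.
CUSTOMERS = [(1, "Ann"), (2, "Bob")]
ORDERS = [(1, "book"), (2, "lamp"), (1, "pen")]

def map_tagged(customers, orders):
    """Map step: tag each record with its source dataset."""
    for cust_id, name in customers:
        yield cust_id, ("C", name)
    for cust_id, item in orders:
        yield cust_id, ("O", item)

def shuffle(pairs):
    """Shuffle step: route every record to the group for its key.
    On a real cluster this is the network transfer and sort between nodes."""
    groups = defaultdict(list)
    moved = 0
    for key, value in pairs:
        groups[key].append(value)
        moved += 1
    return groups, moved

def reduce_join(groups):
    """Reduce step: pair each customer name with that customer's orders."""
    joined = []
    for cust_id in sorted(groups):
        names = [v for tag, v in groups[cust_id] if tag == "C"]
        items = [v for tag, v in groups[cust_id] if tag == "O"]
        joined.extend((cust_id, n, i) for n in names for i in items)
    return joined

groups, records_moved = shuffle(map_tagged(CUSTOMERS, ORDERS))
joined = reduce_join(groups)
print(joined)
print(records_moved)
```

In this sketch every record from both inputs passes through the shuffle before a single joined row can be produced. A query language layered on top of map/reduce can hide that step from the user, but it cannot remove it.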

3: "Deployment is very convenient, fast and free, but the cost of later maintenance and development is very high". Joe does not deny that an engineer can download Hadoop, install it, and run a simple query within an hour, which is why it is so popular. As an open source project with no software license cost, it is also a very attractive alternative to Oracle and Teradata. But like many general-purpose open source frameworks, it does not fully fit your business as delivered, so to make it serve your business you have to invest in development and maintenance. Joe adds that once you enter the maintenance and development phase, the real cost of Hadoop becomes apparent.

4: "Very effective for big data pipelines and summaries, but terrible for ad hoc analysis." Joe explains: "Hadoop is good at analyzing and summarizing large amounts of data, or at running a 'pipeline' that converts raw data into a form that another application (such as search or text mining) can use more effectively; that is what it is meant for. However, if you don't know in advance what to analyze, or if you want to explore patterns in the data, Hadoop quickly becomes unmanageable." This comes back to the business itself: a framework exists to serve the business, and even the analysis and aggregation of big data is hard to separate from the business characteristics of that data. So for an ad hoc analysis, you still have to spend a lot of time writing and running MapReduce code to reach your goal.

5: "Performance is good, except when it is bad." Joe writes: "When you need to analyze a lot of data, Hadoop lets you compute in parallel across thousands of nodes, which has great potential. However, not all analytical work can be easily parallelized, especially analysis driven by user interaction." So if you want good performance, you still need to design and optimize the Hadoop program specifically for the problem you want to solve; otherwise it will be slow, "because each map/reduce task waits until the previous work is done." As with a critical path, the speed of a Hadoop job depends on its slowest MapReduce task.
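The critical-path remark can be illustrated with a little arithmetic. The sketch below is a deliberately simplified model: it assumes there are enough nodes to run all tasks of a job at once and that chained jobs run strictly one after another. The task durations are made up for illustration.

```python
# Hypothetical task durations in seconds for two chained MapReduce jobs.
JOB_1_TASKS = [12, 15, 11, 90]   # one straggler task
JOB_2_TASKS = [20, 22, 21, 19]

def job_time(task_durations):
    """With all tasks running in parallel, a job ends when its slowest task ends."""
    return max(task_durations)

def pipeline_time(jobs):
    """Chained jobs run one after another, so their times add up."""
    return sum(job_time(tasks) for tasks in jobs)

total = pipeline_time([JOB_1_TASKS, JOB_2_TASKS])
print(total)
```

Under these assumptions the single 90-second straggler dominates the whole pipeline, even though most tasks finish in under 25 seconds. This is why tuning for the specific problem matters so much.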

Joe concludes: "Hadoop is an excellent tool for doing some very complex data analysis. Ironically, however, it also requires a lot of programming work to get answers to these questions." This applies beyond data analysis; it reflects the trade-off you face whenever you choose an open source framework. When you choose an open source framework or code base, think about how much it will help you, how much time and cost it will save, and how much more efficient it will make you. Also consider the new costs it creates: engineers' learning costs, development and maintenance costs, future scalability, the need to train yourself and your team if the framework is adopted, and security, since flaws in an open source framework are publicly known.
