Facing the problem and complexity of Hadoop mapreduce

Source: Internet
Author: User
Keywords nbsp; complexity however some Java

"IT168 Technology" as one of the most representative of large data technology, Hadoop for those who are prepared to explore business impact data is very attractive to IT departments. Hadoop's distributed approach is better suited to dealing with massive unstructured data, but Hadoop and its associated MapReduce programming models are not a panacea, and mapreduce and Hadoop issues always affect the big data novice.

The problems that users who deploy Hadoop may encounter include complexity, performance, and system management. But there is still a growing concern about it. Marketanalysis.com estimates that the Hadoop MapReduce market will grow at a rate of 58% a year to $2.2 billion trillion 2018 years ago. Hadoop (originally developed by Yahoo) contains a lot of tools and is still growing. However, finding the right team member is not an easy task.

"When we talk about Hadoop, it's actually just a distributed file system," says Joe Caserta, president of Caserta Concepts, a data warehousing consultancy. All distributed systems have tools to operate distributed data system data. They include hive query tools to simulate structured data; Pig, which is actually an ETL language for manipulating Hadoop data. ”

For many people, the complexity of Hadoop stems from Java development.

Java is a popular language that is used for application-oriented development rather than data-oriented development. This is just one of many "environmental transitions" that new Hadoop faces.

Even Java developers who are proficient in SQL have many challenges with Hadoop. Hadoop cannot use SQL except for some related tools.

"Hadoop has a tool to help achieve this environmental change, that is hive," said Paul Dix, founder of Errplane, a consultant and application monitoring software developer. The tool works, but the transition is not perfect. ”

According to Dix's point of view, even with Java skills and SQL tools, there is no guarantee that you will be able to play around with Hadoop. A senior Java developer may not be able to deploy Google's proposed MapReduce model (usually associated with Hadoop). MapReduce extracts data that Hadoop has processed and gets a more accessible subset of data.

MapReduce: A handy Hadoop product

"The problem most Java developers face is how they make data processing conform to the MapReduce paradigm," Dix said. They have to learn how to write MapReduce code that corresponds to Hadoop. They must learn to understand the structure of the problem correctly. ”

Dix, a former Google intern software engineer, points out that the first task for many Google developers is to learn how to write MapReduce code that corresponds to Hadoop, and to calculate the number of words in a text file.

For the Hadoop development team, it is critical to find a good starting point. A reference development startup project is to save the log files to the Hadoop cluster and then use MapReduce to process the resulting data, such as accessing specific guest data for a Web page, response time, or the number of errors thrown by the Web application.

Easy implementation of Hadoop

MapReduce is not the only way. "There are many ways to implement Hadoop directly without having to write a mapreduce program," said Paul Mackles, senior software architecture manager at Adobe. "He refers to hive, but Hive also has some complexity.

Hive converts class SQL statements into MapReduce programs, but because of other query systems, the conversion process usually requires a lot of tuning. Mackles at the Tdwi BI Executive Summit 2013 Conference this month, the data Union "is not its forte". Mackles points out that there are also problems with performance due to other reasons, because hive only supports batching, and running with MapReduce increases startup overhead when tasks are processed and increases the load of subsequent processing when the task executes.

Hadoop also has some inherent ways to operate. The Hadoop file system is not compatible with POSIX, so the method of mounting the file system is not well known. He points out that Hadoop divides files into manageable units to support parallel processing, but development or operations team members sometimes have to "think about how to split and compress files."

Mackles points out that if you want to start learning Hadoop, it's best to pick something simpler and easier to accomplish. Mackles Some parking files, hive aggregations, and experiments as "easy use cases" for Hadoop, while pointing out that operational applications that require fast performance are not suitable for the first attempt.

Mackles lists a number of practical tools for Hadoop, including yarn, Impala, and Hcatalog. In Mackles's words, "they represent a major shift in real time." "These tools are helpful, but it's also difficult to learn so much about the new Hadoop tools in practice."

Original link: http://www.searchbi.com.cn/showcontent_71711.htm

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.