After replying to a MapReduce question

Source: Internet
Author: User

Around noon I received an email from a friend who had read my blog. He has recently been studying MapReduce and wants to use Hadoop for some work, but he ran into a few problems. I am posting his questions here together with my reply. These are only my own views, but they may be helpful to people who are new to the topic.

 

Questions:

  1. Given the map(K, V) interface, can MapReduce only be used for statistics?
  2. Besides log analysis, I want to implement full-text search, similar to file search in Windows: given keywords, find the location of matching files (or rank the files by how well they match). I am confused and don't know where to start.
  3. Your experiments use standalone mode. Suppose a 1 GB log has been uploaded to HDFS and has already been split into blocks, so that the client only needs to get the basic file information and the block locations. How does MapReduce proceed from there?

Body of my reply:

First of all, the concept of MapReduce and the architecture of Hadoop MapReduce are not the same thing. More precisely, Hadoop's architecture is only one implementation of a subset of the MapReduce concept. Anyone can implement their own processing based on their understanding of MapReduce. Put simply, multithreaded processing on a single machine is the simplest implementation of MapReduce, and coordinating multiple machines is a more complex one.
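As a rough illustration of the "multithreaded processing is the simplest MapReduce" idea, here is a minimal Python sketch. It is not how Hadoop works internally. The function names (`map_chunk`, `merge`, `split`, `map_reduce`) and the assumed log format (status code as the last field of each line) are made up for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(lines):
    """Map step: count status codes in one chunk of log lines.

    Assumes (for illustration only) that the status code is the
    last whitespace-separated field of each line.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:
            counts[fields[-1]] += 1
    return counts


def merge(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def split(lines, n_chunks):
    """Cut the input into at most n_chunks roughly equal pieces."""
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def map_reduce(lines, workers=4):
    """Run the map step on each chunk in a thread pool, then reduce."""
    chunks = split(lines, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge, partials, Counter())


if __name__ == "__main__":
    log = ["GET /a 200", "GET /b 404", "GET /c 200"]
    print(map_reduce(log, workers=2))
```

In standard CPython the threads here mainly show the structure. For CPU-bound work, swapping in `ProcessPoolExecutor` is the usual choice, and the split/map/reduce shape stays the same.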

MapReduce is most useful in the following situations:

A. Divide-and-conquer problems. (Find the critical path of the process and optimize the work that can be done in parallel.)

B. Computing close to the data. (This is also the most important feature of HDFS. Moving the computation is usually cheaper than moving the data, especially when processing massive data sets.)

C. The data volume shrinks by an order of magnitude as it is processed in parallel.

 

The rest is how each framework handles non-functional requirements, such as disaster tolerance and minimizing data coordination.

My answers to his three questions:

1. In terms of architecture, Hadoop MapReduce is best suited to statistical analysis and computation. For other kinds of work, consider whether it is actually a good fit, rather than picking the technology first and using it for its own sake.
2. You can implement this requirement with ordinary search technology. You do not have to tie it to MapReduce.
3. Ask whether you really need to process massive data on HDFS, or whether you can simply partition the data physically or logically according to your business scenario and process it directly. HDFS provides file splitting, disaster tolerance, and independence between logical and physical data storage (which makes it easy to expand and manage, and is also the technical basis for computing close to the data).
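For answer 2, one common way to read "search technology" is an inverted index, which needs no MapReduce at all for a modest number of files. The sketch below is only an assumption about what such a solution might look like. The names `build_index` and `search` are hypothetical, and the "matching degree" is simply the number of keyword occurrences.

```python
from collections import defaultdict


def build_index(docs):
    """Build an inverted index: word -> {file name: occurrence count}.

    `docs` maps a file name to its text. Tokenization is a plain
    lowercase whitespace split, which is a simplification.
    """
    index = defaultdict(dict)
    for name, text in docs.items():
        for word in text.lower().split():
            index[word][name] = index[word].get(name, 0) + 1
    return index


def search(index, query):
    """Return file names ranked by total keyword occurrences (ties by name)."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for name, count in index.get(word, {}).items():
            scores[name] += count
    return sorted(scores, key=lambda name: (-scores[name], name))


if __name__ == "__main__":
    docs = {
        "a.log": "error disk full error",
        "b.log": "disk ok",
        "c.log": "network down",
    }
    index = build_index(docs)
    print(search(index, "error disk"))
```

A dedicated search engine library would replace this in practice. The point is only that keyword lookup and ranking are an indexing problem rather than a batch statistics problem.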

 

Whether to use the MapReduce framework, and whether to store data in HDFS, comes down to whether you really need them. When an existing framework does not suit you, you can write your own MapReduce-style processing; for small-scale problems, the simplest approach is to use multiple threads or multiple processes. The requirements determine the choice of technology.
