In the previous article, I briefly talked about HDFS. In simple terms, HDFS has a big brother called the "namenode" and a group of little brothers called "datanodes", and together they store a pile of data: the big brother keeps the directory of where the data lives, while the little brothers do the actual storing. Each brother, big or little, is really just a computer, and they are connected to one another through switches.
In fact, this big brother and his group of little brothers can do more than store data; they can also carry out a lot of computing work. When they do, they take on new names: the big brother is called the "jobtracker" and the little brothers are called "tasktrackers", and the framework they run is MapReduce. Let's talk about MapReduce today.
This is just a broad introduction, so that everyone gets a general picture. Once you have that general picture, the individual details are easy to pick up from other materials; it's only a matter of time.
When I first started learning MapReduce, I was confused by all the concepts. What on earth is the difference between a task and a job? Between a split, a data shard, a data block, and a block? Are map, Mapper, and the map method the same thing? What key and value does map receive: one row of data, or the whole split? What exactly does reduce take as input? How does the data flow between them? And then sort, merge, and shuffle show up too? Enough already!
If you have the same doubts, take it easy. Just remember two points: 1. MapReduce is a framework, so it is actually very simple; get that idea into your head first. 2. With that idea in place, don't get impatient. Now let's take a look.
To explain MapReduce clearly, I will walk through the classic word-count example, step by step:
1. What are we doing?
There is a text file with many words in it. How big is the file? It doesn't matter; just assume it has a huge number of lines. What we need to do is count how many times each word appears in the file and write the result out to another file. Put simply, it works like this:
Input: a text file containing many words.
For example, the file test.txt contains the following content:
hello world
hello hadoop
...
hello dog
hello world
hello jobs
Output: a file listing the number of times each word appears.
For example, the statistical result is:
hello 29
world 300
hadoop 34
jobs 1
...
2. Programming
For the task above, we write a program, call it mywordcount. We submit this program to MapReduce and let the big brother and his little brothers carry it out. One such submission is what we call a job.
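To make "job" a bit more concrete, here is a minimal sketch of what a mywordcount driver might look like, assuming Hadoop's newer org.apache.hadoop.mapreduce API. The class names, paths, and the reducer reference are my own illustration, not something from this article (a Mapper is sketched further down, and reduce is left for a later article).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MyWordCount {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // This Job object is the "job" we hand to the big brother (the jobtracker).
            Job job = Job.getInstance(conf, "mywordcount");
            job.setJarByClass(MyWordCount.class);
            job.setMapperClass(WordCountMapper.class);   // Mapper sketched later in this article
            job.setReducerClass(WordCountReducer.class); // hypothetical reducer, covered in a later article
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("test.txt"));           // the input text file
            FileOutputFormat.setOutputPath(job, new Path("wordcount-output")); // where the counts end up
            System.exit(job.waitForCompletion(true) ? 0 : 1);  // submit and wait for the job to finish
        }
    }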
3. What does the program do?
Here comes the key part.
1. File Segmentation
The big data file test.txt is first split into pieces, and each piece is called a split. For convenience, suppose it is divided into five splits: split1 through split5. Roughly speaking, you can think of this as cutting test.txt into five small files, split1~5, each containing many rows of data. (How the input file is cut is controlled by the InputSplit/InputFormat settings, which you can configure; here, assume the content of test.txt is simply divided from top to bottom into five parts.) The next step is to do the word counting on these five splits separately, which is what makes the computation distributed. Each split serves as the input to one map operation, so we call that unit of work a map task. You can also think of it in terms of a Mapper: in the code, Mapper is a class that you extend.
So the file is cut into five data shards, split1~5, and each shard corresponds to one map task, giving five map tasks in total: map1~5. Who does these five tasks? The big brother, the jobtracker, hands them out to the little brothers, the tasktrackers. If there are five little brothers, each takes one map task; if there are only three little brothers, some of them have to take on more than one.
(In real operation, each little brother typically runs somewhere between 10 and 100 map tasks; if the maps are very light on CPU, the big brother may assign a little brother as many as about 300.)
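As an aside on how the cutting into splits can be influenced: with the default TextInputFormat, split size roughly follows the HDFS block size, but FileInputFormat lets you nudge it. The sketch below again assumes the newer org.apache.hadoop.mapreduce API, and the 64 MB / 128 MB figures are purely illustrative, not from this article.

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class SplitConfigSketch {
        static void configureSplits(Job job) {
            job.setInputFormatClass(TextInputFormat.class);                 // cut the file into line-oriented splits
            FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // at least 64 MB per split (illustrative)
            FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // at most 128 MB per split (illustrative)
        }
    }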
2. Map operation
Let's zoom the lens in on the map process of a single split; say, the map1 task that handles split1.
Split1 contains many rows of data, and the whole of it is handled by the map1 task. How does a map task actually operate on it? In the program, a map task is really an instance of the Mapper class, and it is the map method of that Mapper class that operates on the input. Now the question arises: what key and value are passed into the map method? All of split1's data, or a single row? The answer is: a single row of data. Then how do all those rows get processed? The answer is: the map method is run many times, once per row.
So, to sum up: split1 has many rows of data, the map1 task processes all of them, and for each row the map method is run once. (A sketch of such a Mapper follows the worked example below.)
Suppose split1 has three rows:
Row 1: hello world
Row 2: hello hadoop
Row 3: hello hadoop
The map1 task then runs the map method three times.

For the first run of the map method:
the input key is 1 and the value is "hello world" (the keys here are numbers I made up; the values are the real line contents). After the code in the map method runs, the output is:
hello 1
world 1

For the second run of the map method:
the input key is 12 and the value is "hello hadoop". After the map method runs, the output is:
hello 1
hadoop 1

For the third run of the map method:
the input key is 23 and the value is "hello hadoop". After the map method runs, the output is:
hello 1
hadoop 1
Finally, after the map1 task has run the map method on every row of split1, its overall output looks like this:
hello 1
world 1
hello 1
hadoop 1
hello 1
hadoop 1
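The worked example above corresponds to a Mapper roughly like the following sketch (my own illustration, using the standard org.apache.hadoop.mapreduce API): the framework calls map() once per line, passing the line's position as the key and the line text as the value, and the method emits a (word, 1) pair for each word.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // e.g. value = "hello world"  ->  emits ("hello", 1) and ("world", 1)
            for (String token : value.toString().toLowerCase().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

Note that with the default TextInputFormat the key actually handed to map is the line's byte offset in the file, which is why the keys 1, 12, and 23 above were only stand-ins.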
In the end, the five splits split1~5 are processed by the five map tasks map1~5, producing five sets of output. These outputs sit on different nodes as intermediate files, and we may not even know where they are.
Next comes the reduce step, which merges these outputs and computes the final result. But between map and reduce a lot of other work happens, and this article is already long enough, so I will write about that later.