Our goal is to count the occurrences of each word in a file after tokenizing it with the IK Analyzer, and then sort the words by occurrence count in descending order; in other words, high-frequency word statistics.
Because Hadoop cannot post-process the output of a reduce within the same job, the work has to be split into two jobs: the first job counts the words, and the second job sorts the first job's results. The first job is just Hadoop's classic WordCount example, so here I will only describe how to use Hadoop to sort its results. Suppose the first job's output looks like this:

a	5
b	4
c	74
...
f	4

What we need to do is sort these words by their occurrence counts in descending order.
********************************** Split Line *****************************************
The following problems need to be considered first:
1. The previous job may have used more than one reduce task, which produces multiple result files, because each reduce task produces one result file, stored in the previous job's output directory with a name like part-r-00000.
2. The content to be sorted may be large, so the sorting job itself may also need multiple reduce tasks.
********************************* Split Line *******************************
How to design the MapReduce job:
1. In the map phase, read the text line by line; in the map method, invert the key and value of each line from the previous job's result, and emit the inverted pair as the map output:
5	a
4	b
74	c
...
4	f

2. After the map, Hadoop groups the values by key, and the result becomes:
(5: a)
(4: b, f)
(74: c)
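One detail the grouping above glosses over: Hadoop sorts map output keys in ascending order by default, so to get the descending order we want, the job also needs a descending sort comparator for the count key. A minimal sketch of the comparison logic, written as a plain Java class so it runs without a Hadoop cluster; in the real job this logic would live in a `WritableComparator` subclass registered with `job.setSortComparatorClass(...)` (names here are illustrative, not from the original post):

```java
// Descending comparison for integer keys, as the sort comparator would
// apply it during the shuffle: a negative result means the first key
// sorts earlier, so larger counts come first.
public class DescendingCountComparator {
    public static int compare(int a, int b) {
        // Reverse the natural order to sort counts from largest to smallest.
        return Integer.compare(b, a);
    }
}
```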
3. Then write a custom partitioning function based on the number of reduce tasks, so that the keys fall into multiple intervals. For example, suppose counts greater than 50 should form one interval and there are 3 reduce tasks in total; then the data is split into three intervals: counts greater than 50 go directly to partition 0, counts from 25 to 50 go to partition 1, and counts less than 25 go to partition 2. Because the number of partitions equals the number of reduce tasks, each partition corresponds to one reduce task: partitions are numbered from 0, so partition 0 is handled by the first reduce task, partition 1 by the second, and so on. Each reduce task in turn corresponds to one output file, so the first reduce task generates part-r-00000, the second generates part-r-00001, and so on. The reduce step then only needs to invert the key and value again and write them out directly. As a result, the words with the largest counts end up in the first output file, and the files as a whole follow the sorted order. The code is as follows:
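The partition thresholds described in step 3 can be sketched as follows. This is plain Java so it runs standalone; in the real job the same logic would go inside a `Partitioner<IntWritable, Text>` subclass set with `job.setPartitionerClass(...)`, and the class and method names here are illustrative:

```java
// Partition logic from step 3: counts greater than 50 go to partition 0,
// counts from 25 to 50 go to partition 1, and counts below 25 go to
// partition 2. Assumes the job is configured with 3 reduce tasks.
public class CountPartitioner {
    public static int getPartition(int count, int numReduceTasks) {
        if (numReduceTasks < 3) {
            return 0; // e.g. a local test run with a single reducer
        }
        if (count > 50) {
            return 0;
        }
        if (count >= 25) {
            return 1;
        }
        return 2;
    }
}
```

Because partition numbers map one-to-one onto reduce tasks, this is what guarantees that part-r-00000 holds the highest counts.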
******************************* Split Line *****************************************
map:
/**
 * Inverts the key and value of the previous MapReduce job's output,
 * so that the shuffle can then sort by key (the occurrence count).
 */
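The code body after this comment was lost when the page was captured. A minimal sketch of the inversion it describes, assuming the first job's output is tab-separated `word<TAB>count` lines; in the real mapper this logic would sit inside the `map` method of a `Mapper<LongWritable, Text, IntWritable, Text>` and write an `(IntWritable, Text)` pair to the `Context`, and the reducer would apply the same swap in reverse before writing its output (the class and method names here are hypothetical):

```java
// Core of the map step: parse a "word\tcount" line from the first job's
// output and return {count, word}, so the shuffle sorts by count.
// The reduce step reuses the same swap to turn (count, word) back into
// (word, count) before writing the final output.
public class KeyValueInverter {
    public static String[] invert(String line) {
        String[] parts = line.split("\t");
        // parts[0] is the word, parts[1] is its count; emit count first.
        return new String[] { parts[1], parts[0] };
    }
}
```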