Preface
In my mind, MapReduce had always been the exclusive domain of Java and similar languages. Given the performance limitations of Python (and even PyPy), I had never considered writing distributed tasks in Python; at most I would have multiple workers pull a task off a message queue. But a recent experience has really overturned my understanding of Python.
Let me start with how this came about.
One day, after I gave an internal talk on sed and awk, the leaders asked me to use some real consulting work as an example: extracting the desired data from our large volume of logs. The requirement was roughly this (I have removed some obscure professional terms and describe it concretely):
Requirement
1. There is a large number of gzip-compressed files; find the data for one or two specific days. Each line is one record.
2. Decompress each file and go through it line by line. Lines are comma-separated; find the lines whose 21st column is 16233 and whose 23rd column is 27188, use the 2nd column as the key, and count how many times each key matches.
3. From the combined statistics, count how many keys share each match count. For example, given {'a': 2, 'b': 1, 'c': 1}, the result is {1: 2, 2: 1}: two keys matched once, and one key matched twice.
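Step 3 above is just a count-of-counts. A minimal sketch of that step on its own, using the hypothetical dictionary from the requirement:

```python
from collections import Counter

# Hypothetical per-key match counts produced by step 2
matched = {'a': 2, 'b': 1, 'c': 1}

# Step 3: count how many keys share each match count
histogram = dict(Counter(matched.values()))
print(histogram)  # -> {2: 1, 1: 2}
```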
Analysis
At first I really wanted to use awk, but after talking it over with colleagues, several difficulties emerged:
1. The two days of data total many gigabytes, and awk would have to keep the full hash of intermediate results in memory for both passes, so awk was out.
2. Based on a colleague's experience with Python: just decompressing and reading through this many small files takes more than a day.
3. The data had not been loaded into Hadoop, and there was no other better or faster method available.
Solution
My initial plan was:
Put the compressed files to be processed into a queue.
Start multiple worker processes that fetch files from the queue, process them, and put the matching results into a shared variable.
When all files are done, compute the third (histogram) result from the shared variable.
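That initial plan can be sketched roughly as follows. The file names and the per-file work are placeholders (a real version would do the column matching from step 2 of the requirement):

```python
import multiprocessing

def worker(task_queue, result_queue):
    # Pull file names off the queue until the None sentinel arrives;
    # push a per-file result back (here just the name length, as a
    # stand-in for the real matching work).
    for name in iter(task_queue.get, None):
        result_queue.put((name, len(name)))

if __name__ == '__main__':
    tasks = multiprocessing.Queue()
    results = multiprocessing.Queue()
    files = ['a.gz', 'bb.gz']  # hypothetical input files
    workers = [multiprocessing.Process(target=worker, args=(tasks, results))
               for _ in range(2)]
    for w in workers:
        w.start()
    for name in files:
        tasks.put(name)
    for _ in workers:
        tasks.put(None)  # one stop sentinel per worker
    for w in workers:
        w.join()
    collected = dict(results.get() for _ in files)
    print(collected)  # e.g. {'a.gz': 4, 'bb.gz': 5} (order may vary)
```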
But today's topic is Python MapReduce, which is my later version. It is derived from the great Doug Hellmann's "Implementing MapReduce with multiprocessing".
#!/usr/bin/env python
# coding=utf-8
# A MapReduce-style runner built on multiprocessing
# Author: Dongweiming

import gzip
import time
import glob
import collections
import itertools
import operator
import multiprocessing


class AdMapReduce(object):

    def __init__(self, map_func, reduce_func, num_workers=None):
        '''
        num_workers: number of worker processes; defaults to the number of CPU cores
        map_func: the map function; must return a list like [('a', 1), ('b', 3)]
        reduce_func: the reduce function; must return a tuple like ('c', 10)
        '''
        self.map_func = map_func
        self.reduce_func = reduce_func
        self.pool = multiprocessing.Pool(num_workers)

    def partition(self, mapped_values):
        '''Group the mapped (key, value) pairs by key'''
        partitioned_data = collections.defaultdict(list)
        for key, value in mapped_values:
            partitioned_data[key].append(value)
        return partitioned_data.items()

    def __call__(self, inputs, chunksize=1):
        '''Triggered when an instance is called'''
        # Both phases use multiprocessing.Pool.map; inputs is the list to be
        # processed -- think of the built-in map function.
        # chunksize is how many items each mapper receives at a time;
        # adjust it as needed for efficiency.
        map_responses = self.pool.map(self.map_func, inputs, chunksize=chunksize)
        # itertools.chain flattens the mapper results into a single iterable
        partitioned_data = self.partition(itertools.chain(*map_responses))
        # Now the data looks like [('a', [1, 2]), ('b', [2, 3])]; each number in
        # a list is one match, and reduce sums the list for every key.
        reduced_values = self.pool.map(self.reduce_func, partitioned_data)
        return reduced_values


def mapper_match(one_file):
    '''First map function: collect the matching entries from one file'''
    output = []
    for line in gzip.open(one_file):
        l = line.rstrip().split(',')
        if int(l[20]) == 16309 and int(l[22]) == 2656:
            cookie = l[1]
            output.append((cookie, 1))
    return output


def reduce_match(item):
    '''First reduce function: sum the occurrences of each key'''
    cookie, occurrences = item
    return (cookie, sum(occurrences))


def mapper_count(item):
    '''Second map function: use each key's total count as the new key, with value 1'''
    _, count = item
    return [(count, 1)]


def reduce_count(item):
    '''Second reduce function: sum how many keys share each count'''
    freq, occurrences = item
    return (freq, sum(occurrences))


if __name__ == '__main__':
    start = time.time()
    input_files = glob.glob('/datacenter/input/2013-12-1[01]/*')
    mapper = AdMapReduce(mapper_match, reduce_match)
    cookie_feq = mapper(input_files)
    mapper = AdMapReduce(mapper_count, reduce_count)
    cookie_feq = mapper(cookie_feq)
    cookie_feq.sort(key=operator.itemgetter(1))
    for freq, count in cookie_feq:
        print '{0}\t{1}\t{2}'.format(freq, count, freq * count)
    # cookie_feq.reverse()
    end = time.time()
    print 'cost:', end - start
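To sanity-check the first map function, here is a self-contained sketch that builds a tiny fake .gz input and runs the same matching logic. The function is rewritten slightly so it also runs under Python 3, where gzip yields bytes unless opened in text mode; the file contents are made up for illustration:

```python
import gzip
import os
import tempfile

def mapper_match(one_file):
    # Same logic as the post's first map function, opened in text
    # mode so it works under Python 3 as well
    output = []
    with gzip.open(one_file, 'rt') as f:
        for line in f:
            l = line.rstrip().split(',')
            if int(l[20]) == 16309 and int(l[22]) == 2656:
                output.append((l[1], 1))
    return output

# Build a tiny fake gz file: 25 comma-separated columns, one matching
# row and one non-matching row
row = ['0'] * 25
row[1] = 'cookieA'
row[20] = '16309'
row[22] = '2656'
matching = ','.join(row)
row[20] = '999'  # this row no longer matches column 21
non_matching = ','.join(row)

fd, path = tempfile.mkstemp(suffix='.gz')
os.close(fd)
with gzip.open(path, 'wt') as f:
    f.write(matching + '\n' + non_matching + '\n')

print(mapper_match(path))  # -> [('cookieA', 1)]
os.remove(path)
```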
Remarks
Wow, it is so elegant to watch Python do MapReduce. I ran it with PyPy and the whole job took only 61 minutes.
But in fact this only borrows the MapReduce idea on top of multi-core hardware; the pool still parallelizes at file granularity. With a small number of large files, it might not work nearly as well.
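The granularity point is easy to see with Pool.map itself: work is divided per input item (in groups of chunksize), so a single huge item keeps only one worker busy no matter how many cores are available. A minimal illustration with made-up numeric "chunks" standing in for files:

```python
import multiprocessing

def total(chunk):
    # Stand-in for per-file processing: sum one chunk of numbers
    return sum(chunk)

if __name__ == '__main__':
    pool = multiprocessing.Pool(4)
    # Eight input items can be spread across four workers...
    chunks = [range(i, 1000000, 8) for i in range(8)]
    print(sum(pool.map(total, chunks)))       # -> 499999500000
    # ...but a single item occupies just one worker, however large it is.
    print(pool.map(total, [range(1000000)])[0])  # -> 499999500000
    pool.close()
    pool.join()
```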
I think a lot of work of this kind can be handed over to the AdMapReduce class.