Preface
In my mind, MapReduce had always been the exclusive domain of Java and similar languages. Given the performance limitations of Python (and even PyPy), I had never considered writing distributed tasks in Python; at most I would have multiple workers pull a task off a message queue. But a recent experience has really overturned my understanding of Python.
Let me start with how this came about.
One day, after I gave an internal talk on sed and awk, the leaders asked me to use some real consulting work as an example: extracting the desired data from our large volume of logs. The requirement was roughly this (I have removed some obscure professional terms and describe it concretely):
Requirement
1. There is a large number of gzip-compressed files; find the data for one or two specific days. Each line is one record.
2. Decompress each file and go through it line by line. Lines are comma-separated; find the lines whose 21st column is 16233 and whose 23rd column is 27188, use the 2nd column as the key, and count how many times each key matches.
3. From the combined statistics, count how many keys share each match count. For example, given {'a': 2, 'b': 1, 'c': 1}, the result is {1: 2, 2: 1}: two keys matched once, and one key matched twice.
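Step 3 above is just a count-of-counts. A minimal sketch of that step on its own, using the hypothetical dictionary from the requirement:

```python
from collections import Counter

# Hypothetical per-key match counts produced by step 2
matched = {'a': 2, 'b': 1, 'c': 1}

# Step 3: count how many keys share each match count
histogram = dict(Counter(matched.values()))
print(histogram)  # -> {2: 1, 1: 2}
```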
Analysis
At first I really wanted to use awk, but after talking it over with colleagues, several difficulties emerged:
1. The two days of data total many gigabytes, and awk would have to keep the full hash of intermediate results in memory for both passes, so awk was out.
2. Based on a colleague's experience with Python: just decompressing and reading through this many small files takes more than a day.
3. The data had not been loaded into Hadoop, and there was no other better or faster method available.
Solution
My initial plan was:
Put the compressed files to be processed into a queue.
Start multiple worker processes that fetch files from the queue, process them, and put the matching results into a shared variable.
When all files are done, compute the third (histogram) result from the shared variable.
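That initial plan can be sketched roughly as follows. The file names and the per-file work are placeholders (a real version would do the column matching from step 2 of the requirement):

```python
import multiprocessing

def worker(task_queue, result_queue):
    # Pull file names off the queue until the None sentinel arrives;
    # push a per-file result back (here just the name length, as a
    # stand-in for the real matching work).
    for name in iter(task_queue.get, None):
        result_queue.put((name, len(name)))

if __name__ == '__main__':
    tasks = multiprocessing.Queue()
    results = multiprocessing.Queue()
    files = ['a.gz', 'bb.gz']  # hypothetical input files
    workers = [multiprocessing.Process(target=worker, args=(tasks, results))
               for _ in range(2)]
    for w in workers:
        w.start()
    for name in files:
        tasks.put(name)
    for _ in workers:
        tasks.put(None)  # one stop sentinel per worker
    for w in workers:
        w.join()
    collected = dict(results.get() for _ in files)
    print(collected)  # e.g. {'a.gz': 4, 'bb.gz': 5} (order may vary)
```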
But today's topic is Python MapReduce, which is my later version. It is derived from the great Doug Hellmann's "Implementing MapReduce with multiprocessing".
#!/usr/bin/env python
# coding=utf-8
# A MapReduce-style runner built on multiprocessing
# Author: Dongweiming

import gzip
import time
import glob
import collections
import itertools
import operator
import multiprocessing


class AdMapReduce(object):

    def __init__(self, map_func, reduce_func, num_workers=None):
        '''
        num_workers: number of worker processes; defaults to the number of CPU cores
        map_func: the map function; must return a list like [('a', 1), ('b', 3)]
        reduce_func: the reduce function; must return a tuple like ('c', 10)
        '''
        self.map_func = map_func
        self.reduce_func = reduce_func
        self.pool = multiprocessing.Pool(num_workers)

    def partition(self, mapped_values):
        '''Group the mapped (key, value) pairs by key'''
        partitioned_data = collections.defaultdict(list)
        for key, value in mapped_values:
            partitioned_data[key].append(value)
        return partitioned_data.items()

    def __call__(self, inputs, chunksize=1):
        '''Triggered when an instance is called'''
        # Both phases use multiprocessing.Pool.map; inputs is the list to be
        # processed -- think of the built-in map function.
        # chunksize is how many items each mapper receives at a time;
        # adjust it as needed for efficiency.
        map_responses = self.pool.map(self.map_func, inputs, chunksize=chunksize)
        # itertools.chain flattens the mapper results into a single iterable
        partitioned_data = self.partition(itertools.chain(*map_responses))
        # Now the data looks like [('a', [1, 2]), ('b', [2, 3])]; each number in
        # a list is one match, and reduce sums the list for every key.
        reduced_values = self.pool.map(self.reduce_func, partitioned_data)
        return reduced_values


def mapper_match(one_file):
    '''First map function: collect the matching entries from one file'''
    output = []
    for line in gzip.open(one_file):
        l = line.rstrip().split(',')
        if int(l[20]) == 16309 and int(l[22]) == 2656:
            cookie = l[1]
            output.append((cookie, 1))
    return output


def reduce_match(item):
    '''First reduce function: sum the occurrences of each key'''
    cookie, occurrences = item
    return (cookie, sum(occurrences))


def mapper_count(item):
    '''Second map function: use each key's total count as the new key, with value 1'''
    _, count = item
    return [(count, 1)]


def reduce_count(item):
    '''Second reduce function: sum how many keys share each count'''
    freq, occurrences = item
    return (freq, sum(occurrences))


if __name__ == '__main__':
    start = time.time()
    input_files = glob.glob('/datacenter/input/2013-12-1[01]/*')
    mapper = AdMapReduce(mapper_match, reduce_match)
    cookie_feq = mapper(input_files)
    mapper = AdMapReduce(mapper_count, reduce_count)
    cookie_feq = mapper(cookie_feq)
    cookie_feq.sort(key=operator.itemgetter(1))
    for freq, count in cookie_feq:
        print '{0}\t{1}\t{2}'.format(freq, count, freq * count)
    # cookie_feq.reverse()
    end = time.time()
    print 'cost:', end - start
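To sanity-check the first map function, here is a self-contained sketch that builds a tiny fake .gz input and runs the same matching logic. The function is rewritten slightly so it also runs under Python 3, where gzip yields bytes unless opened in text mode; the file contents are made up for illustration:

```python
import gzip
import os
import tempfile

def mapper_match(one_file):
    # Same logic as the post's first map function, opened in text
    # mode so it works under Python 3 as well
    output = []
    with gzip.open(one_file, 'rt') as f:
        for line in f:
            l = line.rstrip().split(',')
            if int(l[20]) == 16309 and int(l[22]) == 2656:
                output.append((l[1], 1))
    return output

# Build a tiny fake gz file: 25 comma-separated columns, one matching
# row and one non-matching row
row = ['0'] * 25
row[1] = 'cookieA'
row[20] = '16309'
row[22] = '2656'
matching = ','.join(row)
row[20] = '999'  # this row no longer matches column 21
non_matching = ','.join(row)

fd, path = tempfile.mkstemp(suffix='.gz')
os.close(fd)
with gzip.open(path, 'wt') as f:
    f.write(matching + '\n' + non_matching + '\n')

print(mapper_match(path))  # -> [('cookieA', 1)]
os.remove(path)
```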
Remarks
Wow, it is so elegant to watch Python do MapReduce. I ran it with PyPy and the whole job took only 61 minutes.
But in fact this only borrows the MapReduce idea on top of multi-core hardware; the pool still parallelizes at file granularity. With a small number of large files, it might not work nearly as well.
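The granularity point is easy to see with Pool.map itself: work is divided per input item (in groups of chunksize), so a single huge item keeps only one worker busy no matter how many cores are available. A minimal illustration with made-up numeric "chunks" standing in for files:

```python
import multiprocessing

def total(chunk):
    # Stand-in for per-file processing: sum one chunk of numbers
    return sum(chunk)

if __name__ == '__main__':
    pool = multiprocessing.Pool(4)
    # Eight input items can be spread across four workers...
    chunks = [range(i, 1000000, 8) for i in range(8)]
    print(sum(pool.map(total, chunks)))       # -> 499999500000
    # ...but a single item occupies just one worker, however large it is.
    print(pool.map(total, [range(1000000)])[0])  # -> 499999500000
    pool.close()
    pool.join()
```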
I think a lot of work of this kind can be handed over to the AdMapReduce class.