Running a distributed MapReduce-style task in Python

Source: Internet
Author: User
Tags: glob, message queue, python

Preface

In my mind, MapReduce had always been the preserve of Java and similar languages. Given the performance limits of Python (and even PyPy), I had never considered writing distributed tasks in Python; at most I would have several workers pull a task off a message queue. But this last job really changed my view of Python.

Let me start with how this came about.

One day, after I gave a talk on sed and awk, my leads asked me to take a real consulting task as an example: extract the data they wanted from our large volume of raw data. The requirement was roughly as follows (I have removed some obscure domain terms and restated it in concrete terms):

Requirement


1. There is a large number of gz-compressed files; we need the data for one or two days, and each line is one actual record.
2. Decompress each file and walk through every line; the fields are separated by commas (,). Find the lines whose 21st column is 16233 and whose 23rd column is 27188, take the 2nd column as the key, and count the number of matches per key.
3. Over all the per-key results, count how many keys share each match count. For example, given {'a': 2, 'b': 1, 'c': 1}, the result is {1: 2, 2: 1}, that is, two keys matched once and one key matched twice (a minimal sketch of this step follows the list).
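To make step 3 concrete, here is a minimal sketch of that last aggregation in plain Python. The dictionary below just holds the example values from the list above, not real data:

import collections

# per-key match counts from step 2 (example values only)
per_key_counts = {'a': 2, 'b': 1, 'c': 1}

# count how many keys share each match count
freq_of_counts = collections.Counter(per_key_counts.values())

print(freq_of_counts)  # Counter({1: 2, 2: 1}): two keys matched once, one key matched twice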

Analysis

My first instinct was to use awk, but after talking it over with colleagues, several difficulties came up:

1. The total amount of data for two days runs to many gigabytes, and the hashed intermediate results would have to be kept across two passes, so awk alone is not practical.
2. With plain Python, colleagues' experience was that merely decompressing and reading through all of these small files takes more than a day, before doing any real work.
3. The data has not been loaded into Hadoop, and there was no other obviously better or faster approach.

Solution

My initial plan was:

1. Put the compressed files to be processed into a queue.
2. Start multiple worker processes that take files from the queue, process them, and put the matching results into shared state.
3. Once everything is processed, compute the third result (the frequency of each match count) from that shared state.

A minimal sketch of this plan follows.
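This sketch is only an assumption of what that first plan could look like: it reuses the same filter values and input path as the final script below, and for simplicity it collects per-file results through a result queue rather than a true shared variable.

import glob
import gzip
import collections
import multiprocessing


def worker(task_queue, result_queue):
    '''Pull file names off the task queue and push one Counter of matches per file'''
    while True:
        one_file = task_queue.get()
        if one_file is None:            # sentinel value: no more work
            break
        counts = collections.Counter()
        for line in gzip.open(one_file):
            l = line.rstrip().split(',')
            if int(l[20]) == 16309 and int(l[22]) == 2656:
                counts[l[1]] += 1
        result_queue.put(counts)


if __name__ == '__main__':
    task_queue = multiprocessing.Queue()
    result_queue = multiprocessing.Queue()
    files = glob.glob('/datacenter/input/2013-12-1[01]/*')

    workers = [multiprocessing.Process(target=worker, args=(task_queue, result_queue))
               for _ in range(multiprocessing.cpu_count())]
    for p in workers:
        p.start()

    for one_file in files:
        task_queue.put(one_file)
    for _ in workers:                   # one sentinel per worker
        task_queue.put(None)

    # merge the per-file results, then derive the frequency of each match count
    total = collections.Counter()
    for _ in files:
        total.update(result_queue.get())
    for p in workers:
        p.join()

    freq_of_counts = collections.Counter(total.values())
    print(freq_of_counts)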

But what I am covering today is the Python MapReduce version I wrote afterwards. It is derived from Doug Hellmann's "Implementing MapReduce with multiprocessing".

#!/usr/bin/env python
# coding=utf-8
# A MapReduce-style job implemented with multiprocessing
# Author: Dongweiming
import gzip
import time
import glob
import collections
import itertools
import operator
import multiprocessing


class AdMapReduce(object):

    def __init__(self, map_func, reduce_func, num_workers=None):
        '''
        num_workers: number of worker processes; defaults to the number of CPU cores
        map_func: map function; must return a list such as [('a', 1), ('b', 3)]
        reduce_func: reduce function; must return a tuple such as ('c', 10)
        '''
        self.map_func = map_func
        self.reduce_func = reduce_func
        self.pool = multiprocessing.Pool(num_workers)

    def partition(self, mapped_values):
        '''Group the mapped (key, value) pairs by key'''
        partitioned_data = collections.defaultdict(list)
        for key, value in mapped_values:
            partitioned_data[key].append(value)
        return partitioned_data.items()

    def __call__(self, inputs, chunksize=1):
        '''Run the whole job when the instance is called'''
        # Both phases use multiprocessing.Pool.map; inputs is the list to be
        # processed, just like with the built-in map function.
        # chunksize is how many items are handed to each mapper at a time;
        # tune it as needed for efficiency.
        map_responses = self.pool.map(self.map_func, inputs, chunksize=chunksize)
        # itertools.chain links all the mapper results into one iterable
        partitioned_data = self.partition(itertools.chain(*map_responses))
        # partitioned_data looks like [('a', [1, 2]), ('b', [2, 3])]; the numbers
        # in each list are the match counts, and reduce sums each list per key
        reduced_values = self.pool.map(self.reduce_func, partitioned_data)
        return reduced_values


def mapper_match(one_file):
    '''First map function: collect the matching entries from one file'''
    output = []
    for line in gzip.open(one_file):
        l = line.rstrip().split(',')
        if int(l[20]) == 16309 and int(l[22]) == 2656:
            cookie = l[1]
            output.append((cookie, 1))
    return output


def reduce_match(item):
    '''First reduce function: sum the occurrences of each key'''
    cookie, occurrences = item
    return (cookie, sum(occurrences))


def mapper_count(item):
    '''Second map function: use each key's total count as the new key,
    with the value simply marked as 1'''
    _, count = item
    return [(count, 1)]


def reduce_count(item):
    '''Second reduce function: count how many keys share each frequency'''
    freq, occurrences = item
    return (freq, sum(occurrences))


if __name__ == '__main__':
    start = time.time()
    input_files = glob.glob('/datacenter/input/2013-12-1[01]/*')
    mapper = AdMapReduce(mapper_match, reduce_match)
    cookie_feq = mapper(input_files)
    mapper = AdMapReduce(mapper_count, reduce_count)
    cookie_feq = mapper(cookie_feq)
    cookie_feq.sort(key=operator.itemgetter(1))
    for freq, count in cookie_feq:
        print '{0}\t{1}\t{2}'.format(freq, count, freq * count)
    # cookie_feq.reverse()
    end = time.time()
    print 'cost: ', end - start

Remarks

Watching Python do MapReduce like this is surprisingly elegant. I ran it with PyPy and it took only 61 minutes.

That said, it only borrows the MapReduce idea on top of multi-core hardware; the pool still parallelizes at the level of whole files. With a small number of very large files it may not work nearly as well.

I think a lot of work like this can be handed over to the AdMapReduce class from now on.
