Data Science from Scratch's MapReduce

MapReduce

MapReduce is a computational model, but one that belongs to the world of parallel computing: it describes how to split a large computation across many machines.

Consider a simple example: word counting.
from collections import Counter
import re

documents = ["data science", "big data", "science fiction"]

def tokenize(message):
    message = message.lower()
    all_words = re.findall('[a-z0-9]+', message)
    return set(all_words)

def word_count_old(documents):
    return Counter(word
                   for document in documents
                   for word in tokenize(document))

print word_count_old(documents)

Nothing could be simpler, but if there are hundreds of thousands of such documents this approach becomes very slow, and a single computer may even be overwhelmed by that much data.

First, the MapReduce version of the code:

from collections import defaultdict

def wc_mapper(document):
    """for each word in the document, emit (word, 1)"""
    for word in tokenize(document):
        yield (word, 1)

def wc_reducer(word, counts):
    """sum up all the counts emitted for a word"""
    yield (word, sum(counts))

def word_count(documents):
    collector = defaultdict(list)
    for document in documents:
        for word, count in wc_mapper(document):
            collector[word].append(count)
    print collector
    return [output
            for word, counts in collector.iteritems()
            for output in wc_reducer(word, counts)]

print word_count(documents)

This part is harder to follow, so let's step through it, using documents = ["data science", "big data", "science fiction"] as the example.

First, the documents are passed into the word_count function, which creates a key-value collector: each key is a word and each value is a list. When wc_mapper runs it emits a (word, 1) pair for every word it finds. The collecting loop is:

    for document in documents:
        for word, count in wc_mapper(document):
            collector[word].append(count)

After this code is executed, the contents of collector are:

{'science': [1, 1], 'fiction': [1], 'data': [1, 1], 'big': [1]}

Next, wc_reducer runs on each entry: the key stays the same and the values are summed. The final output is:

[('science', 2), ('fiction', 1), ('data', 2), ('big', 1)]

Who would have thought the much-hyped MapReduce algorithm was this simple?

Why use MapReduce?

The MapReduce model lets us distribute the computation: each machine can run the mapper over just the documents it holds, and only the small (key, value) pairs it emits need to be shipped to the machines doing the reducing.
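
To make this concrete, here is a minimal single-process sketch (the three-way split and the machine_chunks name are purely illustrative): each "machine" runs wc_mapper over only its own chunk, and only the emitted (word, 1) pairs have to be brought together for reducing.

from collections import defaultdict

# purely illustrative split: in reality each chunk would live on a different machine
machine_chunks = [["data science"], ["big data"], ["science fiction"]]

collector = defaultdict(list)
for chunk in machine_chunks:
    # the map phase only needs the chunk that is local to each "machine"
    for document in chunk:
        for word, count in wc_mapper(document):
            collector[word].append(count)

# the reduce phase combines the small (word, 1) pairs from all the chunks
print [output
       for word, counts in collector.iteritems()
       for output in wc_reducer(word, counts)]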

Generalizing, we want this pattern to be easy to reuse:

def map_reduce(inputs, mapper, reducer):
    collector = defaultdict(list)
    for input in inputs:
        for key, value in mapper(input):
            collector[key].append(value)
    return [output
            for key, values in collector.iteritems()
            for output in reducer(key, values)]

print map_reduce(documents, wc_mapper, wc_reducer)

However, the mapper and the reducer need to change from task to task. For example, reducers can be built out of any aggregation function:

from functools import partial

def reduce_values_using(fn, key, values):
    yield (key, fn(values))

def values_reducer(fn):
    # fn is the aggregation function applied to each key's list of values
    return partial(reduce_values_using, fn)

sum_reducer = values_reducer(sum)
count_distinct_reducer = values_reducer(lambda values: len(set(values)))

print map_reduce(documents, wc_mapper, count_distinct_reducer)
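
As an illustration (the max_reducer and min_reducer names below are just examples of what the factory can produce), the same values_reducer builds other aggregations, and feeding sum_reducer to the word-count mapper reproduces the earlier result:

max_reducer = values_reducer(max)    # largest value per key
min_reducer = values_reducer(min)    # smallest value per key

# sum_reducer together with wc_mapper is exactly the word count from before
print map_reduce(documents, wc_mapper, sum_reducer)
# [('science', 2), ('fiction', 1), ('data', 2), ('big', 1)]  (order may vary)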

Status update analysis

Suppose we want to know on which day of the week people talk about data science the most. To find out, we just need to count, for each weekday, how many status updates mention "data science".

def data_science_day_mapper(status_update):
    """yields (day_of_week, 1) if the update mentions 'data science'"""
    if 'data science' in status_update['text'].lower():
        day_of_week = status_update['created_at'].weekday()
        yield (day_of_week, 1)

data_science_days = map_reduce(status_updates,
                               data_science_day_mapper,
                               sum_reducer)
print data_science_days
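
The snippet assumes a status_updates collection that isn't defined in this post. A made-up sample of the shape it expects (a list of dicts with 'text' and 'created_at' fields), defined before the map_reduce call, could look like this:

from datetime import datetime

# hypothetical sample data, only to make the example above runnable;
# each update needs a 'text' field and a 'created_at' datetime
status_updates = [
    {"text": "Is anyone interested in a data science meetup?",
     "created_at": datetime(2015, 12, 21)},   # a Monday  -> weekday 0
    {"text": "data science is the sexiest job of the 21st century",
     "created_at": datetime(2015, 12, 23)},   # a Wednesday -> weekday 2
    {"text": "I just want to eat lunch",
     "created_at": datetime(2015, 12, 23)},
]

# with this data the call above prints [(0, 1), (2, 1)]  (order may vary)
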
Matrix multiplication

Anyone reading this will have seen matrix multiplication in college: given an m*n matrix A and an n*k matrix B, the element in row i, column j of the product C is obtained by multiplying the elements of row i of A with the corresponding elements of column j of B and adding them up, i.e. C[i][j] = A[i][0]*B[0][j] + ... + A[i][n-1]*B[n-1][j]. If a matrix is very sparse, we can represent each nonzero element by a tuple (name, i, j, value), where name identifies the matrix, (i, j) gives the row and column, and value is the (necessarily nonzero) entry. We can design a MapReduce job to multiply matrices stored this way.

I stared at this part for two hours before it finally clicked. Maybe my brain just isn't quick enough. Fair warning ahead: fasten your seat belt ...

Let's start by recalling how basic matrix multiplication works.

Well, if that doesn't ring a bell, go and brush up on the math first.
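
As a quick refresher, and assuming ordinary dense lists of lists (this sketch is not part of the MapReduce solution), the naive triple loop is:

def naive_matrix_multiply(A, B):
    # C[i][j] = A[i][0]*B[0][j] + A[i][1]*B[1][j] + ... + A[i][n-1]*B[n-1][j]
    n_rows, n_cols, n_inner = len(A), len(B[0]), len(B)
    C = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(n_inner):
                C[i][j] += A[i][k] * B[k][j]
    return C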

As we just said, in big data settings matrices are usually sparse, so consider a pair like this:

A = [[3, 2, 0],
     [0, 0, 0]]

B = [[4, -1, 0],
     [10, 0, 0],
     [0, 0, 0]]

After multiplying, the result is

32  -3   0
 0   0   0
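
In the sparse tuple representation described above, only the nonzero entries need to be written down; for these two matrices that is just five tuples (entries is simply the name used here for the list):

entries = [("A", 0, 0, 3), ("A", 0, 1, 2),
           ("B", 0, 0, 4), ("B", 0, 1, -1), ("B", 1, 0, 10)]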

Let's now analyze how to solve this with MapReduce.

Here a 2*3 matrix is multiplied by a 3*3 matrix, so the result is 2*3. For example, the (0, 0) element of the result, 32, comes from 3*4 + 2*10 + 0*0. If we trace the whole computation, we find that A's (0, 0) element is used 3 times, once each while computing C(0, 0), C(0, 1), and C(0, 2). The same holds for B, except that B's elements are reused down a column of C rather than across a row. So the mapper takes a size parameter m and, for each nonzero entry, emits one (key, value) pair for every result entry that value could feed into. The code looks like this:

def matrix_multiply_mapper(m, element):
    # element is a 4-tuple: name identifies the matrix, (i, j) is the
    # position, and value is the (nonzero) entry at that position
    name, i, j, value = element
    if name == "A":
        for k in range(m):
            # the key (i, k) says this value takes part in computing C(i, k);
            # j records which slot of that dot product it occupies (more below)
            yield ((i, k), (j, value))
    else:
        for k in range(m):
            # B's entries feed a whole column: this value takes part in C(k, j)
            yield ((k, j), (i, value))
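
To see what the mapper emits for a single entry, take A's (0, 0) element with m = 3 (the number of columns of B):

print list(matrix_multiply_mapper(3, ("A", 0, 0, 3)))
# [((0, 0), (0, 3)), ((0, 1), (0, 3)), ((0, 2), (0, 3))]

Because the same m is used for entries of B as well, a few keys such as (2, 0) get emitted even though C has only two rows; they never receive a factor from both matrices, so the reducer below simply ignores them.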

Running the mapper over all the nonzero entries and collecting the output by key gives:

{(0, 1): [(0, 3), (1, 2), (0, -1)],
 (0, 0): [(0, 3), (1, 2), (0, 4), (1, 10)],
 (2, 1): [(0, -1)],
 (1, 1): [(0, -1)],
 (2, 0): [(0, 4), (1, 10)],
 (1, 0): [(0, 4), (1, 10)],
 (0, 2): [(0, 3), (1, 2)]}

What does an entry like this mean? Take the first line: it says that element (0, 1) of the result matrix is computed from the list that follows it. How exactly to compute it is the reducer's job.

def matrix_multiply_reducer(m, key, indexed_values):
    results_by_index = defaultdict(list)
    for index, value in indexed_values:
        print index, value
        results_by_index[index].append(value)
    print results_by_index
    # an index contributes only if it received a value from both A and B
    sum_product = sum(results[0] * results[1]
                      for results in results_by_index.values()
                      if len(results) == 2)
    if sum_product != 0.0:
        yield (key, sum_product)

Let's take the first entry and work through the calculation.

(0, 1): [(0, 3), (1, 2), (0, -1)],

This says that element (0, 1) of the result matrix is computed from the values that follow, and inside the reducer those values are

[(0, 3), (1, 2), (0, -1)]

Grouping these by index gives {0: [3, -1], 1: [2]}: the first item of each pair is the index (its slot in the dot product) and the second is the value. Index 0 received a value from both A and B, so it contributes 3 * -1 = -3 to position (0, 1). What about the 2? Index 1 only has one value, because its partner in B is 0 and was never emitted, so the len(results) == 2 check skips it. The result at (0, 1) is therefore -3.
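
Putting the pieces together (a sketch of how they might be wired up, fixing m with partial the same way values_reducer did, and reusing the entries list from earlier):

from functools import partial

mapper = partial(matrix_multiply_mapper, 3)    # fix m = 3
reducer = partial(matrix_multiply_reducer, 3)

print map_reduce(entries, mapper, reducer)
# [((0, 1), -3), ((0, 0), 32)]  (order may vary; zero results are never emitted)

The debug prints inside the reducer will also show the grouping for every key, which is exactly where the values in the walkthrough above came from.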

This part is genuinely useful; it's worth taking the time to digest it.

The details really do matter!
