Use Python to join data sets in Hadoop


Introduction to Hadoop streaming

Hadoop ships with a tool called streaming that supports Python, shell, C++, PHP, and any other language that can read from stdin and write to stdout. How it works is easiest to see by comparing it with a standard Java map-reduce program:

Implementing the map-reduce program in native Java:
  1. Hadoop prepares the data and sends it to the Java map program.
  2. The Java map program processes the data and outputs O1.
  3. Hadoop splits and sorts O1 and sends it to different reduce machines.
  4. Each reduce machine passes the data to its reduce program.
  5. The reduce program processes the data and outputs the final data O2.

Implementing the map-reduce program in Python with Hadoop streaming:
  1. Hadoop prepares the data and sends it to the Java map program.
  2. The Java map program turns the data into key/value pairs and sends them to the Python map program.
  3. The Python map program processes the data and returns the result to the Java map program.
  4. The Java map program outputs the data as O1.
  5. Hadoop splits and sorts O1 and sends it to different reduce machines.
  6. Each reduce machine turns the incoming data into key/value pairs and sends them to the Python reduce program.
  7. The Python reduce program processes the data and returns the result to the Java reduce program.
  8. The Java reduce program processes the data and outputs the final data O2.

Comparing the map steps and the reduce steps of the two lists, the streaming program has one extra intermediate processing step on each side, so the efficiency and performance of a streaming program should be lower than those of the Java version. However, Python's development efficiency (and sometimes its running performance) is higher than Java's, and that is the advantage of streaming.
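To make that stdin/stdout contract concrete, here is a minimal word-count mapper/reducer pair written in Python 2 to match the article's scripts. The file names wc_mapper.py and wc_reducer.py and the word-count task itself are illustrative assumptions, separate from the join example that follows.

# wc_mapper.py: minimal streaming mapper (illustrative sketch)
import sys

for line in sys.stdin:
    for word in line.strip().split():
        # emit key<TAB>value; Hadoop sorts these lines by key
        print '\t'.join((word, '1'))

# wc_reducer.py: minimal streaming reducer (illustrative sketch)
import sys

last_key = None
count = 0
for line in sys.stdin:
    key, value = line.strip().split('\t')
    if last_key is not None and key != last_key:
        # key changed; the sorted input guarantees last_key is finished
        print '\t'.join((last_key, str(count)))
        count = 0
    last_key = key
    count += int(value)
if last_key is not None:
    print '\t'.join((last_key, str(count)))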

Why Hadoop needs to join data sets

Hadoop is used for data analysis, and most of its operations work on whole data sets, so joining two sets is very common: it lets the records of one set pick up the corresponding information from the other set.

For example, consider the following requirement. There are two data sets: student information (student ID, name) and student scores (student ID, course, score). Their common primary key is the student ID. We want to combine the two to obtain (student ID, name, course, score):

(student ID, name) join (student ID, course, score) = (student ID, name, course, score)

Data example 1: student information

sno (student ID)    name
01                  Name1
02                  Name2
03                  Name3
04                  Name4

Data example 2: student scores

sno (student ID)    courseno (course no.)    grade
01                  01                       80
01                  02                       90
02                  01                       82
02                  02                       95

Expected final output:

sno (student ID)    name     courseno    grade
01                  Name1    01          80
01                  Name1    02          90
02                  Name2    01          82
02                  Name2    02          95

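For testing, the two input tables can be written as tab-separated files. The file names data_info and data_grade match what the mapper script below keys on; the tab delimiter follows the split("\t") used in the scripts.

# data_info (tab-separated: sno<TAB>name)
01	Name1
02	Name2
03	Name3
04	Name4

# data_grade (tab-separated: sno<TAB>courseno<TAB>grade)
01	01	80
01	02	90
02	01	82
02	02	95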
Considerations for implementing join

If you want to write a sound and robust map-reduce program, I suggest you first work out the format of the input data and the format of the output data, then construct some input data by hand and compute the expected output by hand. In doing so you will notice the points that matter when writing the program:

  1. What is the key of the join? Is it one field or two? In this example the key is sno, a single field.
  2. Can keys repeat within each set? In this example, keys in data 1 cannot repeat, while keys in data 2 can.
  3. Can a key be missing from one of the sets? In this example, a student may have no scores, so a key from data 1 may have no records in data 2.

Point 1 affects the key-field and partition settings in the Hadoop startup script, point 2 affects how the map-reduce code is implemented, and point 3 also affects how the code is written.

How Hadoop implements the join operation

The idea is to attach a numeric tag to each data source, so that after Hadoop sorts the data, records with the same key end up next to each other, ordered by their tags; adjacent records with the same key can then be merged and output directly.

1. Map stage: tag the records of table 1 and table 2, which in effect adds an extra output field. In this example, records from table 1 are tagged 0 and records from table 2 are tagged 1.

2. Partition stage: partition and sort the data using the student ID as the first sort key and the tag as the second sort key.

3. Reduce stage: because the first and second sort keys are already in order, adjacent records with the same key are merged and output (see the trace below).
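With the sample data above and tags 0 and 1, the merged and sorted stream that reaches the reducer would look like this (fields are tab-separated; shown with spaces here for readability):

01  0  Name1
01  1  01  80
01  1  02  90
02  0  Name2
02  1  01  82
02  1  02  95
03  0  Name3
04  0  Name4

For each key the tag-0 record (the name) arrives first, so the reducer only has to remember it and merge it into every following tag-1 record with the same key.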

Implementing the join's map and reduce code in Python

mapper.py code:

# -*- coding: utf-8 -*-
# mapper.py
# from crazyant www.crazyant.net
import os
import sys

def mapper():
    # Get the name of the file currently being processed;
    # there are two input files, so we must tell them apart.
    filepath = os.environ["map_input_file"]
    filename = os.path.split(filepath)[-1]
    for line in sys.stdin:
        if line.strip() == "":
            continue
        fields = line[:-1].split("\t")
        sno = fields[0]
        # The filename test below is needed because the two files
        # have different fields and need different tags.
        if filename == 'data_info':
            name = fields[1]
            # The number '0' is the unified tag for data source 1
            print '\t'.join((sno, '0', name))
        elif filename == 'data_grade':
            courseno = fields[1]
            grade = fields[2]
            # The number '1' is the unified tag for data source 2
            print '\t'.join((sno, '1', courseno, grade))

if __name__ == '__main__':
    mapper()
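Because the script reads the map_input_file environment variable, which Hadoop streaming sets for each map task, a quick local test has to set it by hand. A hypothetical shell session, assuming the sample input files shown earlier:

# Hadoop sets map_input_file for each map task; locally we set it ourselves
map_input_file=data_info python mapper.py < data_info
# 01  0  Name1
# 02  0  Name2
# ...
map_input_file=data_grade python mapper.py < data_grade
# 01  1  01  80
# ...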

reducer.py code:

# -*- coding: utf-8 -*-
# reducer.py
# from crazyant www.crazyant.net
import sys

def reducer():
    # lastsno records the previous sno, so we can tell whether
    # the current record has the same key as the previous one
    lastsno = ""
    for line in sys.stdin:
        if line.strip() == "":
            continue
        fields = line[:-1].split("\t")
        sno = fields[0]
        '''
        Processing logic:
        when the current key differs from the previous key and the tag is 0,
        record the name value;
        when the current key equals the previous key and the tag is 1,
        output the recorded name together with this record as a final result.
        '''
        if sno != lastsno:
            name = ""
            # No need to handle tag == 1 here: sno != lastsno with
            # tag == 1 means this key has no record in data source 1
            if fields[1] == "0":
                name = fields[2]
        elif sno == lastsno:
            # No need to handle tag == 0 here: sno == lastsno with
            # tag == 0 cannot happen, since keys in data source 1 do not repeat
            if fields[1] == "1":
                courseno = fields[2]
                grade = fields[3]
                if name:
                    print '\t'.join((lastsno, name, courseno, grade))
        lastsno = sno

if __name__ == '__main__':
    reducer()

Shell script to launch the Hadoop job:

# Delete the output directory first
~/hadoop-client/hadoop/bin/hadoop fs -rmr /hdfs/jointest/output

# from crazyant www.crazyant.net
# Note: the environment paths below differ from machine to machine
~/hadoop-client/hadoop/bin/hadoop streaming \
    -D mapred.map.tasks=10 \
    -D mapred.reduce.tasks=5 \
    -D mapred.job.map.capacity=10 \
    -D mapred.job.reduce.capacity=5 \
    -D mapred.job.name="join--sno_name-sno_courseno_grade" \
    -D num.key.fields.for.partition=1 \
    -D stream.num.map.output.key.fields=2 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -input "/hdfs/jointest/input/*" \
    -output "/hdfs/jointest/output" \
    -mapper "python26/bin/python26.sh mapper.py" \
    -reducer "python26/bin/python26.sh reducer.py" \
    -file "mapper.py" \
    -file "reducer.py" \
    -cacheArchive "/share/python26.tar.gz#python26"

# Check whether the job succeeded: prints 0 on success
echo $?
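Two of the settings above do the heavy lifting: num.key.fields.for.partition=1 tells the KeyFieldBasedPartitioner to partition on the first field only (sno), so all records of one student reach the same reducer, while stream.num.map.output.key.fields=2 makes Hadoop sort on the first two fields (sno first, then the tag). After the job finishes, the result can be inspected with the usual HDFS commands, for example:

# List and print the job output (paths as in the script above)
~/hadoop-client/hadoop/bin/hadoop fs -ls /hdfs/jointest/output
~/hadoop-client/hadoop/bin/hadoop fs -cat "/hdfs/jointest/output/part-*"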

You can construct input data by hand, compute the expected output by hand, and compare the two to test the program; this program has been verified that way.
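The whole flow can also be simulated locally before submitting to the cluster; here sort -k1,1 -k2,2 stands in for Hadoop's partition-and-sort step (a sketch, assuming the data_info and data_grade files shown earlier):

(map_input_file=data_info python mapper.py < data_info; \
 map_input_file=data_grade python mapper.py < data_grade) \
    | sort -k1,1 -k2,2 \
    | python reducer.py
# expected output, matching the table above:
# 01  Name1  01  80
# 01  Name1  02  90
# 02  Name2  01  82
# 02  Name2  02  95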

More notes

Hadoop join operations come in many varieties, and each variety calls for a differently written script. They can be classified by the number of key fields, by the number of value fields, and by whether keys can repeat. The following summary shows what each factor affects:

Impact type: number of key fields. Affects:

  1. The num.key.fields.for.partition setting in the startup script
  2. The stream.num.map.output.key.fields setting in the startup script
  3. How the key is extracted in the map and reduce scripts
  4. How each record is compared with the previous one in the map and reduce scripts

Impact type: whether keys can repeat. If keys in data source 1 can repeat, call it M; if keys in data source 2 can repeat, call it N. A join is then of type 1*1, M*1, or M*N:

1*1 type: in reduce, record the first value, then merge it with the next record and output directly.

M*1 type: give the "1" side the smaller tag so its record sorts first; record the value each time label=1 is seen, and output one final result each time label=2 is seen (this is the case implemented above, with tags 0 and 1).

M*N type: when label=1 is seen, append the value to an array; when label=2 is seen, output every recorded array value merged with the current record (see the sketch below).

Impact type: number of value fields. Affects how much data must be recorded each time label=1; all of the value fields need to be recorded.
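For completeness, a minimal sketch of the M*N case under the same conventions as the scripts above; the field layout (key, label, then value fields) and the labels 1 and 2 are assumptions for illustration, not code from the original article:

# mn_reducer.py: sketch of the M*N join case
# (assumes input lines are key<TAB>label<TAB>value..., sorted by key then label)
import sys

lastkey = ""
values = []          # values seen with label=1 for the current key
for line in sys.stdin:
    if line.strip() == "":
        continue
    fields = line[:-1].split("\t")
    key, label = fields[0], fields[1]
    if key != lastkey:
        values = []  # new key: forget the previous key's label=1 values
    if label == "1":
        # record every label=1 value in the array
        values.append(fields[2:])
    elif label == "2":
        # output each recorded value merged with the current record
        for v in values:
            print '\t'.join([key] + v + fields[2:])
    lastkey = key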

Please keep a link to the original article when reprinting.

Original article: "Hadoop uses python to implement join operations between data sets" at www.crazyant.net. Thanks to the original author for sharing.
