Use Python to join data sets in Hadoop


Introduction to Hadoop streaming

Hadoop ships with a tool called streaming that supports Python, shell, C++, PHP, and any other language that can read from stdin and write to stdout. How it works is easiest to see by comparing it with a standard Java map-reduce program:

Implementing the map-reduce program in native Java:
  1. Hadoop prepares the data and sends it to the Java map program.
  2. The Java map program processes the data and outputs O1.
  3. Hadoop splits and sorts O1 and sends it to different reduce machines.
  4. Each reduce machine passes the data to its reduce program.
  5. The reduce program processes the data and outputs the final data O2.

Implementing the map-reduce program in Python with Hadoop streaming:
  1. Hadoop prepares the data and sends it to the Java map program.
  2. The Java map program turns the data into key/value pairs and sends them to the Python map program.
  3. The Python map program processes the data and returns the result to the Java map program.
  4. The Java map program outputs the data as O1.
  5. Hadoop splits and sorts O1 and sends it to different reduce machines.
  6. Each reduce machine turns the incoming data into key/value pairs and sends them to the Python reduce program.
  7. The Python reduce program processes the data and returns the result to the Java reduce program.
  8. The Java reduce program processes the data and outputs the final data O2.

Comparing the map steps and the reduce steps of the two lists, the streaming program has one extra intermediate processing step on each side, so the efficiency and performance of a streaming program should be lower than those of the Java version. However, Python's development efficiency (and sometimes its running performance) is higher than Java's, and that is the advantage of streaming.
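To make that stdin/stdout contract concrete, here is a minimal word-count mapper/reducer pair written in Python 2 to match the article's scripts. The file names wc_mapper.py and wc_reducer.py and the word-count task itself are illustrative assumptions, separate from the join example that follows.

# wc_mapper.py: minimal streaming mapper (illustrative sketch)
import sys

for line in sys.stdin:
    for word in line.strip().split():
        # emit key<TAB>value; Hadoop sorts these lines by key
        print '\t'.join((word, '1'))

# wc_reducer.py: minimal streaming reducer (illustrative sketch)
import sys

last_key = None
count = 0
for line in sys.stdin:
    key, value = line.strip().split('\t')
    if last_key is not None and key != last_key:
        # key changed; the sorted input guarantees last_key is finished
        print '\t'.join((last_key, str(count)))
        count = 0
    last_key = key
    count += int(value)
if last_key is not None:
    print '\t'.join((last_key, str(count)))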

Why Hadoop needs to join data sets

Hadoop is used for data analysis, and most of its operations work on whole data sets, so joining two sets is very common: it lets the records of one set pick up the corresponding information from the other set.

For example, consider the following requirement. There are two data sets: student information (student ID, name) and student scores (student ID, course, score). Their common primary key is the student ID. We want to combine the two to obtain (student ID, name, course, score):

(student ID, name) join (student ID, course, score) = (student ID, name, course, score)

Data example 1: student information

sno (student ID)    name
01                  Name1
02                  Name2
03                  Name3
04                  Name4

Data example 2: student scores

sno (student ID)    courseno (course no.)    grade
01                  01                       80
01                  02                       90
02                  01                       82
02                  02                       95

Expected final output:

sno (student ID)    name     courseno    grade
01                  Name1    01          80
01                  Name1    02          90
02                  Name2    01          82
02                  Name2    02          95

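For testing, the two input tables can be written as tab-separated files. The file names data_info and data_grade match what the mapper script below keys on; the tab delimiter follows the split("\t") used in the scripts.

# data_info (tab-separated: sno<TAB>name)
01	Name1
02	Name2
03	Name3
04	Name4

# data_grade (tab-separated: sno<TAB>courseno<TAB>grade)
01	01	80
01	02	90
02	01	82
02	02	95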
Considerations for implementing join

If you want to write a sound and robust map-reduce program, I suggest you first work out the format of the input data and the format of the output data, then construct some input data by hand and compute the expected output by hand. In doing so you will notice the points that matter when writing the program:

  1. What is the key of the join? Is it one field or two? In this example the key is sno, a single field.
  2. Can keys repeat within each set? In this example, keys in data 1 cannot repeat, while keys in data 2 can.
  3. Can a key be missing from one of the sets? In this example, a student may have no scores, so a key from data 1 may have no records in data 2.

Point 1 affects the key-field and partition settings in the Hadoop startup script, point 2 affects how the map-reduce code is implemented, and point 3 also affects how the code is written.

How Hadoop implements the join operation

The idea is to attach a numeric tag to each data source, so that after Hadoop sorts the data, records with the same key end up next to each other, ordered by their tags; adjacent records with the same key can then be merged and output directly.

1. Map stage: tag the records of table 1 and table 2, which in effect adds an extra output field. In this example, records from table 1 are tagged 0 and records from table 2 are tagged 1.

2. Partition stage: partition and sort the data using the student ID as the first sort key and the tag as the second sort key.

3. Reduce stage: because the first and second sort keys are already in order, adjacent records with the same key are merged and output (see the trace below).
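With the sample data above and tags 0 and 1, the merged and sorted stream that reaches the reducer would look like this (fields are tab-separated; shown with spaces here for readability):

01  0  Name1
01  1  01  80
01  1  02  90
02  0  Name2
02  1  01  82
02  1  02  95
03  0  Name3
04  0  Name4

For each key the tag-0 record (the name) arrives first, so the reducer only has to remember it and merge it into every following tag-1 record with the same key.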

Implementing the join's map and reduce code in Python

mapper.py code:

# -*- coding: utf-8 -*-
# mapper.py
# from crazyant www.crazyant.net
import os
import sys

def mapper():
    # Get the name of the file currently being processed;
    # there are two input files, so we must tell them apart.
    filepath = os.environ["map_input_file"]
    filename = os.path.split(filepath)[-1]
    for line in sys.stdin:
        if line.strip() == "":
            continue
        fields = line[:-1].split("\t")
        sno = fields[0]
        # The filename test below is needed because the two files
        # have different fields and need different tags.
        if filename == 'data_info':
            name = fields[1]
            # The number '0' is the unified tag for data source 1
            print '\t'.join((sno, '0', name))
        elif filename == 'data_grade':
            courseno = fields[1]
            grade = fields[2]
            # The number '1' is the unified tag for data source 2
            print '\t'.join((sno, '1', courseno, grade))

if __name__ == '__main__':
    mapper()
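Because the script reads the map_input_file environment variable, which Hadoop streaming sets for each map task, a quick local test has to set it by hand. A hypothetical shell session, assuming the sample input files shown earlier:

# Hadoop sets map_input_file for each map task; locally we set it ourselves
map_input_file=data_info python mapper.py < data_info
# 01  0  Name1
# 02  0  Name2
# ...
map_input_file=data_grade python mapper.py < data_grade
# 01  1  01  80
# ...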

reducer.py code:

# -*- coding: utf-8 -*-
# reducer.py
# from crazyant www.crazyant.net
import sys

def reducer():
    # lastsno records the previous sno, so we can tell whether
    # the current record has the same key as the previous one
    lastsno = ""
    for line in sys.stdin:
        if line.strip() == "":
            continue
        fields = line[:-1].split("\t")
        sno = fields[0]
        '''
        Processing logic:
        when the current key differs from the previous key and the tag is 0,
        record the name value;
        when the current key equals the previous key and the tag is 1,
        output the recorded name together with this record as a final result.
        '''
        if sno != lastsno:
            name = ""
            # No need to handle tag == 1 here: sno != lastsno with
            # tag == 1 means this key has no record in data source 1
            if fields[1] == "0":
                name = fields[2]
        elif sno == lastsno:
            # No need to handle tag == 0 here: sno == lastsno with
            # tag == 0 cannot happen, since keys in data source 1 do not repeat
            if fields[1] == "1":
                courseno = fields[2]
                grade = fields[3]
                if name:
                    print '\t'.join((lastsno, name, courseno, grade))
        lastsno = sno

if __name__ == '__main__':
    reducer()

Shell script to launch the Hadoop job:

# Delete the output directory first
~/hadoop-client/hadoop/bin/hadoop fs -rmr /hdfs/jointest/output

# from crazyant www.crazyant.net
# Note: the environment paths below differ from machine to machine
~/hadoop-client/hadoop/bin/hadoop streaming \
    -D mapred.map.tasks=10 \
    -D mapred.reduce.tasks=5 \
    -D mapred.job.map.capacity=10 \
    -D mapred.job.reduce.capacity=5 \
    -D mapred.job.name="join--sno_name-sno_courseno_grade" \
    -D num.key.fields.for.partition=1 \
    -D stream.num.map.output.key.fields=2 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -input "/hdfs/jointest/input/*" \
    -output "/hdfs/jointest/output" \
    -mapper "python26/bin/python26.sh mapper.py" \
    -reducer "python26/bin/python26.sh reducer.py" \
    -file "mapper.py" \
    -file "reducer.py" \
    -cacheArchive "/share/python26.tar.gz#python26"

# Check whether the job succeeded: prints 0 on success
echo $?
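Two of the settings above do the heavy lifting: num.key.fields.for.partition=1 tells the KeyFieldBasedPartitioner to partition on the first field only (sno), so all records of one student reach the same reducer, while stream.num.map.output.key.fields=2 makes Hadoop sort on the first two fields (sno first, then the tag). After the job finishes, the result can be inspected with the usual HDFS commands, for example:

# List and print the job output (paths as in the script above)
~/hadoop-client/hadoop/bin/hadoop fs -ls /hdfs/jointest/output
~/hadoop-client/hadoop/bin/hadoop fs -cat "/hdfs/jointest/output/part-*"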

You can construct input data by hand, compute the expected output by hand, and compare the two to test the program; this program has been verified that way.
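The whole flow can also be simulated locally before submitting to the cluster; here sort -k1,1 -k2,2 stands in for Hadoop's partition-and-sort step (a sketch, assuming the data_info and data_grade files shown earlier):

(map_input_file=data_info python mapper.py < data_info; \
 map_input_file=data_grade python mapper.py < data_grade) \
    | sort -k1,1 -k2,2 \
    | python reducer.py
# expected output, matching the table above:
# 01  Name1  01  80
# 01  Name1  02  90
# 02  Name2  01  82
# 02  Name2  02  95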

More notes

Hadoop join operations come in many varieties, and each variety calls for a differently written script. They can be classified by the number of key fields, by the number of value fields, and by whether keys can repeat. The following summary shows what each factor affects:

Impact type: number of key fields. Affects:

  1. The num.key.fields.for.partition setting in the startup script
  2. The stream.num.map.output.key.fields setting in the startup script
  3. How the key is extracted in the map and reduce scripts
  4. How each record is compared with the previous one in the map and reduce scripts

Impact type: whether keys can repeat. If keys in data source 1 can repeat, call it M; if keys in data source 2 can repeat, call it N. A join is then of type 1*1, M*1, or M*N:

1*1 type: in reduce, record the first value, then merge it with the next record and output directly.

M*1 type: give the "1" side the smaller tag so its record sorts first; record the value each time label=1 is seen, and output one final result each time label=2 is seen (this is the case implemented above, with tags 0 and 1).

M*N type: when label=1 is seen, append the value to an array; when label=2 is seen, output every recorded array value merged with the current record (see the sketch below).

Impact type: number of value fields. Affects how much data must be recorded each time label=1; all of the value fields need to be recorded.
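For completeness, a minimal sketch of the M*N case under the same conventions as the scripts above; the field layout (key, label, then value fields) and the labels 1 and 2 are assumptions for illustration, not code from the original article:

# mn_reducer.py: sketch of the M*N join case
# (assumes input lines are key<TAB>label<TAB>value..., sorted by key then label)
import sys

lastkey = ""
values = []          # values seen with label=1 for the current key
for line in sys.stdin:
    if line.strip() == "":
        continue
    fields = line[:-1].split("\t")
    key, label = fields[0], fields[1]
    if key != lastkey:
        values = []  # new key: forget the previous key's label=1 values
    if label == "1":
        # record every label=1 value in the array
        values.append(fields[2:])
    elif label == "2":
        # output each recorded value merged with the current record
        for v in values:
            print '\t'.join([key] + v + fields[2:])
    lastkey = key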

Please keep a link to the original article when reprinting.

Original article: "Hadoop uses python to implement join operations between data sets" at www.crazyant.net. Thanks to the original author for sharing.
