Hadoop Streaming with Python: Grouping (Partitioning) and Secondary Sort

Grouping (Partitioning)

By default, the Hadoop Streaming framework treats everything in a map output line up to the first tab ('\t') as the key and the remainder as the value; if a line contains no tab, the whole line is the key and the value is empty. These key/value pairs are also what the reducers receive as input. The relevant options are:
-D stream.map.output.field.separator — the separator between key and value (defaults to '\t')
-D stream.num.map.output.key.fields — how many fields make up the key
-D map.output.key.field.separator — the separator between fields inside the key
-D num.key.fields.for.partition — how many leading key fields are used for partitioning, instead of the whole key
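The effect of num.key.fields.for.partition can be sketched in plain Python. This is a simulation only: the real KeyFieldBasedPartitioner is a Java class with its own hash function, but the grouping behavior it guarantees is the same.

```python
# Sketch of how num.key.fields.for.partition=1 groups keys (a Python
# simulation; the Java KeyFieldBasedPartitioner hashes differently,
# but records with the same leading field still co-locate).
def partition(key, num_fields=1, sep=",", num_reducers=2):
    # Only the first num_fields components of the key decide the reducer.
    prefix = sep.join(key.split(sep)[:num_fields])
    return hash(prefix) % num_reducers

# Records sharing the same first key field land on the same reducer:
assert partition("Lu,Lu V73930") == partition("Lu,Lu V75066")
```

This is why, in the job below, all records for one province end up in the same reduce task even though their full keys differ.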
Preparing Data

Lu V73930, Lu, 549
Black ML1711, Black, 235
Lu V75066, Lu, 657
Gui J73031, GUI, 900
Jin M42387, Jin, 432
Gui J73138, GUI, 456
Jin M41665, Jin, 879
Jin M42529, Jin, 790

step_run.sh

#!/bin/bash
exec_path=$(dirname "$0")
HPHOME=/opt/cloudera/parcels/CDH
jar_package=/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar
in_path=/user/h_chenliling/test.txt.lzo
out_path=/user/h_chenliling/testout.txt
map_file=${exec_path}/step_map.py
red_file=${exec_path}/step_red.py
$HPHOME/bin/hadoop fs -rm -r $out_path
# map.output.key.field.separator=,  : separator between fields inside the key
# num.key.fields.for.partition=1    : partition on the first key field only
# KeyFieldBasedPartitioner          : the partitioner class that honors these options
$HPHOME/bin/hadoop jar $jar_package \
    -D mapred.job.queue.name=bdev \
    -D stream.map.input.ignoreKey=true \
    -D map.output.key.field.separator=, \
    -D num.key.fields.for.partition=1 \
    -numReduceTasks 2 \
    -input $in_path \
    -output $out_path \
    -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
    -mapper $map_file \
    -reducer $red_file \
    -file $map_file \
    -file $red_file \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
$HPHOME/bin/hadoop fs -ls $out_path
step_map.py
#!/usr/bin/env python
# coding=utf-8
import sys

for line in sys.stdin:
    line = line.strip()
    seq = line.split(",")
    if len(seq) >= 3:
        plate = seq[0]     # plate number
        province = seq[1]  # province of registration
        mile = seq[2]      # mileage
        print province + "," + plate + "\t" + mile
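The mapper logic can be sanity-checked locally before submitting the job. The helper below is a hypothetical rewrap of the loop body (it also strips whitespace around fields, which the streaming script above leaves in place):

```python
# Hypothetical helper wrapping the mapper's per-line logic so it can be
# tested without Hadoop; the per-field strip() is an addition here.
def map_line(line):
    seq = line.strip().split(",")
    if len(seq) >= 3:
        plate = seq[0].strip()     # plate number
        province = seq[1].strip()  # province of registration
        mile = seq[2].strip()      # mileage
        return province + "," + plate + "\t" + mile
    return None

print(repr(map_line("Lu V73930, Lu, 549")))  # 'Lu,Lu V73930\t549'
```

The tab separates key from value, so the framework treats "Lu,Lu V73930" as the key and "549" as the value.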
step_red.py
#!/usr/bin/env python
# coding=utf-8
import sys

prov = ""
sum_mile = 0
for line in sys.stdin:
    line = line.strip()
    seq = line.split("\t")
    mile = int(seq[1])
    if prov == "":
        prov = seq[0].split(",")[0]
        sum_mile = mile
    else:
        if prov == seq[0].split(",")[0]:
            # same group
            sum_mile = sum_mile + mile
        else:
            # different group: output the previous group's total
            print "%s\t%d" % (prov, sum_mile)
            sum_mile = mile
            prov = seq[0].split(",")[0]
print "%s\t%d" % (prov, sum_mile)
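The reducer's grouping logic can likewise be exercised locally on sorted key/value lines, simulating what one reduce task receives after the shuffle (function and sample names here are illustrative):

```python
# Simulation of the reducer: sum mileage per province over lines that
# arrive sorted by key, as they would from the shuffle phase.
def reduce_lines(lines):
    totals = []
    prov, total = None, 0
    for line in lines:
        key, mile = line.strip().split("\t")
        p = key.split(",")[0]
        if p == prov:
            total += int(mile)
        else:
            if prov is not None:
                totals.append((prov, total))
            prov, total = p, int(mile)
    if prov is not None:
        totals.append((prov, total))
    return totals

sample = sorted([
    "Jin,Jin M42387\t432",
    "Jin,Jin M41665\t879",
    "Lu,Lu V73930\t549",
])
print(reduce_lines(sample))  # [('Jin', 1311), ('Lu', 549)]
```

Sorting the sample first matters: the reducer only detects group boundaries between adjacent lines, exactly as the streaming script does.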
Output Results:

hadoop fs -text /user/h_chenliling/testout.txt/part-00000
Jin 2101
Lu 1775
hadoop fs -text /user/h_chenliling/testout.txt/part-00001
GUI 1356
Black 235

Supplement

In fact, KeyFieldBasedPartitioner also has a more advanced parameter, mapred.text.key.partitioner.options, which can be regarded as an upgrade of num.key.fields.for.partition: rather than only taking the first N fields of the key, it can designate an individual field, or several fields together, as the partition key.
For example, the requirement above can be expressed as mapred.text.key.partitioner.options=-k1,1.

Secondary Sort

Mapper output is partitioned to the reducers and sorted along the way. By default the sort covers the whole key: if the key has several columns, rows are ordered by the first column, with ties broken by the second column, and so on.
To customize the sort, you control which fields inside the key are compared, whether the comparison is lexicographic or numeric, and whether the order is ascending or descending. The parameter for this is mapred.text.key.comparator.options, used together with org.apache.hadoop.mapred.lib.KeyFieldBasedComparator, which compares on the selected fields within the key.

Preparing Data

Lu V73930, Lu, 2,549
Black ML1711, black, 1,235
Lu V75066, Lu, 1,657
Gui J73031, GUI, 1,900
Jin M42387, Jin, 3,432
Gui J73138, GUI, 2,456
Jin M41665, Jin, 2,879
Jin M42529, Jin, 1,790
Lu V75530, Lu, 3,569

step_run.sh

#!/bin/bash
exec_path=$(dirname "$0")
HPHOME=/opt/cloudera/parcels/CDH
jar_package=/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar
in_path=/user/h_chenliling/test1.txt.lzo
out_path=/user/h_chenliling/testout1.txt
map_file=${exec_path}/step_map.py
red_file=${exec_path}/step_red.py
$HPHOME/bin/hadoop fs -rm -r $out_path
# stream.map.output.field.separator=,        : separator between key and value
# stream.num.map.output.key.fields=3         : first three fields form the key
# map.output.key.field.separator=,           : separator between fields inside the key
# num.key.fields.for.partition=1             : partition on the first key field only
# mapred.text.key.comparator.options=-k3,3nr : sort on the 3rd key field, numeric, descending
$HPHOME/bin/hadoop jar $jar_package \
    -D mapred.job.queue.name=bdev \
    -D stream.map.input.ignoreKey=true \
    -D stream.map.output.field.separator=, \
    -D stream.num.map.output.key.fields=3 \
    -D map.output.key.field.separator=, \
    -D num.key.fields.for.partition=1 \
    -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D mapred.text.key.comparator.options=-k3,3nr \
    -numReduceTasks 5 \
    -input $in_path \
    -output $out_path \
    -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
    -mapper $map_file \
    -reducer $red_file \
    -file $map_file \
    -file $red_file \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
$HPHOME/bin/hadoop fs -ls $out_path
step_map.py
#!/usr/bin/env python
# coding=utf-8
import sys

for line in sys.stdin:
    line = line.strip()
    seq = line.split(",")
    if len(seq) >= 4:
        plate = seq[0]     # plate number
        province = seq[1]  # province of registration
        order = seq[2]     # sort field
        mile = seq[3]      # mileage
        print province + "," + plate + "," + order + "," + mile
step_red.py
#!/usr/bin/env python
# coding=utf-8
import sys

for line in sys.stdin:
    line = line.strip()
    print line
Output Results

hadoop fs -text /user/h_chenliling/testout1.txt/part-00000
Lu,Lu V73930,2 549
Lu,Lu V75066,1 657
Black,Black ML1711,1 235
hadoop fs -text /user/h_chenliling/testout1.txt/part-00001
GUI,Gui J73138,2 456
GUI,Gui J73031,1 900
hadoop fs -text /user/h_chenliling/testout1.txt/part-00002
Jin,Jin M42387,3 432
Jin,Jin M41665,2 879
Jin,Jin M42529,1 790
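Within each partition, the ordering above is just a descending numeric sort on the third key field. The effect of -k3,3nr can be mimicked in plain Python (a sketch, not the framework's comparator):

```python
# Mimic mapred.text.key.comparator.options=-k3,3nr: sort keys by their
# 3rd comma-separated field, numerically, in reverse (descending) order.
keys = [
    "Jin,Jin M42529,1",
    "Jin,Jin M42387,3",
    "Jin,Jin M41665,2",
]
ordered = sorted(keys, key=lambda k: int(k.split(",")[2]), reverse=True)
print(ordered)
# ['Jin,Jin M42387,3', 'Jin,Jin M41665,2', 'Jin,Jin M42529,1']
```

The n modifier selects numeric comparison (so "10" sorts after "9") and r reverses the order, matching the 3, 2, 1 sequence seen in part-00002.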
