Hadoop Streaming with Python: Grouping (Partitioning) and Secondary Sort

Grouping (Partitioning)

By default, the Hadoop Streaming framework treats everything in a map output line up to the first tab ('\t') as the key and the remainder as the value; if a line contains no tab, the whole line is the key and the value is empty. These key/value pairs are also what the reducers receive as input. The relevant options are:
-D stream.map.output.field.separator — the separator between key and value (defaults to '\t')
-D stream.num.map.output.key.fields — how many fields make up the key
-D map.output.key.field.separator — the separator between fields inside the key
-D num.key.fields.for.partition — how many leading key fields are used for partitioning, instead of the whole key
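The effect of num.key.fields.for.partition can be sketched in plain Python. This is a simulation only: the real KeyFieldBasedPartitioner is a Java class with its own hash function, but the grouping behavior it guarantees is the same.

```python
# Sketch of how num.key.fields.for.partition=1 groups keys (a Python
# simulation; the Java KeyFieldBasedPartitioner hashes differently,
# but records with the same leading field still co-locate).
def partition(key, num_fields=1, sep=",", num_reducers=2):
    # Only the first num_fields components of the key decide the reducer.
    prefix = sep.join(key.split(sep)[:num_fields])
    return hash(prefix) % num_reducers

# Records sharing the same first key field land on the same reducer:
assert partition("Lu,Lu V73930") == partition("Lu,Lu V75066")
```

This is why, in the job below, all records for one province end up in the same reduce task even though their full keys differ.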
Preparing Data

Lu V73930, Lu, 549
Black ML1711, Black, 235
Lu V75066, Lu, 657
Gui J73031, GUI, 900
Jin M42387, Jin, 432
Gui J73138, GUI, 456
Jin M41665, Jin, 879
Jin M42529, Jin, 790

step_run.sh

#!/bin/bash
exec_path=$(dirname "$0")
HPHOME=/opt/cloudera/parcels/CDH
jar_package=/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar
in_path=/user/h_chenliling/test.txt.lzo
out_path=/user/h_chenliling/testout.txt
map_file=${exec_path}/step_map.py
red_file=${exec_path}/step_red.py
$HPHOME/bin/hadoop fs -rm -r $out_path
# map.output.key.field.separator=,  : separator between fields inside the key
# num.key.fields.for.partition=1    : partition on the first key field only
# KeyFieldBasedPartitioner          : the partitioner class that honors these options
$HPHOME/bin/hadoop jar $jar_package \
    -D mapred.job.queue.name=bdev \
    -D stream.map.input.ignoreKey=true \
    -D map.output.key.field.separator=, \
    -D num.key.fields.for.partition=1 \
    -numReduceTasks 2 \
    -input $in_path \
    -output $out_path \
    -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
    -mapper $map_file \
    -reducer $red_file \
    -file $map_file \
    -file $red_file \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
$HPHOME/bin/hadoop fs -ls $out_path
step_map.py
#!/usr/bin/env python
# coding=utf-8
import sys

for line in sys.stdin:
    line = line.strip()
    seq = line.split(",")
    if len(seq) >= 3:
        plate = seq[0]     # plate number
        province = seq[1]  # province of registration
        mile = seq[2]      # mileage
        print province + "," + plate + "\t" + mile
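The mapper logic can be sanity-checked locally before submitting the job. The helper below is a hypothetical rewrap of the loop body (it also strips whitespace around fields, which the streaming script above leaves in place):

```python
# Hypothetical helper wrapping the mapper's per-line logic so it can be
# tested without Hadoop; the per-field strip() is an addition here.
def map_line(line):
    seq = line.strip().split(",")
    if len(seq) >= 3:
        plate = seq[0].strip()     # plate number
        province = seq[1].strip()  # province of registration
        mile = seq[2].strip()      # mileage
        return province + "," + plate + "\t" + mile
    return None

print(repr(map_line("Lu V73930, Lu, 549")))  # 'Lu,Lu V73930\t549'
```

The tab separates key from value, so the framework treats "Lu,Lu V73930" as the key and "549" as the value.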
step_red.py
#!/usr/bin/env python
# coding=utf-8
import sys

prov = ""
sum_mile = 0
for line in sys.stdin:
    line = line.strip()
    seq = line.split("\t")
    mile = int(seq[1])
    if prov == "":
        prov = seq[0].split(",")[0]
        sum_mile = mile
    else:
        if prov == seq[0].split(",")[0]:
            # same group
            sum_mile = sum_mile + mile
        else:
            # different group: output the previous group's total
            print "%s\t%d" % (prov, sum_mile)
            sum_mile = mile
            prov = seq[0].split(",")[0]
print "%s\t%d" % (prov, sum_mile)
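The reducer's grouping logic can likewise be exercised locally on sorted key/value lines, simulating what one reduce task receives after the shuffle (function and sample names here are illustrative):

```python
# Simulation of the reducer: sum mileage per province over lines that
# arrive sorted by key, as they would from the shuffle phase.
def reduce_lines(lines):
    totals = []
    prov, total = None, 0
    for line in lines:
        key, mile = line.strip().split("\t")
        p = key.split(",")[0]
        if p == prov:
            total += int(mile)
        else:
            if prov is not None:
                totals.append((prov, total))
            prov, total = p, int(mile)
    if prov is not None:
        totals.append((prov, total))
    return totals

sample = sorted([
    "Jin,Jin M42387\t432",
    "Jin,Jin M41665\t879",
    "Lu,Lu V73930\t549",
])
print(reduce_lines(sample))  # [('Jin', 1311), ('Lu', 549)]
```

Sorting the sample first matters: the reducer only detects group boundaries between adjacent lines, exactly as the streaming script does.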
Output Results:

hadoop fs -text /user/h_chenliling/testout.txt/part-00000
Jin 2101
Lu 1775
hadoop fs -text /user/h_chenliling/testout.txt/part-00001
GUI 1356
Black 235

Supplement

In fact, KeyFieldBasedPartitioner also has a more advanced parameter, mapred.text.key.partitioner.options, which can be regarded as an upgrade of num.key.fields.for.partition: rather than only taking the first N fields of the key, it can designate an individual field, or several fields together, as the partition key.
For example, the requirement above can be expressed as mapred.text.key.partitioner.options=-k1,1.

Secondary Sort

Mapper output is partitioned to the reducers and sorted along the way. By default the sort covers the whole key: if the key has several columns, rows are ordered by the first column, with ties broken by the second column, and so on.
To customize the sort, you control which fields inside the key are compared, whether the comparison is lexicographic or numeric, and whether the order is ascending or descending. The parameter for this is mapred.text.key.comparator.options, used together with org.apache.hadoop.mapred.lib.KeyFieldBasedComparator, which compares on the selected fields within the key.

Preparing Data

Lu V73930, Lu, 2,549
Black ML1711, black, 1,235
Lu V75066, Lu, 1,657
Gui J73031, GUI, 1,900
Jin M42387, Jin, 3,432
Gui J73138, GUI, 2,456
Jin M41665, Jin, 2,879
Jin M42529, Jin, 1,790
Lu V75530, Lu, 3,569

step_run.sh

#!/bin/bash
exec_path=$(dirname "$0")
HPHOME=/opt/cloudera/parcels/CDH
jar_package=/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar
in_path=/user/h_chenliling/test1.txt.lzo
out_path=/user/h_chenliling/testout1.txt
map_file=${exec_path}/step_map.py
red_file=${exec_path}/step_red.py
$HPHOME/bin/hadoop fs -rm -r $out_path
# stream.map.output.field.separator=,        : separator between key and value
# stream.num.map.output.key.fields=3         : first three fields form the key
# map.output.key.field.separator=,           : separator between fields inside the key
# num.key.fields.for.partition=1             : partition on the first key field only
# mapred.text.key.comparator.options=-k3,3nr : sort on the 3rd key field, numeric, descending
$HPHOME/bin/hadoop jar $jar_package \
    -D mapred.job.queue.name=bdev \
    -D stream.map.input.ignoreKey=true \
    -D stream.map.output.field.separator=, \
    -D stream.num.map.output.key.fields=3 \
    -D map.output.key.field.separator=, \
    -D num.key.fields.for.partition=1 \
    -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D mapred.text.key.comparator.options=-k3,3nr \
    -numReduceTasks 5 \
    -input $in_path \
    -output $out_path \
    -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
    -mapper $map_file \
    -reducer $red_file \
    -file $map_file \
    -file $red_file \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
$HPHOME/bin/hadoop fs -ls $out_path
step_map.py
#!/usr/bin/env python
# coding=utf-8
import sys

for line in sys.stdin:
    line = line.strip()
    seq = line.split(",")
    if len(seq) >= 4:
        plate = seq[0]     # plate number
        province = seq[1]  # province of registration
        order = seq[2]     # sort field
        mile = seq[3]      # mileage
        print province + "," + plate + "," + order + "," + mile
step_red.py
#!/usr/bin/env python
# coding=utf-8
import sys

for line in sys.stdin:
    line = line.strip()
    print line
Output Results

hadoop fs -text /user/h_chenliling/testout1.txt/part-00000
Lu,Lu V73930,2 549
Lu,Lu V75066,1 657
Black,Black ML1711,1 235
hadoop fs -text /user/h_chenliling/testout1.txt/part-00001
GUI,Gui J73138,2 456
GUI,Gui J73031,1 900
hadoop fs -text /user/h_chenliling/testout1.txt/part-00002
Jin,Jin M42387,3 432
Jin,Jin M41665,2 879
Jin,Jin M42529,1 790
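Within each partition, the ordering above is just a descending numeric sort on the third key field. The effect of -k3,3nr can be mimicked in plain Python (a sketch, not the framework's comparator):

```python
# Mimic mapred.text.key.comparator.options=-k3,3nr: sort keys by their
# 3rd comma-separated field, numerically, in reverse (descending) order.
keys = [
    "Jin,Jin M42529,1",
    "Jin,Jin M42387,3",
    "Jin,Jin M41665,2",
]
ordered = sorted(keys, key=lambda k: int(k.split(",")[2]), reverse=True)
print(ordered)
# ['Jin,Jin M42387,3', 'Jin,Jin M41665,2', 'Jin,Jin M42529,1']
```

The n modifier selects numeric comparison (so "10" sorts after "9") and r reverses the order, matching the 3, 2, 1 sequence seen in part-00002.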
