When a big-data job involves a data-cleaning step that is easier to express in a scripting language, Pig can load a file from HDFS, stream it through an external Ruby script, and take the processed data stream back into Pig. The following is a simple example.
Note: the Ruby streaming code uses the wukong gem, available at:
https://github.com/mrflip/wukong
Pig loads the distributed file and invokes the Ruby streaming script. The code is as follows:
log = LOAD '$INFILE' USING PigStorage('\t');
DEFINE tra_parser `/usr/bin/ruby parse_click.rb --map` SHIP('parse_click.rb', 'click_tracking.rb');
strmo = STREAM log THROUGH tra_parser;
STORE strmo INTO '$OUTFILE' USING PigStorage('\t');
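Pig's STREAM operator hands each tuple to the external command as one tab-separated line on stdin, and each tab-separated line the command prints becomes an output tuple. A minimal sketch of that contract (the field layout here is hypothetical, not from the script above):

```ruby
# One step of the streaming contract: take a tab-separated input line,
# return the tab-separated output line Pig will turn into a tuple.
def process_line(line)
  fields = line.chomp.split("\t")
  # e.g. pass the first field through and upcase the second
  [fields[0], fields[1].to_s.upcase].join("\t")
end

# In a real streaming script the loop would be:
#   STDIN.each_line { |line| puts process_line(line) }
puts process_line("203.0.113.9\tclick")  # prints "203.0.113.9\tCLICK"
```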
parse_click.rb is as follows:
require 'wukong'
require 'json'
require './click_tra.rb'

module ParseClick
  class Mapper < Wukong::Streamer::RecordStreamer
    def before_stream
      @bad_count = 0
    end

    def after_stream
      raise RuntimeError, "exceeded bad records: #{@bad_count}" if @bad_count > 10
    end

    def process *records
      yield ClickTra.new(JSON.parse(records[2])).to_a
    rescue => e
      @bad_count += 1
      warn "bad record #{e}: #{records[2]}"
    end
  end
end

Wukong.run ParseClick::Mapper, nil
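The mapper's logic can be shown without the wukong gem. This is a plain-Ruby sketch of the same idea: the third tab-separated field of each record holds a JSON payload that is parsed and re-emitted, and malformed records are counted rather than aborting the whole job. The class and field names here are illustrative, not from the original script.

```ruby
require 'json'

# Plain-Ruby sketch of the wukong mapper's behavior.
class SimpleMapper
  attr_reader :bad_count

  def initialize
    @bad_count = 0
  end

  # Returns the emitted tab-separated line, or nil for a bad record.
  def process(record)
    fields = record.chomp.split("\t")
    JSON.parse(fields[2]).values.join("\t")
  rescue => e
    @bad_count += 1
    warn "bad record #{e}: #{fields[2]}"
    nil
  end
end
```

A bad JSON payload raises inside `process`, is rescued, and only bumps the counter, mirroring how the wukong version tolerates up to 10 bad records before failing.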
click_tra.rb is as follows:
require 'date'
require './models.rb'

class ClickTra
  output :ip
  output :c_date
  # output your other attributes

  def c_date
    click_date.strftime("%Y%m%d").to_i
  end

  def ip
    browser_ip.to_i
  end
end
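The `output` class macro above presumably comes from models.rb, which is not shown. As a hedged sketch of how such a macro might work, the following records the declared attribute names so that `to_a` can emit them in declaration order (this is an assumption for illustration, not the actual models.rb code):

```ruby
# Hypothetical base class providing an `output` class macro.
class Record
  # Declares an attribute to be included in to_a output.
  def self.output(name)
    (@outputs ||= []) << name
  end

  def self.outputs
    @outputs || []
  end

  # Emits the declared attributes, in declaration order.
  def to_a
    self.class.outputs.map { |name| send(name) }
  end
end

# Illustrative subclass in the style of ClickTra.
class ExampleClick < Record
  output :ip
  output :c_date

  def initialize(attrs)
    @attrs = attrs
  end

  def ip
    @attrs['ip'].to_i
  end

  def c_date
    @attrs['date'].to_i
  end
end
```

With such a base class, `ExampleClick.new('ip' => '42', 'date' => '20200101').to_a` yields `[42, 20200101]`, which is the array form Pig receives back as a tuple.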
Where:

strmo = STREAM log THROUGH tra_parser; runs the log relation through the external program defined as tra_parser.

Wukong.run ParseClick::Mapper, nil executes the mapper, after which the Ruby output is streamed back to Pig.

STORE strmo INTO '$OUTFILE' USING PigStorage('\t'); persists the result.