Stream processing with Pig and Ruby: calling a Ruby script from a Pig job
Scripts make the data-cleansing steps of big-data jobs easier to manage. This article walks through a simple case of stream processing in Pig: Pig loads a file from HDFS, pipes the records through a Ruby script, and receives the processed data back.
Note: the wukong gem is used for the Ruby stream processing; it can be downloaded from:
https://github.com/mrflip/wukong
The Pig script below loads the distributed file and streams it through the Ruby script:
log = LOAD '$INFILE' USING PigStorage('\t');
DEFINE tra_parser `/usr/bin/ruby parse_click.rb --map` SHIP('parse_click.rb', 'click_tra.rb');
strmo = STREAM log THROUGH tra_parser;
STORE strmo INTO '$OUTFILE' USING PigStorage('\t');
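Under the hood, Pig's STREAM operator hands each tuple to the external command as one tab-separated line on stdin and turns each tab-separated line the command prints to stdout back into a tuple. A minimal sketch of that contract in plain Ruby (the upcasing here is just an illustrative transform, not part of the article's job):

```ruby
# Sketch of the contract Pig's STREAM operator expects from an external
# command: one tab-separated input line per tuple in, one tab-separated
# output line per tuple out. The upcasing transform is illustrative only.
def stream_tuple(line)
  fields = line.chomp.split("\t")   # the input tuple's fields
  fields.map(&:upcase).join("\t")   # one output tuple per printed line
end
```

Any executable that follows this line-in/line-out protocol can be plugged in with DEFINE ... SHIP(...), which is all the wukong mapper below does, with error handling layered on top.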
The Ruby mapper, parse_click.rb, is as follows:
require 'wukong'
require 'json'
require './click_tra.rb'

module ParseClick
  class Mapper < Wukong::Streamer::RecordStreamer
    def before_stream
      @bad_count = 0
    end

    def after_stream
      raise RuntimeError, "Exceeded bad records: #{@bad_count}" if @bad_count > 10
    end

    def process(*records)
      yield ClickTra.new(JSON.parse(records[2])).to_a
    rescue => e
      @bad_count += 1
      warn "Bad record #{e}: #{records[2]}"
    end
  end
end

Wukong.run ParseClick::Mapper, nil
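The wukong mapper needs Hadoop (or wukong's local runner) to exercise, so a plain-Ruby sketch of the same per-record transform is handy for testing. The field layout (JSON payload in the third tab-separated field) mirrors the mapper's records[2] access; the JSON keys browser_ip and click_date are assumptions based on the accessors used in click_tra.rb below.

```ruby
require 'json'

# Plain-Ruby sketch of the mapper's per-record work, testable without
# Hadoop or wukong. The JSON sits in the third tab-separated field, as
# with records[2] above; the extracted keys are illustrative assumptions.
def parse_click_line(line, stats = { bad: 0 })
  fields = line.chomp.split("\t")
  click  = JSON.parse(fields[2])
  [click['browser_ip'], click['click_date']].join("\t")
rescue StandardError => e
  stats[:bad] += 1                 # mirrors @bad_count in the mapper
  warn "Bad record #{e}: #{fields && fields[2]}"
  nil
end
```

As in the mapper, a malformed record is counted and skipped rather than aborting the whole stream.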
click_tra.rb is as follows:
require 'date'
require './models.rb'

class ClickTra
  output :ip
  output :c_date
  # output your other attributes

  def c_date
    click_date.strftime("%Y%m%d").to_i
  end

  def ip
    browser_ip.to_i
  end
end
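models.rb itself is not shown in the article. Purely to make the c_date conversion concrete, here is a hypothetical stand-in defining the two accessors the class above relies on; only the attribute names click_date and browser_ip are grounded in the code above, the rest is illustrative.

```ruby
require 'date'

# Hypothetical stand-in for the model behind click_tra.rb, providing the
# click_date / browser_ip accessors that c_date and ip above depend on.
class ClickRecord
  attr_reader :click_date, :browser_ip

  def initialize(attrs)
    @click_date = Date.parse(attrs.fetch('click_date'))
    @browser_ip = attrs.fetch('browser_ip')
  end

  # Same conversion as c_date above: 2014-01-02 -> 20140102
  def c_date
    click_date.strftime('%Y%m%d').to_i
  end
end
```

Note that the article's ip method calls browser_ip.to_i, which only makes sense if browser_ip already holds an integer-encoded address; a dotted string such as "1.2.3.4" would truncate to 1.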
Where:
strmo = STREAM log THROUGH tra_parser; streams the log relation through the external program defined as tra_parser.
Wukong.run ParseClick::Mapper, nil runs the mapper; its output is handed back to Pig on stdout.
STORE strmo INTO '$OUTFILE' USING PigStorage('\t'); persists the results.