Processing streamed files with Ruby and Pig: an example

Source: Internet
Author: User

Large-scale data work often includes a data-cleaning step, which is more convenient to handle in a script. The following simple example shows Pig loading a file from HDFS, calling a Ruby script to process the data, and then receiving the resulting data stream back for further processing in Pig.

Note: the Ruby streaming here uses the wukong gem, which can be downloaded from:
https://github.com/mrflip/wukong

Pig script that loads the distributed file and invokes the Ruby streaming script:

log = LOAD '$INFILE' USING PigStorage('\t');

-- Ship the Ruby files to the task nodes; the streaming command goes in backticks.
DEFINE tra_parser `/usr/ruby parse_click.rb --map` SHIP('parse_click.rb', 'click_tra.rb');

strmo = STREAM log THROUGH tra_parser;

STORE strmo INTO '$OUTFILE' USING PigStorage('\t');

The Ruby script parse_click.rb is as follows:

require 'wukong'
require 'json'
require './click_tra.rb'

module ParseClick
  class Mapper < Wukong::Streamer::RecordStreamer
    # Reset the bad-record counter before any input is read.
    def before_stream
      @bad_count = 0
    end

    # Fail the task if too many records could not be parsed.
    def after_stream
      raise RuntimeError, "exceeded bad records: #{@bad_count}" if @bad_count > 10
    end

    # The third field of each input record holds the JSON payload.
    def process(*records)
      yield ClickTra.new(JSON.parse(records[2])).to_a
    rescue => e
      @bad_count += 1
      warn "bad record #{e}: #{records[2]}"
    end
  end
end

Wukong.run ParseClick::Mapper, nil
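To look at the bad-record accounting in isolation, here is a minimal sketch in plain Ruby that mimics the before_stream / process / after_stream lifecycle without Wukong or Hadoop. The class name LocalMapper, the run_lines helper, and the sample rows are hypothetical and exist only for illustration; the threshold of 10 is taken from the script above.

```ruby
require 'json'

# Hypothetical stand-in for the mapper: same three hooks, no Wukong dependency.
class LocalMapper
  MAX_BAD = 10 # threshold used in the script above

  attr_reader :bad_count

  def before_stream
    @bad_count = 0
  end

  # Parses the third tab-separated field as JSON and yields the parsed hash.
  def process(*fields)
    yield JSON.parse(fields[2])
  rescue StandardError => e
    @bad_count += 1
    warn "bad record #{e.class}: #{fields[2].inspect}"
  end

  def after_stream
    raise "exceeded bad records: #{@bad_count}" if @bad_count > MAX_BAD
  end

  # Drives the three hooks over an array of tab-separated lines.
  def run_lines(lines)
    out = []
    before_stream
    lines.each { |line| process(*line.chomp.split("\t")) { |rec| out << rec } }
    after_stream
    out
  end
end

lines = [
  "a\tb\t{\"browser_ip\":\"1\"}", # valid JSON in the third field
  "a\tb\tnot json"                # malformed payload, counted as bad
]
mapper = LocalMapper.new
records = mapper.run_lines(lines)
puts records.length
puts mapper.bad_count
```

Note that the counter only advances because of `+=`; a bare `@bad_count + 1` computes a value and discards it, so the threshold check would never fire.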

The click_tra.rb file required by the script is as follows:

require 'date'
require './models.rb'

class ClickTra

  output :ip
  output :c_date
  # output your other attributes

  def c_date
    click_date.strftime("%y%m%d").to_i
  end

  def ip
    browser_ip.to_i
  end

end
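The article does not show models.rb, so how output and to_a actually work is not visible here. One plausible shape, offered purely as an assumption, is a class-level output macro that records attribute names and a to_a method that calls each of them in declaration order. The module name OutputFields, the class SampleClick, and its constructor and field values are all hypothetical.

```ruby
require 'date'

# Hypothetical sketch of what models.rb might provide; not the article's code.
module OutputFields
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Registers a method name whose return value becomes one output column.
    def output(name)
      output_fields << name
    end

    def output_fields
      @output_fields ||= []
    end
  end

  # Returns the registered columns in declaration order.
  def to_a
    self.class.output_fields.map { |name| public_send(name) }
  end
end

class SampleClick
  include OutputFields

  output :ip
  output :c_date

  # attrs is a hash such as the one returned by JSON.parse (assumed keys).
  def initialize(attrs)
    @browser_ip = attrs['browser_ip']
    @click_date = Date.parse(attrs['click_date'])
  end

  def ip
    @browser_ip.to_i
  end

  def c_date
    @click_date.strftime('%y%m%d').to_i
  end
end

row = SampleClick.new('browser_ip' => '167772161', 'click_date' => '2013-05-17').to_a
puts row.join("\t")
```

With a layout like this, adding a column means adding one output line and one method, and the column order in Pig follows the order of the output declarations.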

In the code above:

strmo = STREAM log THROUGH tra_parser; calls the external program defined as tra_parser to process the log relation.
Wukong.run ParseClick::Mapper, nil runs the mapper (nil means no reducer is supplied); when it finishes, the Ruby output is passed back to Pig.
STORE strmo INTO '$OUTFILE' USING PigStorage('\t'); persists the result to storage.
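The hand-off between Pig and the external program is line-oriented. With Pig's default streaming serialization, each tuple is written to the program's standard input as one tab-delimited line, and each line the program writes to standard output is read back as a tuple. The following minimal sketch shows that contract in plain Ruby; the helper name stream_rows and the sample rows are hypothetical, and StringIO stands in for the real standard streams so the example can run on its own.

```ruby
require 'stringio'

# Reads tab-separated rows from `input`, yields each row's fields to the block,
# and writes the array returned by the block as one tab-separated row on `output`.
def stream_rows(input, output)
  input.each_line do |line|
    fields = line.chomp.split("\t", -1) # -1 keeps trailing empty fields
    result = yield(fields)
    output.puts(result.join("\t")) if result
  end
end

# In a real streaming job this would be: stream_rows($stdin, $stdout) { ... }
input  = StringIO.new("u1\t2013-05-17\tclick\nu2\t2013-05-18\tview\n")
output = StringIO.new
stream_rows(input, output) { |user, date, _event| [user, date.delete('-')] }
puts output.string
```

Returning nil from the block drops the row, which is one simple way to filter records during cleaning.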
