Processing streamed files with Ruby and Pig: a worked example

Source: Internet
Author: User

In big data work, data cleansing steps are often easier to write as scripts. The following simple example shows how Pig loads files from HDFS, streams the records through a Ruby script for processing, and receives the processed data stream back in Pig.

Note: The Ruby stream processing uses the wukong gem. Download the package from:
https://github.com/mrflip/wukong

The Pig script loads the distributed file and streams each record through the Ruby script:

log = load '$INFILE' using PigStorage('\t');

-- The backticks delimit the external command line; ship() copies the listed
-- scripts to the task nodes. Adjust the interpreter path to your cluster.
define tra_parser `/usr/ruby parse_click.rb --map` ship('parse_click.rb', 'click_tra.rb');

strmo = stream log through tra_parser;

store strmo into '$OUTFILE' using PigStorage('\t');

parse_click.rb, the Ruby mapper that Pig streams the records through:

require 'wukong'
require 'json'
require './click_tra.rb'

module ParseClick
  class Mapper < Wukong::Streamer::RecordStreamer
    # Reset the bad-record counter before the first record arrives.
    def before_stream
      @bad_count = 0
    end

    # Fail the task if more than 10 records could not be parsed.
    def after_stream
      raise RuntimeError, "Exceeded bad records: #{@bad_count}" if @bad_count > 10
    end

    # The third field of each record holds the click as JSON.
    def process *records
      yield ClickTra.new(JSON.parse(records[2])).to_a
    rescue => e
      @bad_count += 1
      warn "Bad record #{e}: #{records[2]}"
    end
  end
end

Wukong.run ParseClick::Mapper, nil
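
To make the streaming contract behind the mapper easier to see, here is a minimal sketch that does the same job with only the Ruby standard library. It is not the wukong implementation: the class name PlainClickMapper and the JSON keys 'ip' and 'date' are assumptions made for illustration. It reads tab-separated records, parses the third field as JSON, writes tab-separated output, and counts bad records against the same threshold of 10.

```ruby
require 'json'
require 'stringio'

# Hypothetical stand-in for the wukong mapper (not part of the original code).
class PlainClickMapper
  MAX_BAD_RECORDS = 10 # same threshold as the mapper above

  attr_reader :bad_count

  def initialize(input, output, errors = $stderr)
    @input = input
    @output = output
    @errors = errors
    @bad_count = 0
  end

  # Process every line, then fail if too many records were unparseable.
  def run
    @input.each_line do |line|
      process(line.chomp.split("\t", -1))
    end
    raise "Exceeded bad records: #{@bad_count}" if @bad_count > MAX_BAD_RECORDS
  end

  private

  def process(fields)
    click = JSON.parse(fields[2])
    # 'ip' and 'date' are assumed key names, chosen only for this sketch.
    @output.puts [click.fetch('ip'), click.fetch('date')].join("\t")
  rescue StandardError => e
    @bad_count += 1
    @errors.puts "Bad record #{e.class}: #{fields[2]}"
  end
end

# Small in-memory demonstration: one good record and one bad record.
input = StringIO.new("a\tb\t{\"ip\":\"1.2.3.4\",\"date\":\"2013-05-01\"}\nx\ty\tnot json\n")
out = StringIO.new
err = StringIO.new
mapper = PlainClickMapper.new(input, out, err)
mapper.run
puts out.string
puts "bad records: #{mapper.bad_count}"
```

Passing $stdin and $stdout instead of the StringIO objects would turn the class into a script that can sit behind a Pig stream statement.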

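click_tra.rb, the record model used by the mapper:

require 'date'
require './models.rb'

class ClickTra

  # Fields emitted for each record, in this order.
  output :ip
  output :c_date
  # output your other attributes

  # Click date as an integer in YYYYMMDD form.
  def c_date
    click_date.strftime("%Y%m%d").to_i
  end

  # Browser IP address as an integer.
  def ip
    browser_ip.to_i
  end

end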
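
The article does not show models.rb, so how output and to_a work there is unknown. The sketch below is one plausible way such a class macro could be built, offered purely as an assumption: the module OutputFields, the class SampleClick, and the input keys 'browser_ip' and 'click_date' are all hypothetical names. Each output call records an attribute name, and to_a calls those attributes in order to build the row handed back to Pig.

```ruby
require 'date'
require 'ipaddr'

# Hypothetical helper, NOT the real models.rb: `output` records attribute
# names in declaration order, and `to_a` evaluates them to build one row.
module OutputFields
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    def output(name)
      output_fields << name
    end

    def output_fields
      @output_fields ||= []
    end
  end

  def to_a
    self.class.output_fields.map { |name| public_send(name) }
  end
end

# Hypothetical model mirroring the shape of the class above.
class SampleClick
  include OutputFields

  output :ip
  output :c_date

  def initialize(attrs)
    # Assumed input keys; the real attribute names are not shown in the article.
    @browser_ip = IPAddr.new(attrs['browser_ip'])
    @click_date = Date.parse(attrs['click_date'])
  end

  # Click date as an integer in YYYYMMDD form.
  def c_date
    @click_date.strftime('%Y%m%d').to_i
  end

  # IP address as an integer.
  def ip
    @browser_ip.to_i
  end
end

row = SampleClick.new('browser_ip' => '192.168.1.1', 'click_date' => '2013-05-01').to_a
puts row.join("\t")
```

Keeping the field list in one place means the column order written to Pig follows the order of the output declarations.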

Where:

strmo = stream log through tra_parser;
    sends the records of log through the external program defined as tra_parser.

Wukong.run ParseClick::Mapper, nil
    runs the mapper; its output is passed back to Pig as the result of the stream.

store strmo into '$OUTFILE' using PigStorage('\t');
    stores the result persistently.
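
Before submitting the job, it can help to check the stdin/stdout contract locally. The sketch below approximates what a stream statement does: it writes tab-separated lines to a command's standard input and reads tab-separated lines back. The helper name stream_through is an assumption, and the child process is a stand-in one-liner rather than the real parse_click.rb.

```ruby
require 'open3'
require 'rbconfig'

# Feeds tab-separated rows to a command on stdin and returns the rows it
# prints on stdout (hypothetical helper for local testing only).
def stream_through(command, rows)
  input = rows.map { |fields| fields.join("\t") }.join("\n") + "\n"
  output, status = Open3.capture2(*command, stdin_data: input)
  raise "command failed: #{status.exitstatus}" unless status.success?

  output.each_line.map { |line| line.chomp.split("\t", -1) }
end

# Stand-in child process that upper-cases the third field of every record.
child_script = 'STDIN.each_line { |l| f = l.chomp.split("\t", -1); f[2] = f[2].upcase; puts f.join("\t") }'
child = [RbConfig.ruby, '-e', child_script]

result = stream_through(child, [%w[a b click], %w[c d view]])
result.each { |fields| puts fields.join("\t") }
```

Replacing child with [RbConfig.ruby, 'parse_click.rb', '--map'] would exercise the real script, assuming wukong is installed locally.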
