Apache DataFu: LinkedIn's Open-Source Pig UDF Library


Apache DataFu is introduced in two parts; this article covers its Pig UDF part. The code is open source on GitHub (besides the code, there are also links to some introductory slides).
DataFu contains a collection of Pig UDFs. They mainly fall into these areas:

bags, geo, hash, linkanalysis, random, sampling, sessions, sets, stats, urls

Each area corresponds to one package.

I browsed through the source code of all the functions. The official documentation explains how to use these UDFs, and the comments in the source code are also very clear, so using them is easy.
In terms of implementation, they all derive from Pig's EvalFunc hierarchy. Pig's builtin functions include algebraic functions (AVG, COUNT, Distinct, TOP), algebraic math functions (MAX, SUM), basic math functions (SIN, COS, TAN, FLOOR, LOG), and so on.

DataFu implements a SimpleEvalFunc abstract class that extends and wraps EvalFunc. It simplifies the implementation of simple UDFs: the subclass can omit some exception checking and focus on the processing logic.

Using reflection, the exec() method performs the null check and the count check on the arguments, then passes them to the call() method implemented by the subclass and returns the result.
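The dispatch described above can be sketched in plain Java without a Pig dependency. This is a minimal illustration, not DataFu's source: the names SimpleFuncSketch and Concat are hypothetical, and an Object array stands in for Pig's Tuple.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

// Hypothetical stand-in for SimpleEvalFunc: finds the subclass's public
// call() method by reflection and validates arguments before invoking it.
abstract class SimpleFuncSketch<T> {
    private final Method callMethod;

    protected SimpleFuncSketch() {
        Method found = null;
        for (Method m : getClass().getMethods()) {
            if (m.getName().equals("call")) {
                found = m;
                break;
            }
        }
        if (found == null) {
            throw new IllegalStateException("subclass must define a public call() method");
        }
        found.setAccessible(true);
        callMethod = found;
    }

    @SuppressWarnings("unchecked")
    public T exec(Object... args) {
        // Null check: missing input produces a null result.
        if (args == null || args.length == 0) {
            return null;
        }
        // Count check: the argument count must match call()'s parameter list.
        int expected = callMethod.getParameterCount();
        if (args.length != expected) {
            throw new IllegalArgumentException(
                "expected " + expected + " arguments, got " + args.length);
        }
        try {
            return (T) callMethod.invoke(this, args);
        } catch (IllegalAccessException | InvocationTargetException e) {
            throw new RuntimeException("call() failed", e);
        }
    }
}

// The subclass contains only the processing logic.
class Concat extends SimpleFuncSketch<String> {
    public String call(String a, String b) {
        return a + b;
    }
}
```

With this sketch, new Concat().exec("data", "fu") is routed to call("data", "fu"), while a call with the wrong number of arguments fails before the subclass code runs.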

The subclass hierarchy of SimpleEvalFunc is as follows:

[Figure: SimpleEvalFunc subclass hierarchy; image not available]

The following is a brief overview of the functions in each package.


Bags

Basic bag operations: append, concat, group, left join, split, count, and so on.


Geo

Distance calculation between latitude/longitude coordinates.
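As an illustration of a latitude/longitude distance calculation, here is a minimal sketch that assumes the haversine great-circle formula. The class name HaversineSketch, the kilometre unit, and the 6371 km mean Earth radius are assumptions for the example, not details taken from the article.

```java
// Hypothetical helper: great-circle distance via the haversine formula.
class HaversineSketch {
    static final double EARTH_RADIUS_KM = 6371.0; // assumed mean Earth radius

    // All four arguments are in decimal degrees; the result is in kilometres.
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }
}
```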


Hash

Computes the MD5 or SHA digest of the input string.
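The same kind of string hashing can be sketched with the JDK's MessageDigest class. The class name HashSketch, the UTF-8 encoding, and the lowercase hex output format are assumptions for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: hex digest of a string for a named algorithm,
// for example "MD5", "SHA-1" or "SHA-256".
class HashSketch {
    static String hexDigest(String algorithm, String input) {
        try {
            MessageDigest md = MessageDigest.getInstance(algorithm);
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b)); // two hex digits per byte
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("unknown algorithm: " + algorithm, e);
        }
    }
}
```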


Linkanalysis

An implementation of PageRank.
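For readers unfamiliar with PageRank, here is a minimal power-iteration sketch on an adjacency list. It is not DataFu's implementation: the class name PageRankSketch, the fixed iteration count, and the even redistribution of rank from nodes without outgoing links are assumptions for the example.

```java
import java.util.Arrays;

// Hypothetical helper: PageRank by power iteration.
class PageRankSketch {
    // links[i] lists the nodes that node i links to.
    static double[] pageRank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // start from a uniform distribution
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double dangling = 0.0; // rank held by nodes with no outgoing links
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    dangling += rank[i];
                } else {
                    double share = rank[i] / links[i].length;
                    for (int target : links[i]) {
                        next[target] += share;
                    }
                }
            }
            for (int i = 0; i < n; i++) {
                // Teleport term plus damped incoming rank; dangling rank is
                // spread evenly so that the ranks keep summing to 1.
                next[i] = (1 - damping) / n + damping * (next[i] + dangling / n);
            }
            rank = next;
        }
        return rank;
    }
}
```

A damping factor of 0.85 is a common choice when calling such a function.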


Random

There is only one function, RandInt. Given two values, it returns a random integer between them.


Sampling

SimpleRandomSample and ReservoirSample. The reservoir of the latter is a PriorityQueue whose entries are ScoredTuple objects. The two differ in whether the size of the sample result is unbounded (SimpleRandomSample) or bounded (ReservoirSample).
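The contrast between a bounded and an unbounded sample can be sketched as follows. This is an illustration under assumptions, not DataFu's code: the names SamplingSketch, Scored, reservoir and bernoulli are hypothetical, and the reservoir keeps the k items with the highest uniform random scores, which yields a uniform sample of size k.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

class SamplingSketch {
    // Hypothetical analogue of a scored tuple: an item plus a random score.
    static final class Scored<T> {
        final double score;
        final T item;
        Scored(double score, T item) { this.score = score; this.item = item; }
    }

    // Bounded sample: at most k items, whatever the input size.
    static <T> List<T> reservoir(Iterable<T> items, int k, Random rng) {
        // Min-heap on score: the head is the weakest of the current candidates.
        PriorityQueue<Scored<T>> heap =
            new PriorityQueue<>((a, b) -> Double.compare(a.score, b.score));
        for (T item : items) {
            Scored<T> candidate = new Scored<>(rng.nextDouble(), item);
            if (heap.size() < k) {
                heap.add(candidate);
            } else if (k > 0 && candidate.score > heap.peek().score) {
                heap.poll(); // evict the lowest-scored entry
                heap.add(candidate);
            }
        }
        List<T> result = new ArrayList<>();
        for (Scored<T> s : heap) {
            result.add(s.item);
        }
        return result;
    }

    // Unbounded sample: each item is kept independently with probability p,
    // so the result size grows with the input size.
    static <T> List<T> bernoulli(Iterable<T> items, double p, Random rng) {
        List<T> result = new ArrayList<>();
        for (T item : items) {
            if (rng.nextDouble() < p) {
                result.add(item);
            }
        }
        return result;
    }
}
```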


Sessions

Groups events into sessions by a time window.
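One common way to group by a time window is to start a new session whenever the gap between consecutive events exceeds a timeout. The sketch below assumes that rule and assumes the timestamps are already sorted; the class and method names are hypothetical.

```java
// Hypothetical helper: assigns a session index to each event timestamp.
class SessionizeSketch {
    // sortedTimestampsMs must be in ascending order (milliseconds).
    static int[] sessionIds(long[] sortedTimestampsMs, long timeoutMs) {
        int[] ids = new int[sortedTimestampsMs.length];
        int current = 0;
        for (int i = 0; i < sortedTimestampsMs.length; i++) {
            // A gap longer than the timeout closes the current session.
            if (i > 0 && sortedTimestampsMs[i] - sortedTimestampsMs[i - 1] > timeoutMs) {
                current++;
            }
            ids[i] = current;
        }
        return ids;
    }
}
```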


Sets

Set difference, intersection, and union.

The tuples inside the input bags must be sorted.
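Sorted input matters because it allows a single merge-style pass over both inputs instead of building a hash set. A minimal sketch of that idea, with lists standing in for bags and hypothetical names throughout:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: set operations over two ascending-sorted lists.
class SortedSetOpsSketch {
    // Elements present in both a and b.
    static <T extends Comparable<T>> List<T> intersect(List<T> a, List<T> b) {
        List<T> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp < 0) {
                i++;            // a is behind, advance it
            } else if (cmp > 0) {
                j++;            // b is behind, advance it
            } else {
                out.add(a.get(i));
                i++;
                j++;
            }
        }
        return out;
    }

    // Elements of a that do not appear in b.
    static <T extends Comparable<T>> List<T> difference(List<T> a, List<T> b) {
        List<T> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size()) {
            if (j >= b.size()) {
                out.add(a.get(i++)); // b is exhausted, keep the rest of a
                continue;
            }
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp < 0) {
                out.add(a.get(i++));
            } else if (cmp > 0) {
                j++;
            } else {
                i++; // match found in b, drop this element of a
            }
        }
        return out;
    }
}
```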


Stats

Statistics-related functions:

Two ways to compute quantiles, one of which is streaming. Quantiles include the median.

Variance.
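To make these two statistics concrete, here is a minimal sketch. It is not DataFu's algorithm: the quantile uses the nearest-rank definition over values that are already sorted, the variance is the population variance computed in one pass with Welford's method, and the class name StatsSketch is hypothetical. An approximate streaming quantile is considerably more involved and is not shown.

```java
// Hypothetical helper: simple quantile and variance calculations.
class StatsSketch {
    // Nearest-rank quantile over ascending-sorted values; q is in [0, 1].
    // q = 0.5 gives the median under this definition.
    static double quantile(double[] sorted, double q) {
        if (sorted.length == 0) {
            throw new IllegalArgumentException("empty input");
        }
        int idx = (int) Math.ceil(q * sorted.length) - 1;
        return sorted[Math.max(0, Math.min(sorted.length - 1, idx))];
    }

    // Population variance in a single pass (Welford's online algorithm).
    static double variance(double[] values) {
        double mean = 0.0;
        double m2 = 0.0; // running sum of squared deviations from the mean
        int n = 0;
        for (double v : values) {
            n++;
            double delta = v - mean;
            mean += delta / n;
            m2 += delta * (v - mean);
        }
        return n == 0 ? Double.NaN : m2 / n;
    }
}
```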


URLs

Used to classify the source of a user agent (PC or mobile phone, and which mobile operating system).




End of article :)
