Apache DataFu has two parts; this article covers its Pig UDF part. The code is open source on GitHub (in addition to the code, there are also links to some introductory slides).
DataFu contains a collection of Pig UDFs. The functions mainly cover these areas:
bags, geo, hash, linkanalysis, random, sampling, sessions, sets, stats, urls
Each area has its own package.
I browsed through the source code of all the functions. The official documentation explains how to use these UDFs, and the comments in the source code are also very clear, so using them is easy.
From an implementation perspective, the UDFs inherit from Pig's EvalFunc hierarchy. Pig's builtin functions include algebraic functions (AVG, COUNT, Distinct, TOP), algebraic math functions (MAX, SUM), basic math functions (SIN, COS, TAN, FLOOR, LOG), and so on.
DataFu implements a SimpleEvalFunc abstract class that inherits from EvalFunc and wraps it, simplifying the implementation of simple UDFs (it takes care of some of the exception checking, so the subclass can focus on the processing logic).
Using reflection, the exec() method checks the parameters for null, checks the number of parameters, and finally passes them to the call() method implemented by the subclass and returns the result.
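To make the exec()/call() split concrete, here is a minimal sketch of the pattern in plain Java. It does not use Pig's API: SimpleFuncSketch and AddSketch are hypothetical names, arguments arrive as a plain Object array instead of a Pig Tuple, and returning null when any argument is null is my assumption rather than DataFu's documented behavior.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

// Hypothetical stand-in for a SimpleEvalFunc-style base class (no Pig dependency).
abstract class SimpleFuncSketch {
    private final Method callMethod;

    protected SimpleFuncSketch() {
        // Look up the subclass's public call(...) method once, via reflection.
        Method found = null;
        for (Method m : getClass().getMethods()) {
            if (m.getName().equals("call")) {
                found = m;
                break;
            }
        }
        if (found == null) {
            throw new IllegalStateException(
                getClass().getName() + " must define a public call method");
        }
        found.setAccessible(true);
        callMethod = found;
    }

    // Plays the role of exec(): validate the arguments, then delegate to call().
    public Object exec(Object... args) {
        if (args == null) {
            return null;
        }
        int expected = callMethod.getParameterCount();
        if (args.length != expected) {
            throw new IllegalArgumentException(
                "expected " + expected + " arguments, got " + args.length);
        }
        for (Object arg : args) {
            if (arg == null) {
                return null; // assumption of this sketch: null argument gives null result
            }
        }
        try {
            return callMethod.invoke(this, args);
        } catch (IllegalAccessException | InvocationTargetException e) {
            throw new RuntimeException(e);
        }
    }
}

// The subclass only contains the processing logic.
class AddSketch extends SimpleFuncSketch {
    public Integer call(Integer a, Integer b) {
        return a + b;
    }
}
```

With this in place, new AddSketch().exec(2, 3) is dispatched to call(2, 3), while a call with the wrong number of arguments fails before the subclass code runs.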
The subclass hierarchy of SimpleEvalFunc is as follows:
[Figure: SimpleEvalFunc subclass hierarchy]
The following is a brief overview of the functions included in each package.
Bags
Basic bag operations, including append, concat, group, left join, split, count, etc.
Geo
Distance calculation from latitude and longitude.
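A distance between two latitude/longitude points is usually computed with the haversine (great-circle) formula. The sketch below assumes that formula and a mean Earth radius of 6371 km; HaversineSketch is a hypothetical name, and I have not checked it against the units or constants DataFu itself uses.

```java
class HaversineSketch {
    // Mean Earth radius in kilometers (an approximation; the Earth is not a perfect sphere).
    static final double EARTH_RADIUS_KM = 6371.0;

    // Great-circle distance in kilometers between two points given in degrees.
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double sinLat = Math.sin(dLat / 2);
        double sinLon = Math.sin(dLon / 2);
        double a = sinLat * sinLat
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * sinLon * sinLon;
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }
}
```

One degree of longitude along the equator comes out at roughly 111 km with these constants.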
Hash
Converts the input string to an MD5 or SHA hash.
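Hashing a string this way takes only a few lines with the JDK's MessageDigest. This is a sketch of the idea, not DataFu's code: HashSketch and hexDigest are hypothetical names, and UTF-8 input with lowercase hex output are my assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class HashSketch {
    // Returns the lowercase hex digest of the input for an algorithm such as "MD5" or "SHA-256".
    static String hexDigest(String algorithm, String input) {
        try {
            MessageDigest md = MessageDigest.getInstance(algorithm);
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff)); // mask to treat the byte as unsigned
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("unknown algorithm: " + algorithm, e);
        }
    }
}
```

For example, hexDigest("MD5", "hello") yields 5d41402abc4b2a76b9719d911017c592.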
Linkanalysis
An implementation of PageRank.
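For readers who have not seen PageRank in code, here is a small in-memory power-iteration sketch. It is the textbook algorithm, not DataFu's implementation: PageRankSketch is a hypothetical name, the graph is a plain adjacency array, and spreading the rank of dangling nodes evenly over all nodes is one common convention that I am assuming here.

```java
import java.util.Arrays;

class PageRankSketch {
    // links[i] lists the nodes that node i points to.
    static double[] rank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // start from a uniform distribution
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double dangling = 0.0; // rank held by nodes without outgoing links
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    dangling += rank[i];
                    continue;
                }
                double share = rank[i] / links[i].length;
                for (int target : links[i]) {
                    next[target] += share;
                }
            }
            for (int i = 0; i < n; i++) {
                next[i] = (1 - damping) / n + damping * (next[i] + dangling / n);
            }
            rank = next;
        }
        return rank;
    }
}
```

A damping factor of 0.85 is the value usually quoted; the ranks always sum to 1, so they can be read as a probability distribution over nodes.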
Random
There is only one function, RandInt. Given two values, it outputs a random integer between them.
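The behavior is easy to picture with a small sketch. RandIntSketch is a hypothetical name, and treating both bounds as inclusive is my assumption; check the official documentation for the exact contract.

```java
import java.util.Random;

class RandIntSketch {
    // Returns a uniformly distributed integer in [min, max], both ends inclusive (an assumption).
    static int randInt(Random rng, int min, int max) {
        if (min > max) {
            throw new IllegalArgumentException("min must not exceed max");
        }
        long span = (long) max - (long) min + 1; // computed as long to avoid int overflow
        long offset = (long) (rng.nextDouble() * span); // falls in [0, span - 1]
        return (int) (min + offset);
    }
}
```

Passing the Random in explicitly keeps the sketch testable with a fixed seed.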
Sampling
SimpleRandomSample and ReservoirSample. The reservoir of the latter is a PriorityQueue whose elements are ScoredTuple objects. The difference between the two is that the former produces an unbounded sample while the latter produces a bounded one.
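A reservoir backed by a priority queue can be sketched as follows: every item gets a random score, and a min-heap keeps only the k highest-scored items, so the sample never grows beyond k. ReservoirSketch and Scored are hypothetical names standing in for the real classes, and the scoring details are my assumptions, not a copy of DataFu's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

class ReservoirSketch {
    // Hypothetical counterpart of a scored tuple: an item plus its random score.
    static final class Scored<T> {
        final T item;
        final double score;

        Scored(T item, double score) {
            this.item = item;
            this.score = score;
        }
    }

    // Returns up to k items chosen uniformly at random from the input.
    static <T> List<T> sample(Iterable<T> items, int k, Random rng) {
        List<T> out = new ArrayList<>();
        if (k <= 0) {
            return out;
        }
        // Min-heap on score, so the lowest-scored element is the one evicted.
        PriorityQueue<Scored<T>> heap =
            new PriorityQueue<>((a, b) -> Double.compare(a.score, b.score));
        for (T item : items) {
            Scored<T> candidate = new Scored<>(item, rng.nextDouble());
            if (heap.size() < k) {
                heap.add(candidate);
            } else if (heap.peek().score < candidate.score) {
                heap.poll();
                heap.add(candidate);
            }
        }
        for (Scored<T> s : heap) {
            out.add(s.item);
        }
        return out;
    }
}
```

Because each item's score is independent of its position in the input, the k survivors form a uniform sample, and memory use stays bounded by k regardless of input size.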
Sessions
Groups records by time window.
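One common way to group by time window is gap-based sessionization: an event belongs to the current session as long as it arrives within a given interval of the previous event. The sketch below assumes that reading; SessionizeSketch is a hypothetical name, timestamps are plain milliseconds, and the input must already be sorted.

```java
class SessionizeSketch {
    // Assigns a session index to each timestamp; the input must be sorted ascending.
    static int[] sessionIds(long[] sortedMillis, long maxGapMillis) {
        int[] ids = new int[sortedMillis.length];
        int current = 0;
        for (int i = 0; i < sortedMillis.length; i++) {
            // A gap longer than the window starts a new session.
            if (i > 0 && sortedMillis[i] - sortedMillis[i - 1] > maxGapMillis) {
                current++;
            }
            ids[i] = current;
        }
        return ids;
    }
}
```

With a 10-second window, events at 0 s, 1 s and 2 s share one session, and an event at 100 s opens the next.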
Sets
Set difference, intersection, and union.
The tuples inside each bag must be sorted.
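The ordering requirement makes sense once the operations are written as a merge: with sorted inputs, two cursors can walk both collections in a single pass. The sketch below shows this for intersection and difference on sorted int arrays without duplicates; SortedSetOpsSketch is a hypothetical name and this is not DataFu's code.

```java
import java.util.ArrayList;
import java.util.List;

class SortedSetOpsSketch {
    // Elements present in both sorted arrays.
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;
            } else if (a[i] > b[j]) {
                j++;
            } else {
                out.add(a[i]);
                i++;
                j++;
            }
        }
        return out;
    }

    // Elements of sorted array a that do not appear in sorted array b.
    static List<Integer> difference(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length) {
            if (j >= b.length || a[i] < b[j]) {
                out.add(a[i]);
                i++;
            } else if (a[i] > b[j]) {
                j++;
            } else {
                i++;
                j++;
            }
        }
        return out;
    }
}
```

If the inputs were not sorted, advancing a cursor could skip past a matching element, which is why the precondition matters.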
Stats
Statistics-related methods:
Two ways to calculate quantiles, one of which is streaming. The quantiles include the median.
Variance.
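To illustrate the two kinds of statistics, the sketch below computes an exact quantile by sorting and a variance in a single streaming pass (Welford's method). StatsSketch is a hypothetical name; the nearest-rank quantile definition and the population (divide by n) variance are my assumptions, and DataFu's own functions may use different conventions.

```java
import java.util.Arrays;

class StatsSketch {
    // Exact quantile by the nearest-rank method; q = 0.5 gives the median.
    static double quantile(double[] values, double q) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no values");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int index = (int) Math.ceil(q * sorted.length) - 1;
        return sorted[Math.max(0, Math.min(sorted.length - 1, index))];
    }

    // Population variance in one pass (Welford's method), without storing the data.
    static double populationVariance(double[] values) {
        double mean = 0.0;
        double m2 = 0.0; // running sum of squared deviations from the mean
        int n = 0;
        for (double x : values) {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        }
        return n == 0 ? Double.NaN : m2 / n;
    }
}
```

The sort-based quantile needs all the data at once, while the variance loop only keeps three numbers of state, which is the trade-off a streaming method makes.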
URLs
Used to classify the source of a user agent (PC or phone, and which mobile operating system).
End of article :)
Apache DataFu: LinkedIn's Open-Source Pig UDF Library