Preliminary study on Surus

Source: Internet
Author: User
Tags: pmml

I. Overview

Surus is a collection of open source UDFs from Netflix, a data analysis tool based on Pig and Hive.

Problems it solves

Surus functions can address a variety of problems, such as scoring predictive models, anomaly detection, and pattern matching. Surus can also serve as an auxiliary tool that extends big data analysis capabilities.

II. System architecture

The currently open-sourced UDFs provide two main functions: ScorePMML and Robust Anomaly Detection (RAD).

ScorePMML

ScorePMML is a Pig-based implementation that uses the Predictive Model Markup Language (PMML) to score predictive models efficiently.

Robust Anomaly Detection (RAD)

RAD is an anomaly detection method based on robust principal component analysis (robust PCA).

III. Functions in detail

Theory + code

1, ScorePMML

Predictive models are ubiquitous and their applications differ, but the process of creating and deploying them is much the same: an engineer starts from an idea, builds a model on a small data set, and then scales the model up to see whether it holds on big data.

PMML (Predictive Model Markup Language) is an open standard for representing predictive models. ScorePMML uses PMML as a standard to counter the proliferation of custom scoring methods in Hadoop environments.

Code entry: /surus-master/src/main/java/org/surus/pig/ScorePMML.java

PMML provides an efficient, basic modeling approach that supports rapid iteration. Each step in the modeling process uses the same PMML representation, which saves time and money by reducing the risk and cost of custom code.

2, Robust Anomaly Detection (RAD)

RAD is mainly used for anomaly detection, especially on large data sets. RAD uses the Robust Principal Component Analysis (RPCA) algorithm to detect anomalies. RPCA computes the SVD (singular value decomposition) iteratively and applies thresholds to the singular values and the error at each iteration, which identifies a low-rank representation, random noise, and a set of outliers. The algorithm has been packaged as a Pig UDF, so it can be called by adding a few lines to a Pig script.
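Thresholding in RPCA formulations is commonly done with a soft-thresholding (shrinkage) operator, applied either to singular values or to matrix entries. The sketch below only illustrates that operator under this assumption; it is not taken from the Surus source, and the class name SoftThreshold, the method name apply, and the penalty value are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not taken from the Surus source.
final class SoftThreshold {

    // Shrinks each value toward zero by `penalty`. Values whose magnitude
    // is at most `penalty` become exactly zero.
    static double[] apply(double[] values, double penalty) {
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            double v = values[i];
            if (v > penalty) {
                out[i] = v - penalty;
            } else if (v < -penalty) {
                out[i] = v + penalty;
            } else {
                out[i] = 0.0;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up singular values and penalty, chosen only for the demo.
        double[] singularValues = {5.0, 1.5, 0.2};
        System.out.println(Arrays.toString(apply(singularValues, 1.0)));
    }
}
```

Small values are zeroed out, which is what pushes the result toward a low-rank or sparse structure; larger values are kept but reduced by the penalty.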

RAD has two important applications at Netflix, both of which have been notably successful.

Scenario One:

Netflix uses RAD to detect anomalies, such as failures, in payment networks at the bank level.

Scenario Two:

The registration process for the site is another important application scenario. Users around the world register with hundreds of browsers and devices. When users run into registration problems, identifying the affected combinations of devices and browsers lets engineers find the cause in time and help resolve it.

Code entry: /surus-master/src/main/java/org/surus/pig/RAD.java

The code depends on /surus-master/src/main/java/org/surus/math/RPCA.java,

/surus-master/src/main/java/org/surus/math/AugmentedDickeyFuller.java,

/surus-master/src/main/java/org/surus/math/RidgeRegression.java

Code walkthrough:

Creating an instance of the RAD class requires passing in a string array, formatted as follows:

private static final String[] argsDaily9 = new String[]{"metric", "9", "7", "False"};

"metric" is assigned to rad.colName; 9 is assigned to rad.nCols, the number of columns of the matrix; 7 is assigned to rad.nRows, the number of rows of the matrix; and "False" is assigned to rad.isForceDiff, which relates to the Dickey-Fuller test (false means not using Dickey-...).
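The mapping from the four strings to typed fields can be sketched as follows. This is a minimal illustration assuming the parameter order described above; RadArgs is a hypothetical stand-in, not the RAD class itself, and the real constructor may validate or parse its arguments differently.

```java
// Hypothetical holder for illustration; the real RAD class may differ.
final class RadArgs {
    final String colName;      // name of the metric column
    final int nCols;           // number of matrix columns
    final int nRows;           // number of matrix rows
    final boolean isForceDiff; // flag tied to the Dickey-Fuller step

    RadArgs(String... parameters) {
        if (parameters.length != 4) {
            throw new IllegalArgumentException(
                "expected 4 parameters, got " + parameters.length);
        }
        this.colName = parameters[0];
        this.nCols = Integer.parseInt(parameters[1]);
        this.nRows = Integer.parseInt(parameters[2]);
        // Boolean.parseBoolean ignores case, so "False" parses to false.
        this.isForceDiff = Boolean.parseBoolean(parameters[3]);
    }

    public static void main(String[] args) {
        RadArgs parsed = new RadArgs("metric", "9", "7", "False");
        System.out.println(parsed.colName + " " + parsed.nRows + "x"
            + parsed.nCols + " forceDiff=" + parsed.isForceDiff);
    }
}
```

With these values the series is expected to contain 7 x 9 = 63 observations.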

Add the data set you want to check to Tuple objects using the tuple.append() method, and add each tuple to a DataBag using the dataBag.add(tuple) method. Finally, wrap the DataBag in a tuple using the tuple.append(dataBag) method, which gives the data the type that RAD requires. The data is then checked using the rad.exec(Tuple input) method. Note that the output of the detection needs to be formatted, which is done with the rad.outputSchema() method.

The rad.exec() method begins with some validation, including checking whether the input is empty and whether the type of the input data is supported for detection. The code is:

[Image: 11.png, http://s3.51cto.com/wyfs02/M00/5B/35/wKiom1UBgInS577PAAJXA9TCzD0198.jpg]

Next, the input data is retrieved. Its type is DataBag, and the code is: DataBag inputBag = (DataBag) input.get(0); The data in inputBag is then iterated into memory and stored in a list, and the array used to instantiate RAD is checked against the format of the data being detected. The code is:

[Image: 10.png, http://s3.51cto.com/wyfs02/M00/5B/35/wKiom1UBgJzyNEGjAACBalcl-hY336.jpg]

The input data is then converted into a one-dimensional array with the following code:

[Image: 9.png, http://s3.51cto.com/wyfs02/M01/5B/2F/wKioL1UBgc2gALZrAAPonx8cNNI556.jpg]

To determine whether to use the Dickey-Fuller test, the code is:

[Image: 8.png, http://s3.51cto.com/wyfs02/M01/5B/35/wKiom1UBgMCTGd_jAAINOJOVByI128.jpg]
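In time-series preprocessing, a Dickey-Fuller test is commonly used to decide whether a series should be differenced before further analysis. The sketch below assumes that use and does not reproduce the code from the screenshot. The class Differencing and its methods are hypothetical, the test result is passed in as a plain boolean rather than computed, and reading forceDiff as "always difference" is only one plausible interpretation of the flag name.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not taken from the Surus source.
final class Differencing {

    // Returns the first difference x[i] - x[i-1], with a leading 0 so the
    // output has the same length as the input.
    static double[] zeroPaddedDiff(double[] series) {
        double[] diff = new double[series.length];
        for (int i = 1; i < series.length; i++) {
            diff[i] = series[i] - series[i - 1];
        }
        return diff;
    }

    // Chooses between the raw and the differenced series. `needsDiff`
    // stands in for the outcome of a stationarity test (assumption).
    static double[] prepare(double[] series, boolean forceDiff, boolean needsDiff) {
        return (forceDiff || needsDiff) ? zeroPaddedDiff(series) : series;
    }

    public static void main(String[] args) {
        double[] series = {1.0, 4.0, 9.0, 16.0};
        System.out.println(Arrays.toString(prepare(series, false, true)));
    }
}
```

Zero-padding keeps the series length unchanged, so it can still be reshaped into a matrix with the same number of rows and columns.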

RAD mainly uses the RPCA algorithm. RPCA is implemented with matrix computations, so the one-dimensional array is converted to a two-dimensional array. The code is:

[Image: 7.png, http://s3.51cto.com/wyfs02/M02/5B/2F/wKioL1UBgfCSegryAACHfU9dJCc919.jpg]
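A conversion of this kind can be sketched as below. The sketch assumes a column-by-column fill, so that with 7 rows and 9 columns each column would hold one 7-value period; the fill order, the class name Reshape, and the method name vectorToMatrix are assumptions for illustration, not the code shown in the screenshot.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not taken from the Surus source.
final class Reshape {

    // Fills an nRows x nCols matrix column by column from a flat series.
    static double[][] vectorToMatrix(double[] x, int nRows, int nCols) {
        if (x.length != nRows * nCols) {
            throw new IllegalArgumentException(
                "series length " + x.length + " does not equal nRows * nCols");
        }
        double[][] matrix = new double[nRows][nCols];
        for (int n = 0; n < x.length; n++) {
            matrix[n % nRows][n / nRows] = x[n];
        }
        return matrix;
    }

    public static void main(String[] args) {
        double[] series = {1, 2, 3, 4, 5, 6};
        // prints [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
        System.out.println(Arrays.deepToString(vectorToMatrix(series, 3, 2)));
    }
}
```

The length check mirrors the earlier validation step: the number of observations has to match the row and column counts given when the detector is created.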

The RPCA algorithm is then instantiated with this two-dimensional array. The code is:

RPCA rSVD = new RPCA(input2DArray, this.lpenalty, this.spenalty);

Next, the RPCA-related matrices are retrieved, including the E, L, and S matrices. The code is:

[Image: 6.png, http://s3.51cto.com/wyfs02/M02/5B/35/wKiom1UBgQGxC2SAAAB4a1IOxek920.jpg]
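In the usual RPCA formulation the input matrix X is split as X = L + S + E, where L is the low-rank part, S is the sparse part that carries the outliers, and E is the remaining error. Assuming that convention applies to the three matrices named here, the sketch below shows how E relates to the other two and how outlier positions could be read from S. The class RpcaParts, its methods, and the sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not taken from the Surus source.
final class RpcaParts {

    // Computes E = X - L - S element by element.
    static double[][] residual(double[][] x, double[][] l, double[][] s) {
        double[][] e = new double[x.length][];
        for (int i = 0; i < x.length; i++) {
            e[i] = new double[x[i].length];
            for (int j = 0; j < x[i].length; j++) {
                e[i][j] = x[i][j] - l[i][j] - s[i][j];
            }
        }
        return e;
    }

    // Lists the {row, column} positions where the sparse part is non-zero.
    static List<int[]> outlierCells(double[][] s) {
        List<int[]> cells = new ArrayList<>();
        for (int i = 0; i < s.length; i++) {
            for (int j = 0; j < s[i].length; j++) {
                if (s[i][j] != 0.0) {
                    cells.add(new int[] {i, j});
                }
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        // Made-up matrices: L has rank 1, S has a single non-zero entry.
        double[][] x = {{10.0, 11.0}, {20.0, 40.0}};
        double[][] l = {{10.0, 11.0}, {20.0, 22.0}};
        double[][] s = {{0.0, 0.0}, {0.0, 17.0}};
        System.out.println("residual at (1,1): " + residual(x, l, s)[1][1]);
        System.out.println("outlier cells: " + outlierCells(s).size());
    }
}
```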

Finally, the observation values are built. The code is:

[Image: 5.png, http://s3.51cto.com/wyfs02/M00/5B/30/wKioL1UBgjCgAxQuAAJWNbsVJlg986.jpg]

IV. Usage

1, Download the source code from https://github.com/Netflix/Surus

2, Enter the Surus home directory and compile it by running the command mvn clean package (Maven must be installed on the system in advance).

When the command executes successfully, a target folder appears in the Surus root directory.

[Image: 4.png, http://s3.51cto.com/wyfs02/M00/5B/35/wKiom1UBgRvyA3CaAAJJRk7fL-g739.jpg]

The target folder contains a Surus-xxx.jar package.

[Image: 3.png, http://s3.51cto.com/wyfs02/M01/5B/30/wKioL1UBgkaA6a6ZAAJFIWItsZA591.jpg]

3, Place Surus-xxx.jar in the Pig/Hive lib directory and import it into the runtime environment. For example, in a Pig script, the jar is registered in the runtime environment with the code: register Surus-xxx.jar;



This article is from the "Punk" blog, please be sure to keep this source http://yimaoqian.blog.51cto.com/1328422/1619832
