SparkR Concise User's Manual


1. Installation and Configuration of SparkR

1.1. Installation of R and RStudio

1.1.1. R

Our working environment runs on Ubuntu, so we only cover how to install R on Ubuntu:

1) Add a CRAN source to /etc/apt/sources.list:

deb http://mirror.bjtu.edu.cn/cran/bin/linux/ubuntu precise/

Then update the package index: sudo apt-get update

2) Install via apt-get:

sudo apt-get install r-base
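To confirm that R installed correctly, you can check the version from the shell:

R --version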

1.1.2. Installation of RStudio Server

The official website has a detailed introduction:

http://www.rstudio.com/products/rstudio/download-server/

sudo apt-get install gdebi-core

sudo apt-get install libapparmor1 # Required only for Ubuntu, not Debian

wget http://download2.rstudio.org/rstudio-server-0.97.551-amd64.deb

sudo gdebi rstudio-server-0.97.551-amd64.deb

1.2. rJava Installation

1.2.1. rJava Introduction

rJava is a communication interface between R and Java. It is implemented on top of JNI, allowing Java objects and methods to be invoked directly from R.

rJava also allows Java to invoke R, a feature implemented through JRI (Java/R Interface). JRI is now bundled inside the rJava package, and it can also be tried on its own. The rJava package has become a basic building block for many Java-based R development packages.

Because rJava is a low-level interface that calls through JNI, it is very efficient. In the JRI scenario, the JVM loads the R engine directly in memory, and the calling overhead is virtually negligible, making it a very efficient connection channel and the preferred package for communication between R and Java.
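As an illustrative sketch (not from the original article), a minimal rJava session that calls a Java method from R looks roughly like this:

library(rJava)
.jinit()                                        # start the embedded JVM
s <- .jnew("java/lang/String", "hello rJava")   # create a java.lang.String object
.jcall(s, "I", "length")                        # call its length() method; "I" declares an int return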

1.2.2. rJava Installation

1) Configure the rJava environment

Execute R CMD javareconf (as root):

R CMD javareconf

2) Start R and install rJava

R

> install.packages("rJava")

1.3. SparkR Installation

1.3.1. Downloading the SparkR Code

Download the code archive SparkR-pkg-master.zip from https://github.com/amplab-extras/SparkR-pkg

1.3.2. Compiling the SparkR Code

1) Unzip SparkR-pkg-master.zip, then cd into SparkR-pkg-master/

2) Specify the Hadoop and Spark versions when compiling:

SPARK_HADOOP_VERSION=2.4.1 SPARK_VERSION=1.2.0 ./install-dev.sh

At this point, the standalone version of SparkR is installed.
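If the build succeeded, a quick local smoke test can be run with the sparkR launcher script and the example programs shipped in the SparkR-pkg repository (paths as in the upstream repository; they may differ in your checkout):

./sparkR examples/pi.R local[2]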

1.3.3. Deployment Configuration for Distributed SparkR

1) A successful compilation generates a lib folder. Enter the lib folder and package SparkR as SparkR.tar.gz, which is the key to the distributed SparkR deployment.
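A minimal sketch of the packaging step, assuming the compiled package sits in lib/SparkR as in the upstream build:

cd lib
tar -czf SparkR.tar.gz SparkR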

2) Install SparkR on each cluster node from the packaged SparkR.tar.gz:

R CMD INSTALL SparkR.tar.gz
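A minimal sketch of distributing the archive and installing it on each node (node names and paths are placeholders):

for node in node1 node2 node3; do
  scp SparkR.tar.gz $node:/tmp/
  ssh $node "R CMD INSTALL /tmp/SparkR.tar.gz"
done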

The distributed SparkR deployment is now complete.

2. Operation of SparkR

2.1. Operating Mechanism of SparkR

SparkR is an R package released by AMPLab that provides a lightweight front end for Apache Spark. SparkR exposes Spark's Resilient Distributed Dataset (RDD) API in R, allowing users to run jobs interactively from the R shell on a cluster. SparkR combines the benefits of Spark and R; the original article illustrates its operating mechanism with three figures (not reproduced here).

2.2. Using SparkR for Data Analysis

2.2.1. SparkR Basic Operations

First, the basic operations of SparkR:

Step 1: load the SparkR package

library(SparkR)

Step 2: initialize the Spark context

sc <- sparkR.init(master="spark://localhost:7077",
                  sparkEnvir=list(spark.executor.memory="1g", spark.cores.max="10"))

Step 3: read the data. The core abstraction in Spark is the Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (for example, HDFS files) or by transforming other RDDs. For example, the following reads data directly from HDFS as an RDD:

lines <- textFile(sc, "hdfs://testnode6.yytest.com:8020/tmp/sparkr_test.txt")

Alternatively, you can create an RDD from a vector or a list by using the parallelize function, for example:

rdd <- parallelize(sc, 1:10, 2)

From here, we can use actions and transformations to operate on the RDD and generate new RDDs, and we can also call R packages on the cluster. You only need to load the R package with includePackage before performing the operation on the cluster (for example: includePackage(sc, Matrix)); of course, an RDD can also be converted into a native R data structure and manipulated locally.
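As a minimal sketch using only the functions introduced above, a transformation followed by an action, plus loading an R package on the workers, looks like this (the values and package name are illustrative):

rdd <- parallelize(sc, 1:10, 2)
squares <- lapply(rdd, function(x) { x * x })   # transformation: evaluated lazily
collect(squares)                                # action: returns the squared values as a list
includePackage(sc, Matrix)                      # make the Matrix package available to worker-side code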

Refer to the following two links for details:

http://amplab-extras.github.io/SparkR-pkg/

https://github.com/amplab-extras/SparkR-pkg/wiki/SparkR-Quick-Start

Now let's look at the two examples below to see how SparkR is used.

2.2.2. SparkR Usage Examples

1) Example 1: Word Count

# Load the SparkR package
library(SparkR)

# Initialize the Spark context
sc <- sparkR.init(master="spark://<host>:7077",
                  sparkEnvir=list(spark.executor.memory="1g", spark.cores.max="10"))

# Read the file from HDFS
lines <- textFile(sc, "hdfs://<cluster IP>:<port>/tmp/sparkr_test.txt")

# Split each line into multiple elements by the delimiter; this returns a sequence
words <- flatMap(lines, function(line) { strsplit(line, "\\|")[[1]] })

# Use lapply to define an operation on each RDD element; this returns (K, V) pairs
wordCount <- lapply(words, function(word) { list(word, 1L) })

# Aggregate the (K, V) pairs
counts <- reduceByKey(wordCount, "+", 2L)

# Return all elements of the dataset as a list
output <- collect(counts)

# Print the results in a formatted way
for (wordcount in output) {
  cat(wordcount[[1]], ":", wordcount[[2]], "\n")
}
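The script above can be run interactively from an R session after library(SparkR), or saved to a file (for example wordcount.R, a name chosen here for illustration) and launched with the sparkR script from SparkR-pkg:

./sparkR wordcount.R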

2) Example 2: Logistic Regression

# Load the SparkR package
library(SparkR)

# Initialize the Spark context
sc <- sparkR.init(master="spark://<host>:7077",
                  appName="sparkr_logistic_regression",
                  sparkEnvir=list(spark.executor.memory="1g",
                                  spark.cores.max="10"))

# Read the txt file from HDFS as 4 partitions on the Spark cluster
input_rdd <- textFile(sc,
                      "hdfs://<cluster IP>:<port>/user/payton/german.data-numeric.txt",
                      minSplits=4)

# Parse the text of each RDD element (in parallel on each partition)
dataset_rdd <- lapplyPartition(input_rdd, function(part) {
  part <- lapply(part, function(x) unlist(strsplit(x, '\\s')))
  part <- lapply(part, function(x) as.numeric(x[x != '']))
  part
})

# We need to split dataset_rdd into a training set (train) and a test set (test), where
# ptest is the sample proportion of the test set; for example, ptest=0.2 takes 20% of the
# samples of dataset_rdd as the test set and the remaining 80% as the training set
split_dataset <- function(rdd, ptest) {
  # Create the test set RDD from the first ptest proportion of the input samples
  data_test_rdd <- lapplyPartition(rdd, function(part) {
    part_test <- part[1:(length(part) * ptest)]
    part_test
  })
  # Create the training set RDD from the remaining samples
  data_train_rdd <- lapplyPartition(rdd, function(part) {
    part_train <- part[((length(part) * ptest) + 1):length(part)]
    part_train
  })
  # Return a list of the test set RDD and the training set RDD
  list(data_test_rdd, data_train_rdd)
}

# Next we need to convert the dataset into R matrices, add a column of 1s as the intercept
# term, and normalize the output variable y to 0/1 form
get_matrix_rdd <- function(rdd) {
  matrix_rdd <- lapplyPartition(rdd, function(part) {
    m <- matrix(data=unlist(part, F, F), ncol=25, byrow=T)
    m <- cbind(1, m)
    m[,ncol(m)] <- m[,ncol(m)] - 1
    m
  })
  matrix_rdd
}

# Because the numbers of y=1 and y=0 samples in this training set are unbalanced (about 7:3),
# we need to balance them so that the two classes have equal sample counts
balance_matrix_rdd <- function(matrix_rdd) {
  balanced_matrix_rdd <- lapplyPartition(matrix_rdd, function(part) {
    y <- part[,26]
    index <- sample(which(y == 0), length(which(y == 1)))
    index <- c(index, which(y == 1))
    part <- part[index,]
    part
  })
  balanced_matrix_rdd
}

# Split the dataset into training and test sets
dataset <- split_dataset(dataset_rdd, 0.2)

# Create the test set RDD
matrix_test_rdd <- get_matrix_rdd(dataset[[1]])

# Create the training set RDD
matrix_train_rdd <- balance_matrix_rdd(get_matrix_rdd(dataset[[2]]))

# Cache the training set RDD and the test set RDD in the Spark cluster's distributed memory
cache(matrix_test_rdd)
cache(matrix_train_rdd)

# Initialize the theta vector
theta <- runif(n=25, min=-1, max=1)

# Logistic (sigmoid) function
hypot <- function(z) {
  1 / (1 + exp(-z))
}

# Gradient of the loss function
gcost <- function(t, x, y) {
  1/nrow(x) * (t(x) %*% (hypot(x %*% t) - y))
}
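# (For reference: gcost implements the standard logistic regression gradient,
#  grad J(theta) = (1/m) * t(X) %*% (sigmoid(X %*% theta) - y), with m = nrow(X);
#  this is the gradient of the cross-entropy loss with respect to theta.)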

# Define the training function
train <- function(theta, rdd) {
  # Compute the gradient on each partition
  gradient_rdd <- lapplyPartition(rdd, function(part) {
    x <- part[,1:25]
    y <- part[,26]
    p_gradient <- gcost(theta, x, y)
    list(list(1, p_gradient))
  })
  agg_gradient_rdd <- reduceByKey(gradient_rdd, '+', 1L)
  # Aggregated gradient for one iteration
  collect(agg_gradient_rdd)[[1]][[2]]
}

# Optimize the loss function with the gradient descent algorithm
# alpha: learning rate
# step: iteration counter
# tol: convergence tolerance
alpha <- 0.1
tol <- 1e-4
step <- 1

while (T) {
  cat("Step:", step, "\n")
  p_gradient <- train(theta, matrix_train_rdd)
  theta <- theta - alpha * p_gradient
  gradient <- train(theta, matrix_train_rdd)
  if (abs(norm(gradient, type="F") - norm(p_gradient, type="F")) <= tol) break
  step <- step + 1
}

# Use the trained model to predict the credit evaluation result ("good" or "bad") on the
# test set and compute the prediction accuracy
test <- lapplyPartition(matrix_test_rdd, function(part) {
  x <- part[,1:25]
  y <- part[,26]
  y_pred <- hypot(x %*% theta)
  result <- xor(as.vector(round(y_pred)), as.vector(y))
})

result <- unlist(collect(test))
corrects <- length(result[result == F])
wrongs <- length(result[result == T])
cat("\nCorrects:", corrects, "\n")
cat("Wrongs:", wrongs, "\n")
cat("Accuracy:", corrects / length(result), "\n")
