[TOC]
This article is based on *Spark: Rapid Big Data Analysis* (Learning Spark) and summarizes the use of Spark's core RDD API and MLlib, one of its key libraries.
Initialization
Spark shell: bin/pyspark
Each Spark application consists of a driver program that launches various parallel operations on a cluster. The driver program contains the application's main function, defines distributed datasets on the cluster, and applies operations to them. The driver program accesses Spark through a SparkContext object (conventionally named sc), which represents a connection to the compute cluster and can be used to create RDDs.
sc.textFile("file-name")   # create an RDD representing a file
sc.parallelize([1, 2, 3])  # turn an existing in-memory collection into an RDD
bin/spark-submit python-file.py
# initialize a SparkContext
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("MyApp")
sc = SparkContext(conf=conf)
RDD
RDDs support two kinds of operations: transformations and actions. A transformation produces a new RDD from an existing one, while an action computes a result from an RDD. Spark evaluates transformations lazily: they are only actually computed the first time they are needed by an action.
Common RDD transformations:
map(func) applies func to each element of the RDD and returns a new RDD of the results; func is often written as a lambda expression.
filter(func) returns a new RDD containing only the elements for which func returns true.
flatMap(func) produces multiple output elements for each input element; a simple application is splitting an input string into words.
distinct() returns a new RDD containing only the distinct elements. It is expensive because it shuffles all the data over the network.
union(other) returns an RDD containing all the elements of both RDDs, including duplicates.
intersection(other) returns only the elements present in both RDDs, removing duplicates.
subtract(other) returns the elements that are in the first RDD but not in the second; requires a shuffle.
cartesian(other) computes the Cartesian product, returning all possible (a, b) pairs; useful, for example, for estimating a user's expected interest in every product, or (as a self Cartesian product) for computing user similarity. Very expensive for large RDDs.
sample(withReplacement, fraction, [seed]) samples the RDD, with or without replacement.
Common RDD actions (a short example follows this list):
collect() returns all the elements in the RDD.
count() returns the number of elements in the RDD.
countByValue() returns the number of times each element occurs in the RDD.
take(num) returns num elements from the RDD.
top(num) returns the first num elements.
takeOrdered(num)(ordering) returns the first num elements of the RDD in the order provided.
reduce(func) combines all the elements of the RDD in parallel.
fold(zero)(func) is the same as reduce, but takes an initial value.
aggregate(zeroValue)(seqOp, combOp) is similar to reduce, but can return a result of a different type.
foreach(func) applies the given function to each element of the RDD.
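For concreteness, here is a minimal sketch (the small list of numbers is made up for illustration) showing how transformations stay lazy until an action runs:
nums = sc.parallelize([1, 2, 2, 3, 4])        # a toy RDD
squares = nums.map(lambda x: x * x)           # transformation: nothing is computed yet
evens = squares.filter(lambda x: x % 2 == 0)  # another lazy transformation
evens.collect()                               # action: triggers the computation, returns [4, 4, 16]
nums.distinct().count()                       # 4 distinct elements
nums.reduce(lambda x, y: x + y)               # 12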
Key-value pair operations: pair RDDs
Pair RDDs provide an interface for operating on each key in parallel or regrouping data across the network.
reduceByKey() lets you aggregate the data corresponding to each key separately.
join() merges two RDDs by combining elements that have the same key.
Creating a pair RDD
To convert a normal RDD into a pair RDD, the function passed to map() needs to return key-value pairs:
pairs = lines.map(lambda x: (x.split(" ")[0], x))
Transformations on a single pair RDD (a word-count sketch follows this list):
reduceByKey(func) merges values that have the same key.
groupByKey() groups values that have the same key.
combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner) merges values that have the same key, possibly returning a different result type.
mapValues(func) applies a function to each value without changing the key.
flatMapValues(func) applies a function that returns an iterator to each value, then generates one key-value pair for each returned element, keeping the original key; typically used for tokenization.
keys() returns an RDD containing only the keys.
values() returns an RDD containing only the values.
sortByKey() returns an RDD sorted by key.
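As an illustration of several of these operations together, here is a minimal word-count sketch (the input path is hypothetical):
lines = sc.textFile("README.md")                       # hypothetical input file
pairs = lines.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)         # merge the counts for each word
counts.mapValues(lambda c: c * 2).sortByKey().take(5)  # mapValues and sortByKey leave the keys unchanged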
Transformations on two pair RDDs (p. 43):
subtractByKey(other) removes elements whose key matches a key in the other RDD.
MLlib
MLlib contains only parallel algorithms that run well on a cluster.
It requires the gfortran linear algebra runtime to be pre-installed.
Data types
Located in the org.apache.spark.mllib package:
Vector: created via the mllib.linalg.Vectors class.
from numpy import array
from pyspark.mllib.linalg import Vectors
Create dense vectors:
denseVec1 = array([1.0, 2.0, 3.0])  # pass a NumPy array directly
denseVec2 = Vectors.dense([1.0, 2.0, 3.0])
Create a sparse vector, which takes only the vector's dimension plus its non-zero positions and values; the positions and values can be passed as a dictionary, or as two lists holding the positions and the values respectively:
sparseVec1 = Vectors.sparse(4, {0: 1.0, 2: 2.0})
sparseVec2 = Vectors.sparse(4, [0, 2], [1.0, 2.0])
LabeledPoint: a labeled data point used in supervised learning such as classification and regression; it contains a feature vector and a label (represented by a floating-point number). Located in mllib.regression (a short construction example follows this list).
Rating: a user's rating of a product; located in mllib.recommendation.
Various model classes: the results of training algorithms; each generally has a predict() method for applying the model to a new data point or to an RDD of data points.
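As a small illustration (the values are made up), a LabeledPoint simply pairs a float label with a dense or sparse feature vector:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
pos = LabeledPoint(1.0, Vectors.dense([1.0, 0.0, 3.0]))         # label 1.0 with a dense feature vector
neg = LabeledPoint(0.0, Vectors.sparse(3, [0, 2], [1.0, 3.0]))  # the same features as a sparse vector
pos.label, pos.features                                         # access the label and the feature vector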
Feature extraction (the mllib.feature package)
TF-IDF:
HashingTF: computes a term-frequency vector of a given size from a document, using the hashing trick; each "document" must be represented as an iterable sequence of objects (for example, a list of words).
# build term-frequency vectors for each document
from pyspark.mllib.feature import HashingTF, IDF
rdd = sc.wholeTextFiles("data").map(lambda (name, text): text.split())
tf = HashingTF()
tfVectors = tf.transform(rdd).cache()
# IDF computes the inverse document frequency; combine it with TF to get TF-IDF vectors
idf = IDF()
idfModel = idf.fit(tfVectors)
tfIdfVectors = idfModel.transform(tfVectors)
Scaling: most algorithms consider the magnitude of each element in the feature vector and perform best when the features are scaled so that they are treated equally, for example so that each feature has mean 0 and standard deviation 1.
Method: create a StandardScaler object, call its fit() method on the dataset to obtain a StandardScalerModel (which computes the mean and standard deviation of each column), and then use the model object's transform() method to scale the dataset.
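A minimal sketch of that sequence, on a hypothetical two-vector dataset:
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.linalg import Vectors
dataset = sc.parallelize([Vectors.dense([-2.0, 5.0, 1.0]), Vectors.dense([2.0, 0.0, 1.0])])
scaler = StandardScaler(withMean=True, withStd=True)  # scale each column to mean 0 and standard deviation 1
model = scaler.fit(dataset)                           # computes the column means and standard deviations
result = model.transform(dataset)                     # an RDD of scaled vectors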
Normalization: normalizes the length of each vector to 1 via Normalizer().transform(rdd); by default this uses the L2 norm (Euclidean length).
Word2Vec: a neural-network-based text feature algorithm. It must be passed a corpus represented as an iterable of String iterables. After training with Word2Vec().fit(rdd) you get a Word2VecModel, whose transform() method converts a word into a vector. The size of the model equals the number of words in the vocabulary multiplied by the vector size.
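A minimal sketch, assuming a tiny in-memory corpus where each document is a list of words (repeated so every word clears Word2Vec's minimum-count cutoff):
from pyspark.mllib.feature import Word2Vec
corpus = sc.parallelize([["spark", "is", "fast"], ["spark", "is", "scalable"]] * 10)
model = Word2Vec().fit(corpus)  # returns a Word2VecModel; with a corpus this small the vectors are illustrative only
vec = model.transform("spark")  # the vector for one word
model.findSynonyms("spark", 2)  # nearest words by cosine similarity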
Statistics
Basic statistics are provided by the mllib.stat.Statistics class (a short sketch follows this list):
Statistics.colStats(rdd) computes a statistical summary of an RDD of vectors, giving the minimum, maximum, mean, and variance of each column.
Statistics.corr(rdd, method) computes the correlation matrix between the columns of an RDD of vectors, using either Pearson or Spearman correlation; method must be "pearson" or "spearman".
Statistics.corr(rdd1, rdd2, method) computes the correlation between two RDDs of floating-point values.
Statistics.chiSqTest(rdd) computes Pearson's independence test between each feature and the label in an RDD of LabeledPoint objects; it returns a ChiSqTestResult object for each feature, with the p-value, test statistic, and degrees of freedom. Labels and features must be categorical.
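A minimal sketch of the summary and correlation calls, on a hypothetical RDD of vectors:
from pyspark.mllib.stat import Statistics
from pyspark.mllib.linalg import Vectors
vectors = sc.parallelize([Vectors.dense([1.0, 10.0]), Vectors.dense([2.0, 20.0]), Vectors.dense([3.0, 31.0])])
summary = Statistics.colStats(vectors)
summary.mean(), summary.variance(), summary.min(), summary.max()  # per-column statistics
Statistics.corr(vectors, method="pearson")                        # correlation matrix between the two columns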
Linear regression
Classification and regression are supervised learning methods; they all use the mllib.regression.LabeledPoint class, which holds a label plus a feature vector.
Linear regression predicts an output value from a linear combination of the features; L1- and L2-regularized variants (Lasso and ridge regression) are also supported.
Classes: mllib.regression.LinearRegressionWithSGD, LassoWithSGD, RidgeRegressionWithSGD.
SGD stands for stochastic gradient descent.
These classes all have several parameters that can be used to tune the algorithm:
numIterations: the number of iterations to run (default 100).
stepSize: the step size for gradient descent.
intercept: whether to add an intercept (bias) feature to the data, i.e., a feature whose value is always 1 (default False).
regParam: the regularization parameter for Lasso and ridge regression.
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.regression import LinearRegressionWithSGD
points = # (an RDD of LabeledPoint)
model = LinearRegressionWithSGD.train(points, iterations=200, intercept=True)
print "weights: %s, intercept: %s" % (model.weights, model.intercept)
Logistic regression
Logistic regression is a binary classification method that finds a linear separating plane for the data. Both the SGD and LBFGS solvers are supported.
Http://blog.sina.com.cn/s/blog_eb3aea990101gflj.html
A LogisticRegressionModel computes a score between 0 and 1 for each point, then returns 0 or 1 based on a threshold; by default, scores of 0.5 or above map to 1. The threshold can be changed with setThreshold(), or removed with clearThreshold() so that predict() returns the raw score.
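A minimal sketch of training and thresholding (the two LabeledPoints are made up for illustration):
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
data = sc.parallelize([LabeledPoint(0.0, Vectors.dense([0.0, 1.0])),
                       LabeledPoint(1.0, Vectors.dense([1.0, 0.0]))])
model = LogisticRegressionWithSGD.train(data, iterations=100)
model.predict(Vectors.dense([1.0, 0.0]))  # returns 0 or 1, using the default 0.5 threshold
model.clearThreshold()
model.predict(Vectors.dense([1.0, 0.0]))  # now returns the raw score between 0 and 1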
Support vector machines
SVMs are another binary classification method that uses a linear separating plane; they expect labels of 0 or 1. The algorithm is available through the SVMWithSGD class and returns an SVMModel.
Naive Bayes
Naive Bayes is a multiclass classification algorithm that uses linear functions of the features to compute a score for each class. It is typically used for text classification with TF-IDF features. MLlib implements multinomial naive Bayes, which requires non-negative frequencies as input features.
The mllib.classification.NaiveBayes class implements the algorithm and supports a smoothing parameter lambda_. It can be called on an RDD of LabeledPoint; for C classes, the label values must range from 0 to C-1. It returns a NaiveBayesModel, whose predict() method returns the most suitable class for a point. You can also access two parameters of the trained model: theta, the matrix of class probabilities for each feature (of size C x D for C classes and D features), and pi, the C-dimensional vector of class priors.
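A minimal sketch, using a toy two-class dataset of non-negative frequency features:
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
data = sc.parallelize([LabeledPoint(0.0, Vectors.dense([3.0, 0.0])),
                       LabeledPoint(1.0, Vectors.dense([0.0, 4.0]))])
model = NaiveBayes.train(data, lambda_=1.0)  # lambda_ is the smoothing parameter
model.predict(Vectors.dense([1.0, 0.0]))     # most likely class for a new point
model.theta, model.pi                        # trained parameters (stored as logs of the probabilities)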
Decision trees and random forests
Decision trees can be used for classification or for regression. They take the form of a tree of nodes; at each node the tree makes a binary decision based on a feature of the data, and each leaf node contains a prediction. Advantages: the model is easy to inspect, and both categorical and continuous features are supported.
The static methods trainClassifier() and trainRegressor() in the mllib.tree.DecisionTree class train a decision tree and accept the following parameters:
data: an RDD of LabeledPoint.
numClasses: the number of classes to use (classification only).
impurity: the node impurity measure; "gini" or "entropy" for classification, must be "variance" for regression.
maxDepth: the maximum depth of the tree (default 5).
maxBins: the number of bins the data is split into when building each node (recommended value 32).
categoricalFeaturesInfo: a map specifying which features are categorical and how many categories each has. For example, if feature 1 is a binary feature with labels 0 and 1, and feature 2 is a ternary feature with labels 0, 1, and 2, pass {1: 2, 2: 3}; if no features are categorical, pass an empty map.
The training methods return a DecisionTreeModel object. Its predict() method predicts the value for a new feature vector or for an RDD of vectors, and toDebugString() outputs the tree.
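A minimal sketch of training a classifier (the two-point dataset is made up for illustration):
from pyspark.mllib.tree import DecisionTree
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
data = sc.parallelize([LabeledPoint(0.0, Vectors.dense([0.0, 1.0])),
                       LabeledPoint(1.0, Vectors.dense([1.0, 0.0]))])
model = DecisionTree.trainClassifier(data, numClasses=2, categoricalFeaturesInfo={},
                                     impurity="gini", maxDepth=5, maxBins=32)
model.predict(Vectors.dense([1.0, 0.0]))  # predicted class for a new feature vector
model.toDebugString()                     # human-readable description of the tree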
The RandomForest class builds an ensemble of trees, a random forest, via RandomForest.trainClassifier() and trainRegressor(); in addition to the decision-tree parameters above, these methods accept:
numTrees: the number of trees to build; increasing numTrees reduces the likelihood of overfitting the training data.
featureSubsetStrategy: the number of features to consider when making decisions at each node; can be "auto", "all", "sqrt", "log2", or "onethird"; larger values are more expensive.
seed: the random seed to use.
Random forests return a WeightedEnsembleModel object that contains several decision trees (in the weakHypotheses field, weighted by weakHypothesisWeights). You can call predict() on an RDD or on a single vector, and toDebugString() outputs all of the trees.
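A minimal sketch with the same kind of toy dataset (note that recent versions of MLlib return a RandomForestModel here):
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors
data = sc.parallelize([LabeledPoint(0.0, Vectors.dense([0.0, 1.0])),
                       LabeledPoint(1.0, Vectors.dense([1.0, 0.0]))])
model = RandomForest.trainClassifier(data, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=10, featureSubsetStrategy="auto", seed=42)
model.predict(Vectors.dense([1.0, 0.0]))  # predicted class
model.toDebugString()                     # prints every tree in the ensemble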
Clustering
Call mllib.clustering.KMeans.train() with an RDD of vectors; it returns a KMeansModel object. The model's clusterCenters attribute holds the cluster centers (an array of vectors), and calling predict() on a point returns the cluster whose center is closest to it. (A short sketch follows the parameter list below.)
Parameters:
initializationMode: the method used to initialize the cluster centers; can be "k-means||" or "random". The former is the default and generally gives better results, but it is more expensive.
maxIterations: the maximum number of iterations to run (default 100).
runs: the number of concurrent runs of the algorithm.
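A minimal sketch, clustering a few made-up two-dimensional points into two clusters:
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.linalg import Vectors
points = sc.parallelize([Vectors.dense([0.0, 0.0]), Vectors.dense([0.1, 0.1]),
                         Vectors.dense([9.0, 9.0]), Vectors.dense([9.1, 9.1])])
model = KMeans.train(points, k=2, maxIterations=100, initializationMode="k-means||")
model.clusterCenters                      # a list of cluster-center vectors
model.predict(Vectors.dense([8.5, 9.0]))  # index of the closest cluster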
Collaborative filtering and recommendation
Collaborative filtering is a recommender-system technique that recommends new products based on users' interactions with and ratings of many products.
Alternating least squares (ALS) is a common collaborative-filtering algorithm, found in the mllib.recommendation.ALS class. ALS learns a feature vector for each user and for each product so that the dot product of a user vector and a product vector is close to that user's rating of the product.
Parameters:
rank: the size of the feature vectors to use; larger vectors can produce a better model but are more expensive to compute (default 10).
iterations: the number of iterations to run (default 5).
lambda: the regularization parameter.
alpha: a constant used to compute confidence in implicit ALS (default 1.0).
numUserBlocks, numProductBlocks: the number of blocks to split the user and product data into, which controls the level of parallelism; pass -1 to let MLlib decide automatically.
To use ALS, you need an RDD of mllib.recommendation.Rating objects, each containing a user ID, a product ID, and a rating. The IDs must be 32-bit integer values; if they are large numbers or strings, hash them to integers, or broadcast() a table mapping product IDs to integer values. (A short sketch follows.)
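A minimal sketch, with a handful of made-up (userID, productID, rating) triples:
from pyspark.mllib.recommendation import ALS, Rating
ratings = sc.parallelize([Rating(1, 10, 5.0), Rating(1, 20, 1.0),
                          Rating(2, 10, 4.0), Rating(2, 30, 5.0)])
model = ALS.train(ratings, rank=10, iterations=5)  # explicit ratings; use ALS.trainImplicit() for implicit feedback
model.predict(2, 20)                               # predicted rating of product 20 by user 2
model.recommendProducts(1, 2)                      # top two recommended products for user 1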
ALS returns a MatrixFactorizationModel object representing its result; you can call recommendProducts(userID, numProducts) to find the top numProducts products recommended for a user. The MatrixFactorizationModel is large, since it stores a vector for each user and each product; model.userFeatures and model.productFeatures can be saved to a distributed file system.
By default ALS expects explicit ratings; for implicit feedback, call ALS.trainImplicit() instead. With explicit ratings, each user's rating of a product is a score and predictions are also scores; with implicit feedback, each value represents the confidence that the user will interact with the given product, and predictions are also confidence values.
Dimensionality reduction
PCA maps data to a lower-dimensional space while maximizing the variance of the data's representation in that space. To compute the mapping, we construct the normalized correlation matrix and use its singular vectors and singular values; the singular vectors corresponding to the largest singular values are used to reconstruct the principal components of the original data.
The mllib.linalg.distributed.RowMatrix class represents the matrix, backed by an RDD of vectors, one per row.
PCA in Scala:
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val points: RDD[Vector] = // ...
val mat: RowMatrix = new RowMatrix(points)
val pc: Matrix = mat.computePrincipalComponents(2)
// project the points into the two-dimensional space
val projected = mat.multiply(pc).rows
// train a k-means model on the projected two-dimensional data
val model = KMeans.train(projected, 10)
Singular value decomposition
SVD decomposes an m x n matrix A into three matrices, A ≈ U Σ V^T, where U is an orthogonal matrix whose columns are called the left singular vectors; Σ is a diagonal matrix whose diagonal entries are non-negative and in descending order, and these entries are called the singular values; and V is an orthogonal matrix whose columns are called the right singular vectors.
For large matrices the full decomposition is usually unnecessary; we only compute the top singular values and their corresponding singular vectors, which saves storage space, reduces noise, and helps recover the low-rank structure of the matrix. If the top k singular values are kept, the resulting matrices are U: m x k, Σ: k x k, and V: n x k.
val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(20, computeU = true)
val U: RowMatrix = svd.U  // U is a distributed RowMatrix
val s: Vector = svd.s     // the singular values, represented by a local dense vector
val V: Matrix = svd.V     // V is a local dense matrix
Model Evaluation
The mllib.evaluation package contains the BinaryClassificationMetrics and MulticlassMetrics classes. You can create a metrics object from an RDD of (prediction, ground truth) pairs and then compute measures such as precision, recall, and the area under the receiver operating characteristic (ROC) curve. Run these methods on a held-out set rather than the training set: generate an RDD of (prediction, ground truth) pairs and cache it, or, if it does not fit in memory, use persist(StorageLevel.DISK_ONLY).
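A minimal sketch of the binary-classification case, using made-up (score, label) pairs:
from pyspark.mllib.evaluation import BinaryClassificationMetrics
# (prediction, ground truth) pairs, e.g. produced by running a model on a held-out set
scoreAndLabels = sc.parallelize([(0.9, 1.0), (0.2, 0.0), (0.8, 1.0), (0.4, 0.0)])
metrics = BinaryClassificationMetrics(scoreAndLabels)
metrics.areaUnderROC  # area under the ROC curve
metrics.areaUnderPR   # area under the precision-recall curve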