The way of Big Data Processing (MATLAB Chapter (ii))

Source: Internet
Author: User

One: Motivation

(0) At first I did not get along with the MATLAB programming language at all. This is probably a common failing among programmers: having learned C++ or Java, we look down on other languages and are too lazy to try them, until one day we find that the language we are proficient in cannot solve the problem at hand, and only then do we make a change.

(1) Recently I have been dealing with big data. Going from MB to GB is a qualitative leap, and the tools change accordingly: from Windows to Linux, and from a single machine to Hadoop multi-node computing.

(2) The question is how, faced with huge amounts of data, to extract useful information or discover latent patterns. Visualization tools may be essential.

(3) A Baidu search turns up plenty of articles on visualization tools, but as researchers and programmers we would rather abstract a mathematical model that describes and characterizes the real-world phenomenon well.

(4) Python (data cleansing and processing) + MATLAB (model analysis), or C++/Java/Hadoop (data cleansing and processing) + MATLAB (model analysis).

(5) For reference, see the earlier post on processing big data with C++ fstream and string.

Two: MATLAB function explanations

(1) For learning MATLAB functions, I personally think help plus Baidu is enough; it is a fairly quick way to learn and apply them.

(2) With a background in other languages, MATLAB is very easy to pick up.

(3) Multivariate linear regression (regress)

clc
clear all
close all
% paths as on the author's machine
X = load('G:\zyp_thanks\multi regression\ traffic flow forecast data \dldatajia1.csv');
Y = load('G:\zyp_thanks\multi regression\ traffic flow prediction Data \dllabel.csv');
[b, bint, r, rint, stats] = regress(Y, X, 0.9);
% returns a p-by-1 vector b of coefficient estimates for a multilinear
% regression of the responses in Y on the predictors in X. X is an n-by-p
% matrix of p predictors at each of n observations. Y is an n-by-1 vector of
% observed responses. regress uses a 100*(1-alpha)% confidence level, so
% alpha = 0.9 gives 10% intervals; use 0.05 for the usual 95%.
y_predict = X*b;
% plot the two curves
plot(Y, 'r');
hold on
plot(y_predict, 'b');
% label the axes after plotting so the labels are not reset
xlabel('Test x axis');
ylabel('Test y-axis');
title('Regression analysis table');
% mean absolute error
err = abs(y_predict - Y);
mean(err)

(4) Function Description:

Linear regression rests on 4 basic assumptions: ① there is a linear relationship between the dependent variable and the independent variables; ② the residuals are independent of each other; ③ the residuals have constant variance (homoscedasticity); ④ the residuals are normally distributed.
MATLAB provides the command regress for general multiple regression analysis. It is called as [b, bint, r, rint, stats] = regress(y, X, alpha) or [b, bint, r, rint, stats] = regress(y, X); in the latter case alpha defaults to 0.05. Here y is a column vector and X is a matrix whose first column is all ones (this matters for regression: the all-ones column corresponds to the constant term of the regression equation). In general you have to create the all-ones column yourself.
In the return values [b, bint, r, rint, stats]:
① b is the vector of regression coefficients; ② bint is a matrix whose i-th row is the 100*(1-alpha)% confidence interval of the i-th coefficient; ③ r is the column vector of residuals; ④ rint is a matrix whose i-th row is the 100*(1-alpha)% confidence interval of the i-th residual.
Comments: bint is the interval estimate of the regression coefficients, r is the residuals, rint is the confidence intervals of the residuals, and stats holds the statistics used to test the regression model. stats contains, in order: first, the R^2 statistic (coefficient of determination); second, the F statistic (generally, the larger the better); third, the p-value corresponding to F; fourth, the estimate of the error variance. alpha is the significance level (default 0.05). The larger R^2 is, the better the regression equation fits; when p < alpha, H0 is rejected and the regression model holds.
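
To make the call format above concrete, here is a minimal sketch on synthetic data. The data, the seed and the variable names x1, x2, R2, F, p and sigma2 are assumptions made for illustration, not taken from the traffic data set; regress requires the Statistics and Machine Learning Toolbox.

```matlab
% Hypothetical data: y = 2 + 3*x1 - x2 + noise
rng(0);                          % fix the seed so the run is repeatable
n  = 100;
x1 = rand(n, 1);
x2 = rand(n, 1);
y  = 2 + 3*x1 - x2 + 0.1*randn(n, 1);

% Prepend the all-ones column so that b(1) is the constant term
X = [ones(n, 1), x1, x2];

alpha = 0.05;                    % 95% confidence intervals
[b, bint, r, rint, stats] = regress(y, X, alpha);

R2     = stats(1);               % R^2 statistic
F      = stats(2);               % F statistic
p      = stats(3);               % p-value of the F test
sigma2 = stats(4);               % estimate of the error variance

fprintf('b = [%.3f %.3f %.3f]\n', b);
fprintf('R^2 = %.4f, F = %.1f, p = %.3g, s^2 = %.4f\n', R2, F, p, sigma2);
if p < alpha
    disp('p < alpha: reject H0, the regression is significant');
end
```

Each row of bint brackets one entry of b, and each row of rint brackets one residual, so bint is 3-by-2 and rint is n-by-2 here.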

hold on and hold off are used as a pair.
hold on: after you draw a plot in the axes (coordinate system) of the current figure, drawing another plot keeps the original; the old and new plots coexist and both are visible.
hold off: after you draw a plot in the axes of the current figure while the state is hold off, drawing another plot replaces the original; only the new plot remains on the axes.
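
The difference can be checked by counting the line objects on the axes. This is a small sketch; the curves and the names n_on and n_off are chosen only for illustration.

```matlab
x = linspace(0, 2*pi, 50);

figure;
plot(x, sin(x), 'r');                  % first curve
hold on                                % keep what is already on the axes
plot(x, cos(x), 'b');                  % second curve is added, both are visible
n_on = numel(get(gca, 'Children'));    % line objects currently on the axes

hold off                               % back to the default behaviour
plot(x, sin(2*x), 'k');                % earlier curves are cleared by this plot
n_off = numel(get(gca, 'Children'));

fprintf('curves with hold on: %d, after hold off: %d\n', n_on, n_off);
```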


(5) PCA + regress

clear all
close all
% paths as on the author's machine
X = load('G:\zyp_thanks\multi regression\ traffic flow forecast data \dldata.csv');
Y = load('G:\zyp_thanks\multi regression\ traffic flow prediction Data \dllabel.csv');
% PCA
[coef, score1, latent, t2] = princomp(X);
% return ... The scores are the data formed by transforming the original
% data to the space of the principal components ...
X = X*coef;                      % project the original data onto the principal axes
X_model = X(1:600, 1:10);        % first 600 rows, first 10 columns
Y_model = Y(1:600);              % first 600 rows
X_test = X(601:1052, 1:10);
Y_test = Y(601:1052);
b = regress(Y_model, X_model);
% training set
y_predict = X_model*b;
% plot the two curves
plot(Y_model, 'r');
hold on
plot(y_predict, 'b');
% mean absolute error
err = abs(y_predict - Y_model);
mean(err)
% decorate the figure
xlabel('time interval (10min)');     % x-axis label
ylabel('speed value (km/h)');        % y-axis label
title('MULTILINEAR_TESTPCA');        % figure title
legend('Training set-real value', 'training set-predictive value');   % legend
grid on;                             % show grid lines
% test set
y_predict = X_test*b;
% plot the two curves
figure, plot(Y_test, 'r');
hold on
plot(y_predict, 'b');
% mean absolute error
err = abs(y_predict - Y_test);
mean(err)
% decorate the figure
xlabel('time interval (10min)');     % x-axis label
ylabel('speed value (km/h)');        % y-axis label
title('MULTILINEAR_TESTPCA');        % figure title
legend('Test set-real value', 'test set-predictive value');           % legend
grid on;                             % show grid lines

(6) Function description: princomp

Contribution rate: how much each dimension of the data contributes to distinguishing the data as a whole. The dimension with the largest contribution rate is the first principal component, the second largest is the second principal component, and so on. [coef, score, latent, t2] = princomp(x); (personal view):
x: the n-dimensional raw input data. Passing it to this built-in MATLAB function produces new n-dimensional processed data (i.e. score), which corresponds one-to-one with the raw data.
score: the resulting n-dimensional processed data. It is the raw data expressed in the new coordinate system found by the analysis, with the n dimensions ordered by contribution rate from largest to smallest.
latent: a column vector in which each entry is the variance of the corresponding dimension of score; dividing it by sum(latent) gives that dimension's contribution rate. Because the data has n dimensions, the vector has n entries, sorted from largest to smallest (just as score is).
coef: the coefficient matrix. From coef you can tell how x is converted into score.
Forward transformation of the original data:
score = bsxfun(@minus, x, mean(x,1)) * coef; (purpose: test data can be transformed into the new coordinate system in the same way)
Inverse transformation:
x = bsxfun(@plus, score*inv(coef), mean(x,1))
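
The two formulas can be checked numerically. The sketch below is an illustration under assumptions: it uses hypothetical random data and calls pca, which newer MATLAB releases provide in place of princomp and whose first three outputs have the same meaning. The names mu, score_manual, X_back and x_new are introduced only for this example.

```matlab
rng(1);
X = randn(50, 3) * [2 0 0; 0.5 1 0; 0 0.3 0.2];   % hypothetical correlated columns
X = bsxfun(@plus, X, [5 -2 1]);                    % give the columns non-zero means

[coef, score, latent] = pca(X);

mu = mean(X, 1);

% Forward transformation: center the data, then rotate into the new axes
score_manual = bsxfun(@minus, X, mu) * coef;

% Inverse transformation: coef is orthonormal, so coef' equals inv(coef)
X_back = bsxfun(@plus, score * coef', mu);

err_fwd = max(abs(score_manual(:) - score(:)));
err_inv = max(abs(X_back(:) - X(:)));
fprintf('forward error %.2e, inverse error %.2e\n', err_fwd, err_inv);

% A new observation is projected with the mean of the original data
x_new = [4 -1 0.5];
score_new = (x_new - mu) * coef;
```

Using coef' instead of inv(coef) avoids an explicit matrix inverse; the two agree because the columns of coef are orthonormal.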
Previous misconceptions:
1. I used to think that the contribution values in latent referred to the original data; in fact they refer to the processed data. Explanation: you choose PCA because you suspect the dimensions of the raw data are correlated and you want to remove that correlation and reduce the dimensionality. The method therefore does not report the contribution of the original dimensions, since you will be working with the processed data instead (which is also why score and latent are unaffected when the order of the input dimensions is changed).
2. I used to think that PCA reduces the dimensionality automatically. Wrong. PCA reports the contribution values; you decide from them how many dimensions to keep, and then generate the data yourself. (Typically we keep enough components for a cumulative contribution above 85%, or 95% if we want to be stricter.) x*coef(:,1:n) is then the new data you want, where n is the number of dimensions you want to reduce to.
3. PCA works only from the characteristics of the input data; it has nothing to do with how many classes the output has or which class each sample belongs to. If the samples are already labeled by class, PCA will inevitably have some effect on the accuracy of the result. I think that for this kind of data, PCA is about striking a balance between dimensionality and accuracy: the dimensionality should be low enough that computation is not too complex, while the classes remain well separated.
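
Point 2 can be sketched as follows: compute the contribution rates from latent, pick the smallest number of components whose cumulative contribution reaches the threshold, and keep only those columns. The data, the 85% threshold variable and the names contrib, cum_contrib, k and X_reduced are assumptions for illustration; pca is used here in place of princomp, as in newer MATLAB releases.

```matlab
rng(2);
X = randn(200, 5) * diag([5 3 1 0.5 0.1]);   % hypothetical data with unequal spread

[coef, score, latent] = pca(X);

contrib     = latent / sum(latent);          % contribution rate of each component
cum_contrib = cumsum(contrib);               % cumulative contribution rate

threshold = 0.85;
k = find(cum_contrib >= threshold, 1);       % smallest k that reaches the threshold

% Keep the first k components (the centered data times coef(:, 1:k))
X_reduced = score(:, 1:k);

fprintf('keeping %d of %d components (%.1f%% of the variance)\n', ...
        k, numel(latent), 100 * cum_contrib(k));
```

Raising threshold to 0.95 gives the stricter choice mentioned above.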
(7) Reading matrix data
Given a 4x3 matrix, select the first three rows to form a new matrix, and select the first two columns to form another matrix.
A = [1 2 3; 4 5 6; 7 8 9; 10 11 12];
B = A(1:3,:)
% B = [1 2 3; 4 5 6; 7 8 9]
C = A(:,1:2)
% C = [1 2; 4 5; 7 8; 10 11]
Note: ':' means take everything. The part before ',' selects rows and the part after it selects columns. If ':' comes before ',', all rows are taken; if ':' comes after ',', all columns are taken.
B = A(1:3,:): 1:3 selects rows 1 to 3, with all columns.
C = A(:,1:2): 1:2 selects columns 1 to 2, with all rows.
