The Way of Big Data Processing (MATLAB, Part 3)

Source: Internet
Author: User

One: Background

(1) I have recently been working with big data. Going from MB to GB of data is a qualitative leap, and the tools change with it: from Windows to Linux, and from a single machine to multi-node computing on Hadoop.

(2) The problem is how to mine practical information or spot latent patterns in such huge amounts of data. Visualization tools may be essential for this.

(3) A Baidu search turns up plenty of articles on visualization tools. As researchers and programmers, though, we may want more than that: to abstract a mathematical model that describes and characterizes the real-world phenomenon well.

(4) The workflow is Python (data cleansing and processing) + MATLAB (model analysis), or C++/Java/Hadoop (data cleansing and processing) + MATLAB (model analysis).

(5) For background, see the earlier posts on processing big data with C++ fstream and string, and The Way of Big Data Processing (MATLAB, Part 2).

(6) Programmers tend to look down on people who work in MATLAB, usually because they do not understand MATLAB well enough. The name MATLAB combines the words "matrix" and "laboratory" (matrix laboratory). For computations on matrices (which are really numerical arrays), it is one of the quickest and easiest tools to work with. MATLAB can perform matrix operations, plot functions and data, implement algorithms, create user interfaces, interface with programs written in other languages, and more.

Two: MATLAB Notes (Traversing Folders, Reassembling Matrices, PCA)

(1) save(toFileName, 'ANS', '-ascii') saves the matrix ANS as ASCII text to the specified path toFileName.

(2) num2str(num) converts a number to a string.

(3) strcat(rootPath, num2str(i), '\*.csv') concatenates strings; it is used here to build the absolute path pattern for the CSV files.

(4) [coef, score, latent, t2] = princomp(data); performs principal component analysis. latent holds the variance of each principal component, sorted from large to small; dividing each entry by sum(latent) gives its contribution rate. score is the data expressed in the new coordinates, with its columns in the same order.
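
The helper calls in (1) to (3) can be tried together in a minimal sketch. The folder and file names below (tempdir, 'demo_', '.txt') are made up for illustration and do not come from the original program; fullfile is used instead of a hard-coded '\' so the sketch also runs outside Windows.

```matlab
% Minimal sketch: build a file name from a number, save a matrix as ASCII,
% and read it back. All paths here are illustrative assumptions.
rootPath = tempdir;                              % stand-in for the real root folder
i = 3;                                           % stand-in for the loop index
fileName = strcat('demo_', num2str(i), '.txt');  % gives 'demo_3.txt'
toFileName = fullfile(rootPath, fileName);       % absolute output path

ANS = [1 2 3; 4 5 6];                            % matrix to store
save(toFileName, 'ANS', '-ascii');               % one text row per matrix row

X = load(toFileName);                            % a numeric text file loads as a matrix
delete(toFileName);                              % clean up the temporary file
```

Note that '-ascii' writes a limited number of digits, so values read back may differ from the originals in the last decimal places.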

Three: PCA explanation

(1) Feature extraction maps high-dimensional features to a low-dimensional space through some function, producing new features. PCA is a common feature extraction method.

(2) Components are kept until the cumulative contribution rate reaches 95% (85% or more is also acceptable when the requirements are not strict); the remaining dimensions are dropped. The final dimension therefore depends on the contribution rates: if the first two components already reach 95%, the data can be reduced to 2 dimensions, that is, the first two columns of score alone can represent the original data.
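
The cumulative rule above can be sketched in a few lines. The values in latent are made up for illustration, and the variable names threshold, contribution, cumulative, and k are assumptions introduced here.

```matlab
% Sketch: choose the number of components k from the contribution rates.
% latent holds made-up component variances, sorted from large to small.
latent = [7; 2.6; 0.4];
threshold = 0.95;                         % use 0.85 for a looser requirement

contribution = latent / sum(latent);      % contribution rate of each component
cumulative = cumsum(contribution);        % accumulated contribution rate
k = find(cumulative >= threshold, 1);     % smallest count that reaches the threshold
% newData = score(:, 1:k) would then keep only the first k columns of score.
```

With these sample values the first two components together account for 96% of the variance, so k comes out as 2.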

(3) PCA algorithm steps:
Suppose there are m records of n-dimensional data.
1) Arrange the original data by column into a matrix X with n rows and m columns
2) Zero-center each row of X (each row represents one attribute field) by subtracting the mean of that row
3) Compute the covariance matrix C = \frac{1}{m} X X^\mathsf{T}
4) Find the eigenvalues of the covariance matrix and the corresponding eigenvectors
5) Arrange the eigenvectors as the rows of a matrix, from top to bottom in order of decreasing eigenvalue, and take the first k rows to form the matrix P
6) Y = PX is the data after dimensionality reduction to k dimensions
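
Steps 1) to 6) can be followed line by line with base functions only. This is a minimal sketch on made-up data, not the code used in the program; the variable names lambda and order are assumptions, and the symmetrization line is an added safeguard that is not part of the listed steps.

```matlab
% Sketch of steps 1) to 6) with base functions only (illustrative data).
data = [2.5 2.4 1.2; 0.5 0.7 0.3; 2.2 2.9 1.1; 1.9 2.2 0.8; 3.1 3.0 1.6];
k = 2;                                   % target dimension

X = data';                               % 1) n-by-m: one record per column
m = size(X, 2);
X = bsxfun(@minus, X, mean(X, 2));       % 2) subtract each row's mean
C = (X * X') / m;                        % 3) covariance matrix
C = (C + C') / 2;                        %    guard against rounding asymmetry
[V, D] = eig(C);                         % 4) eigenvectors (columns of V) and eigenvalues
[lambda, order] = sort(diag(D), 'descend');
P = V(:, order(1:k))';                   % 5) top-k eigenvectors as the rows of P
Y = P * X;                               % 6) k-by-m reduced data
```

The steps normalize by m, whereas MATLAB's built-in variance and covariance functions normalize by m - 1 by default. The eigenvectors are the same either way; only the scale of the eigenvalues differs.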

Four: Program (with detailed comments)

clc; clear all; close all;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
for i = 1:7
    % Source folder path
    rootPath = 'G:\zyp_thanks\metro_test\resultmergeodbyday_6\';
    % Output path
    toRootPath = 'G:\zyp_thanks\metro_test\resultMergeODByDay_6_zhengyu\';
    % Source file pattern (gets files of the specified type)
    srcPattern = strcat(rootPath, num2str(i), '\*.csv');
    % Output folder for this index
    toPath = strcat(toRootPath, num2str(i), '\');
    % Create the output folder
    mkdir(toPath);
    % Read the list of files of the specified type in the directory;
    % returns a struct array. Replace the path with your own.
    dirs = dir(srcPattern);
    dataDir = strcat(rootPath, num2str(i), '\');   % data directory
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    % Convert the struct array to a cell array; the transpose arranges
    % the file names in a column.
    dircell = struct2cell(dirs)';
    filenames = dircell(:, 1);   % the first column holds the file names
    [m, n] = size(filenames);
    for j = 1:m   % j, so the inner loop does not overwrite the folder index i
        strFileName = [dataDir filenames{j}];
        toFileName = [toPath filenames{j}];
        % fprintf('File %d: %s\n', j, strFileName);
        X = load(strFileName);
        % A and B hold the columns that are kept unchanged; used below
        A = X(:, 1:2);
        B = X(:, 6:7);
        % Three-column block used as the PCA input
        data = X(:, 3:5);
        % PCA (princomp is the older interface; newer releases provide pca)
        [coef, score, latent, t2] = princomp(data);
        newData = score(:, 1:2);   % first 2 columns
        ANS = [A newData B];
        save(toFileName, 'ANS', '-ascii');
    end
end
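
In newer MATLAB releases princomp has been superseded by pca from the Statistics and Machine Learning Toolbox. Where neither is at hand, the per-file transform can be sketched with base functions only. The 7-column matrix below is made-up sample data, and the SVD route is a swapped-in alternative rather than the code used above.

```matlab
% Sketch: reduce columns 3:5 to two principal components and reassemble.
X = [1 10 2.0 1.9 0.5 7 70;
     2 20 0.4 0.6 0.2 8 80;
     3 30 1.8 2.1 0.7 9 90;
     4 40 1.1 0.9 0.4 6 60;
     5 50 2.9 3.2 1.0 5 50];             % made-up 7-column sample

A = X(:, 1:2);                           % columns kept unchanged
B = X(:, 6:7);                           % columns kept unchanged
data = X(:, 3:5);                        % columns to transform

centered = bsxfun(@minus, data, mean(data, 1));   % subtract column means
[~, S, coef] = svd(centered, 'econ');             % columns of coef: principal axes
score = centered * coef;                          % data in the new coordinates
latent = diag(S).^2 / (size(data, 1) - 1);        % variance of each component

newData = score(:, 1:2);                 % first two components
ANS = [A newData B];                     % 6 columns instead of 7
```

The sign of each column of score is arbitrary, so results may differ from princomp by a factor of -1 per column while describing the same components.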
