PCA transformation based on gdal (Principal Component Analysis)

Source: Internet
Author: User

Principal Component Analysis (PCA) is a multivariate statistical analysis method that uses linear transformation to select a small number of important variables. It is also called Main Component analysis. In practice, many variables (or factors) related to this issue are often proposed for comprehensive analysis, because each variable reflects certain information of this topic to varying degrees. Principal component analysis is first introduced by K. Pearson to non-random variables, and then H. huotlin will promote this method to random vectors. The information size is usually measured by the sum of squares or variance of the deviation. For remote sensing images, each variable is a band. Principal component analysis is a commonly used analytical method in remote sensing image processing. It is useful for extracting effective information, especially for hyperspectral data.

Principal component analysis mainly refers to the statistical method that converts multiple original indicators into a few comprehensive indicators. That is to say, use as few indicators as possible to reflect as many indicators as possible, and these few indicators are independent.

If there are P indicators, x1, x2,..., XP, they are combined into M (<p) indicators, Z1, Z2,..., ZM. The geometric meaning of principal component transformation is to find out the problem of the spindle of the elliptical body in the p-dimensional space. It is easily known from the mathematics that they are x1, x2 ,..., feature vectors corresponding to m large feature values in the XP correlation matrix.

The general steps for PCA are as follows:

1: The original data is normalized (standardized). The mean value and variance of the normalized data are 0 and 1 (I personally think that the dimensions of all the bands are the same for remote sensing data, you can skip the first step)

The normalization equation is Xn (A, j) = (X (A, j)-xjm)/σ J, where Xn (A, j) is the normalized data, X (A, j) is the original data, xjm is the mean of the original data, and σ J is the standard deviation of the original data xjm = sum (X (A, j )) /n σ J * σ J = sum (X (A, j)-xjm) * (X (A, j)-xjm)/n

2: Calculate the correlation coefficient matrix R

Correlation coefficient calculation equation: R (I, j) = sum (X (A, I)-xim) * (X (A, j)-xjm )) /(σ I * σ J * n) Where: R (I, j) is the matrix data of correlation coefficient, R (I, j) = Cov (I, j) /(σ I * σ J)

3: Calculate feature values and feature vectors

Calculate the feature value based on the feature Equation | r-λ I | = 0. The solution is RN * λ P + rn-1 * λ p-1 +... + R1 * λ + R0 = 0 feature polynomials, evaluate λ P, λ p-1 ,..., lambda, and arrange Lambda I by size, that is, Lambda> Lambda 1>...> λ P> 0 lists feature vectors LK = [lk1, lk2 ,..., LKP] when there are many trlk = λ LK variables, calculate the feature value and feature vector using the yake ratio method.

4: Calculate the contribution rate λ K/sum (λ I) and cumulative contribution rate sum (λ K/sum (λ I). Generally, the cumulative contribution rate reaches 85% ~ The feature value of 95 is the corresponding component of λ 1, λ 2,..., λ M (<p ).

5: Calculate the Principal Component Load P (zk, xi) = SQRT (λ K) * lki (I =,..., P; k =,..., m)

6. Calculate the principal component score based on the formula below to obtain the principal component matrix.

Z1 = l11*xn1 + l12*xn2 + ... + l1p*xnpZ2 = l21*xn1 + l22*xn2 + ... + l2p*xnp...  ...       ...       ...   ...Zm = lm1*xn1 + lm2*xn2 + ... + lmp*xnp

For the Matrix to solve the feature value and feature vector reference previous article, address: http://blog.csdn.net/liminlu0314/article/details/8957155

The following is the main code of gdal for PCA processing. As mentioned earlier, you may not compile the Code directly. If you have read the code, you will certainly be able to debug it yourself. The inverse transformation of the principal component is similar to that of the positive transformation, except that the matrix of the positive transformation is calculated by data, and the inverse transformation matrix is obtained by using the matrix of the positive transformation. The two are just a little different, and the others are exactly the same. The Code was written several years ago. There should be no bugs. First, the header file:

/*************************************** ************************************* Time: 2010-05-14 * Project: remote sensing algorithm * purpose: PCA transform * Author: Li minlu * copyright (c) 2010, liminlu0314@gmail.com * describe: provides the PCA conversion algorithm ************************************ **************************************** */# ifndef pcatransform_h # define pcatransform_h/*** \ file pcatransform. H * @ brief image PCA transformation ** for image PCA transformation */# include "martixalgo. H "// used to solve class gdaldataset such as feature values in a matrix;/*** \ class cpcatransform pcatransform. H * @ brief Principal Component transformation Principal Component Analysis ** is used to complete the principal component transformation of data Principal Component Analysis (PCA) ** principle of the principal component transformation algorithm is: \ n mainly refers to the statistical method that converts multiple original indicators into a few comprehensive indicators. * That is to say, use as few indicators as possible to reflect as many indicators as possible, and these few indicators * are independent. \ N if there are P indicators, x1, x2,..., XP, they are combined into M (<p) indicators, Z1, Z2,..., ZM. * The geometric meaning of principal component transformation is to find out the spindle problem of the elliptical body in the p-dimensional space. It is easily known from mathematics that * They are x1, x2 ,..., feature vectors corresponding to m large feature values in the XP correlation matrix. The general steps for \ n ** Principal Component transformation are as follows: \ n ** 1: normalize the original data. After normalization, the mean value of the normalized data is 0, the variance is 1 (I personally think that for * remote sensing data, the dimensions of all bands are the same. It is also possible to skip the first step without normalization) \ n * \ verbatim normalization equation: Xn (A, j) = (X (A, j)-xjm)/σ J Where: Xn (A, J) for normalized data, X (A, j) is the original data, xjm is the mean of the original data, and σ J is the standard deviation of the original data xjm = sum (X (A, j )) /n σ J * σ J = sum (X (A, j)-xjm) * (X (A, j)-xjm)/n \ endverbatim ** 2: calculation of correlation coefficient matrix R * \ verbatim correlation coefficient calculation equation: R (I, j) = sum (X (A, I)-xim) * (X (, j)-xjm)/(σ I * σ J * n) Where: R (I, j) is the matrix data of correlation coefficient, R (I, J) = Cov (I, j)/(σ I * σ J) \ endverbatim ** 3: calculate the feature value and feature vector * \ verbatim according to the feature Equation | r-λ I | = 0 calculate the feature value I .e. the solution: rn * λ P + rn-1 * λ p-1 +... + R1 * λ + R0 = 0 feature polynomials, evaluate λ P, λ p-1 ,..., lambda, and arrange Lambda I by size, that is, Lambda> Lambda 1>...> λ P> 0 lists feature vectors LK = [lk1, lk2 ,..., LKP] when there are many trlk = λ LK variables, calculate the feature value and feature vector \ endverbatim ** 4: Calculate the contribution rate λ K/sum (λ I) and the cumulative contribution rate sum (λ K/sum (λ I). Generally, the cumulative * contribution rate reaches 85% ~ 95% feature value λ 1, λ 2 ,..., λ M (<p) corresponds to the component ** 5: Calculate the Principal Component Load * P (zk, xi) = SQRT (λ K) * lki (I = 1, 2 ,..., p; k = 1, 2 ,..., m) ** 6: Calculate the principal component score based on the formula below and obtain the principal component matrix * \ verbatimz1 = L11 * xn1 + L12 * xn2 +... + l1p * xnpz2 = l21 * xn1 + l22 * xn2 +... + l2p * xnp ............... ZM = lm1 * xn1 + lm2 * xn2 +... + LMP * xnp \ endverbatim */class cpcatransform {public:/*** @ brief constructor, used for Principal Component transformation * @ Param pszsrcfile file path to be transformed * @ Param pprocess progress pointer */cpcatransform (const C Har * pszsrcfile, cprocessbase * pprocess = NULL);/*** @ brief destructor */~ Cpcatransform (); /*** @ brief PC transformation * @ Param pszpcafile output path of the file after principal component transformation * @ Param ibandcount the number of bands of the file after principal component Transformation (default value: All-1) * @ Param biscovariance is calculated using the correlation coefficient or variance-covariance matrix. The default value is the covariance matrix * @ Param bislikeenvi. The calculated result is output in the ENVI mode, that is, all data minus the mean, make the average value of each band 0 * @ Param pszformat output file format. The default format is geotiff format * @ return code */INT executepct (const char * pszpcafile, int ibandcount =-1, bool biscovariance = true, bool bislikeenvi = true, const char * pszforma T = "gtiff "); /*** @ brief PC inverse transformation * @ Param pszpcafile output path of the file after principal component transformation * @ Param pmmatrix feature vector matrix * @ Param pvmeanvector mean vector of the original image, it can be null * @ Param pszformat output file format. The default format is geotiff * @ return code */INT executeinversepct (const char * pszpcafile, const matrix * pmmatrix, const vector * pvmeanvector = NULL, const char * pszformat = "gtiff "); /*** @ brief obtain the transformation matrix and vector of PCA transformation * @ Param meigenvectors feature vector matrix * @ Param vmeanvalues mean vector */ Void getpcamatrix (Matrix & meigenvectors, vector & vmeanvalues); Private:/*** @ brief data preprocessing, * @ return code */INT preprocessdata ();/*** @ brief calculates the covariance matrix and correlation coefficient matrix R, step 2 * @ return code */INT calccovariancemartix ();/*** @ brief calculates the feature value and feature vector. Step 3 calculates the contribution rate and cumulative contribution rate, step 4 */void calceigenvalueandeigenvector ();/*** @ brief calculates the principal component score and writes it to the file, step 5 and Step 6 * @ Param pszpcafile output the path of the file after the principal component transformation * @ Param ibandcount the number of bands of the file after the principal component Transformation (default: All-1 )* @ Param pszformat output file format * @ return code */INT createpcafile (const char * pszpcafile, int ibandcount, const char * pszformat);/*** @ brief calculates the principal component score, and write it to the file. Step 5 and Step 6 * @ Param pszpcafile output the path of the file after principal component transformation * @ return code */INT calcsubavg (const char * pszpcafile); Private: cprocessbase * m_pprocess;/* <! Progress pointer */const char * m_pszsrcfile;/* <! File Path to be changed */bool m_biscovariance;/* <! PCA transformation method. True indicates covariance, and false indicates correlation coefficient */gdaldataset * m_psrcds;/* <! File pointer to be converted */INT m_ibandcount;/* <! Number of bands */double * m_pbandmean;/* <! Band mean */double * m_pbandstad;/* <! Band standard deviation */double * m_prelativity;/* <! Element */symmmatrix m_relmatrix;/* <! Correlation Coefficient Matrix */vector m_eigenvalues;/* <! The feature value of the correlation coefficient matrix */matrix m_eigenvectors;/* <! Feature vector of the correlation coefficient matrix */matrix m_pem;/* <! Construct the selected feature vector matrix */}; # endif/* pcatransform_h */

The next step is the implementation of the PCA transformation class. The implementation steps are the same as described above and can be understood in comparison to the above steps. For the function code section, the parameters and class variables of each function in the header file are described in detail, for more information about matrix operations and definition in the code, see the description in the MTL library. The following code indicates that after the PCA Transformation Result of ENVI is subtracted from the mean value of all bands, the mean value of the image after PCA transformation processed by ENVI is 0, ERDAS does not process this step. The above code has a parameter called bislikeenvi. If it is set to true, the processing result is the same as that of ENVI.

/*************************************** ************************************* Time: 2010-05-14 * Project: remote sensing algorithm * purpose: PCA transform * Author: Li minlu * copyright (c) 2010, liminlu0314@gmail.com * describe: provides the PCA conversion algorithm ************************************ **************************************** */# include "pcatransform. H "# include" gdal_priv.h "// you only need the gdal header file. cpcatransform: cpcatransform (const char * pszsrcfile, cproce Ssbase * pprocess) {m_pbandmean = NULL; m_pbandstad = NULL; m_prelatiance = NULL; m_psrcds = NULL; m_biscovariance = true; Signature = pszsrcfile; m_pprocess = pprocess;} cpcatransform ::~ Cpcatransform () {If (m_psrcds! = NULL) gdalclose (gdaldataseth) m_psrcds); release (m_pbandmean); release (m_pbandstad); release (m_prelatiister);} int identifier: preprocessdata () {gdalallregister (); m_psrcds = (gdaldataset *) gdalopen (m_pszsrcfile, ga_readonly); If (m_psrcds = NULL) {If (m_pprocess! = NULL) m_pprocess-> setmessage ("the input file cannot be opened! "); Return re_filenotexist;} m_ibandcount = m_psrcds-> getrastercount (); m_pbandmean = new double [m_ibandcount]; m_pbandstad = new double [m_ibandcount]; for (INT I = 1; I <= m_ibandcount; I ++) // obtain the mean and standard deviation of each band {double dmaxvalue, dminvalue; m_psrcds-> getrasterband (I)-> computestatistics (false, & dminvalue, & dmaxvalue, m_pbandmean + (I-1), m_pbandstad + (I-1), null, null);} return re_success;} int cpcatransform: calccov Ariancemartix () {If (m_psrcds = NULL) {If (m_pprocess! = NULL) m_pprocess-> setmessage ("the input file cannot be opened! "); Return re_filenotexist;} int iwidth = m_psrcds-> getrasterxsize (); int iheight = m_psrcds-> getrasterysize (); int iimagesize = iwidth * iheight; int ielementnum = m_ibandcount * (m_ibandcount + 1)/2; // The number in the correlation coefficient matrix. Only the diagonal matrix is saved, because the real symmetric array int ielementindex = 0; // index m_prelativity = new double [ielementnum] used to traverse the element in the correlation coefficient matrix; // allocate the size of the correlation coefficient matrix symmmatrix covmatrix (m_prelativity, m_ibandcount, m_ibandcount ); // correlation coefficient matrix for (INT I1 = 1; I 1 <= m_ibandcount; I1 ++) {for (INT I2 = 1; I2 <= m_ibandcount; I2 ++) {If (I2 <I1) continue; if (I1 = I2) // if it is the same band, {If (! M_biscovariance) m_prelativity [ielementindex] = 1.0; // correlation matrix elsem_prelativity [ielementindex] = m_pbandstad [i1-1] * m_pbandstad [i1-1]; // variance-covariance matrix ielementindex +; continue;} gdalrasterband * frequency = m_psrcds-> getrasterband (I1); gdalrasterband * frequency = m_psrcds-> getrasterband (I2); dt_64f * pbuff1 = new dt_64f [iwidth]; dt_64f * pbuff2 = new dt_64f [iwidth]; double dtemp = 0.0; For (Int J = 0; j <iheight; j ++) // row {Ptrbandi1-> rasterio (gf_read, 0, J, iwidth, 1, pbuff1, iwidth, 1, gdt_float64, 0, 0 ); // read the first-band data block ptrbandi2-> rasterio (gf_read, 0, J, iwidth, 1, pbuff2, iwidth, 1, gdt_float64, 0, 0 ); // read the second band data block for (INT I = 0; I <iwidth; I ++) dtemp + = (pbuff1 [I]-m_pbandmean [i1-1]) * (pbuff2 [I]-m_pbandmean [i2-1]);} release (pbuff1); release (pbuff2); m_prelati.pdf [ielementindex] = dtemp/iimagesize; // variance-covariance matrix if (! M_biscovariance) m_prelativity [ielementindex] = m_prelativity [ielementindex]/(m_pbandstad [i1-1] * m_pbandstad [i2-1]); // correlation coefficient matrix ielementindex ++;} m_relmatrix = covmatrix; calceigenvalueandeigenvector (); // calculate the feature value and feature vector return re_success;} void cpcatransform: calceigenvalueandeigenvector () {/************************************** ********************************** // matrix feature value and feature vector *//*********************** **************************************** * *******/Matrix (m_ibandcount, m_ibandcount); For (INT I = 0; I <m_ibandcount; I ++) {for (Int J = 0; j <m_ibandcount; j ++) matrix (I, j) = m_relmatrix (I, j);} vector eigenvalues (m_ibandcount); matrix eigenvectors (m_ibandcount, m_ibandcount); vector contritors (m_ibandcount); vector acccontribute (m_ibandcount); Aggregate (matrix, eigenvalues, eigenvectors, & contribu Te, & acccontries, 0.0001); m_eigenvalues = eigenvalues; m_eigenvectors = eigenvectors;} int attributes: createpcafile (const char * pszpcafile, int ibandcount, const char * pszformat) {If (m_psrcds = NULL) {If (m_pprocess! = NULL) m_pprocess-> setmessage ("the input file cannot be opened! "); Return re_filenotexist;} int iwidth = m_psrcds-> getrasterxsize (); int iheight = m_psrcds-> getrasterysize (); int inewbandcount = m_ibandcount; If (ibandcount> 0) inewbandcount = ibandcount; matrix PEM (m_ibandcount, inewbandcount); // construct the selected feature vector matrix for (INT it = 0; it <inewbandcount; it ++) {for (INT Jt = 0; JT <m_ibandcount; JT ++) PEM (JT, it) = m_eigenvectors (JT, it);} m_pem = PEM; // construct the selected feature vector matrix return linearcombinati On (m_pszsrcfile, pszpcafile, & m_pem, null, pszformat, m_pprocess);} int cpcatransform: calcsubavg (const char * pszpcafile) {gdalallregister (); gdaldataset * PPS = (gdaldataset *) gdalopen (pszpcafile, ga_update); If (PDS = NULL) {If (m_pprocess! = NULL) m_pprocess-> setmessage ("the file cannot be opened! "); Return re_filenotsupport;} int iwidth = PPS-> getrasterxsize (); int iheight = PDS-> getrasterysize (); int ibandcount = PDS-> getrastercount (); if (m_pprocess! = NULL) m_pprocess-> setstepcount (iheight * ibandcount); dt_32f * pbuff = new dt_32f [iwidth]; for (INT it = 1; it <= ibandcount; it ++) {gdalrasterband * pband = PDS-> getrasterband (it); double dfmin = 0.0, dfmax = 0.0, dfmean = 0.0, dfstddev = 0.0; pband-> computestatistics (false, & dfmin, & dfmax, & dfmean, & dfstddev, null, null); For (INT I = 0; I <iheight; I ++) {pband-> rasterio (gf_read, 0, i, iwidth, 1, pbuff, iwidth, 1, Gdt_float32, 0, 0); For (Int J = 0; j <iwidth; j ++) pbuff [J] = pbuff [J]-(float) dfmean; pband-> rasterio (gf_write, 0, I, iwidth, 1, pbuff, iwidth, 1, gdt_float32, 0, 0); If (m_pprocess! = NULL) m_pprocess-> stepit ();} // end} gdalclose (gdaldataseth) PDS); release (pbuff); string STR = string (pszpcafile) + ". aux. XML "; remove (Str. c_str (); Return variable;} int variable: executepct (const char * pszpcafile, int ibandcount, bool biscovariance, bool bislikeenvi, const char * pszformat) {m_biscovariance = biscovariance; if (m_pprocess! = NULL) m_pprocess-> setmessage ("start to execute Principal Component transformation..."); int irev = preprocessdata (); If (irev! = Re_success) return irev; irev = calccovariancemartix (); If (irev! = Re_success) return irev; irev = createpcafile (pszpcafile, ibandcount, pszformat); If (irev! = Re_success) return irev; If (bislikeenvi) {irev = calcsubavg (pszpcafile); If (irev! = Re_success) return irev;} If (m_pprocess! = NULL) m_pprocess-> setmessage ("computing completed! "); Return re_success;} int cpcatransform: executeinversepct (const char * pszpcafile, const matrix * pmmatrix, const vector * pvmeanvector, const char * pszformat) {If (m_pprocess! = NULL) m_pprocess-> setmessage ("start to execute inverse Principal Component transformation..."); If (pmmatrix = NULL | pszpcafile = NULL) {If (m_pprocess! = NULL) m_pprocess-> setmessage ("the specified file name is empty or the matrix is empty... "); Return re_paramerror;} matrix invmatrix (pmmatrix-> nrows (), pmmatrix-> ncols (); If (! Inversematrix (* pmmatrix, invmatrix) {If (m_pprocess! = NULL) m_pprocess-> setmessage ("the specified matrix does not have an inverse matrix... ");} int irev = linearcombination (m_pszsrcfile, pszpcafile, & invmatrix, pvmeanvector, pszformat, m_pprocess); If (irev! = Re_success) return irev; If (m_pprocess! = NULL) m_pprocess-> setmessage ("computing completed! "); Return re_success;} void cpcatransform: getpcamatrix (Matrix & meigenvectors, vector & vmeanvalues) {meigenvectors = m_eigenvectors; vmeanvalues. resize (m_ibandcount, 0); For (INT I = 0; I <m_ibandcount; I ++) vmeanvalues [I] = m_pbandmean [I];}

Finally, I will list several references of PCa. For PCA, This is a mathematical thing. For image processing, we can treat each band as a set of data, then, find the primary and secondary components of the data. It is actually data.

References:

[1] http://baike.baidu.com/view/45376.htm

[2] http://zh.wikipedia.org/wiki/%E4%B8%BB%E6%88%90%E5%88%86%E5%88%86%E6%9E%90

[3] http://en.wikipedia.org/wiki/Principal_component_analysis

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.