The data used for the following analysis can be downloaded here: http://dl.getdropbox.com/u/308058/blog/raw_data_3_replicates.txt
The data come from a study on gene transfer in butterflies. There are 20 sample butterflies: 10 are local residents (old) and the other 10 are newly migrated individuals (new). Old and new individuals were randomly paired and labeled with different color dyes (555 and nm, respectively). In addition, each gene is sampled three times on each chip, so the data set contains three replicates and ten dual-channel chips. The values are raw signal intensities that have not been normalized.
When you open the data, you will see a lot of "NA" entries. I replaced the blank missing values with NA so that R can fill them in.
Generally speaking, there are three common methods for filling in missing values:
A. Filling with the average: replace the missing value with the gene's average expression value. If there are replicate chips, you can average across the different chips; for time-series chips, you can use interpolation. This method is simple and common, but it does not perform as well as the following two methods.
B. Filling based on SVD (singular value decomposition): put simply, this method fills in missing values using a few basic modes that describe the gene expression profiles.
C. Filling based on KNN (k-nearest neighbors): this method finds other genes whose expression profiles are similar to that of the gene with the missing value, and fills in the missing value from those genes' expression values (weighted by expression-profile similarity). KNN performs best of the three, so the missing values in this data set are filled with KNN.
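To make method A concrete, here is a minimal sketch of row-average filling in R. The toy matrix and the helper name fill_row_mean are made up for illustration and are not part of the butterfly data set or of any package.

```r
# Toy expression matrix: rows are genes, columns are chips (values are made up).
expr <- matrix(c(1.0, 2.0, NA,
                 4.0, NA,  6.0,
                 7.0, 8.0, 9.0),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("g1", "g2", "g3"),
                               c("chip1", "chip2", "chip3")))

# Hypothetical helper: replace each NA with the mean of the
# observed values in the same row (that is, the same gene).
fill_row_mean <- function(m) {
  row_means <- rowMeans(m, na.rm = TRUE)
  missing <- which(is.na(m), arr.ind = TRUE)  # (row, col) positions of the NAs
  m[missing] <- row_means[missing[, "row"]]
  m
}

filled <- fill_row_mean(expr)
print(filled)
```

Here the NA in g1 becomes the mean of 1 and 2, and the NA in g2 becomes the mean of 4 and 6. Rows with no observed value at all would stay missing (NaN), so such genes need to be handled separately.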
The following paper gives a clear comparison of the three methods: Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., and Altman, R. B. (2001), Missing value estimation methods for DNA microarrays, Bioinformatics 17(6): 520-525. It describes KNN as follows:
The KNN-based method selects genes with expression profiles similar to the gene of interest to impute missing values. If we consider gene A that has one missing value in experiment 1, this method would find K other genes, which have a value present in experiment 1, with expression most similar to A in experiments 2-N (where N is the total number of experiments). A weighted average of values in experiment 1 from the K closest genes is then used as an estimate for the missing value in gene A.
In the weighted average, the contribution of each gene is weighted by similarity of its expression to that of gene A.
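The description above can be sketched for a single missing value as follows. This is only an illustration under stated assumptions: Euclidean distance over the remaining experiments and inverse-distance weights are my choices, the helper knn_estimate and the toy values are hypothetical, and this is not the code that the impute package runs internally.

```r
# Toy matrix: rows are genes, columns are experiments (values are made up).
expr <- matrix(c(NA,  2.0, 3.0, 4.0,   # gene A, missing in experiment 1
                 1.0, 2.0, 3.0, 5.0,   # gene B, close to A
                 2.0, 2.0, 4.0, 4.0,   # gene C, close to A
                 9.0, 8.0, 9.0, 9.0),  # gene D, far from A
               nrow = 4, byrow = TRUE,
               dimnames = list(c("A", "B", "C", "D"), paste0("exp", 1:4)))

# Hypothetical helper: estimate m[gene, experiment] from the k nearest genes.
knn_estimate <- function(m, gene, experiment, k) {
  others <- setdiff(colnames(m), experiment)
  candidates <- setdiff(rownames(m), gene)
  # Keep genes that have a value in the target experiment
  # and no missing values in the other experiments.
  usable <- candidates[!is.na(m[candidates, experiment]) &
                         complete.cases(m[candidates, others, drop = FALSE])]
  # Euclidean distance to the gene of interest over the other experiments.
  d <- apply(m[usable, others, drop = FALSE], 1,
             function(x) sqrt(sum((x - m[gene, others])^2)))
  nearest <- names(sort(d))[seq_len(min(k, length(d)))]
  # Assumed weighting: inverse distance (small constant avoids division by zero).
  w <- 1 / (d[nearest] + 1e-8)
  sum(w * m[nearest, experiment]) / sum(w)
}

est <- knn_estimate(expr, "A", "exp1", k = 2)
print(est)
```

With k = 2, genes B and C are the two nearest neighbors of A and are equally distant from it, so the estimate is simply the average of their experiment 1 values.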
Now let's analyze the data.
First, install R (http://www.r-project.org/).
Then download and install the impute package:
> source("http://bioconductor.org/biocLite.R")
> biocLite("impute")
impute is an R package dedicated to filling in missing values with KNN.
Set the current working directory (on Windows, use the R menu bar: File -> Change working directory...; on Linux, use the setwd() function).
Enter the following code in the R console:
library(impute)
# Load the impute package
raw <- read.table('raw_data_3_replicates.txt', header = TRUE)
rawexpr <- raw[, -1]
# Drop the first column, which holds the IDs
if (exists(".Random.seed")) rm(.Random.seed)
# Required. Without this line an error occurs. I do not know the reason -,- any advice is welcome
imputed <- impute.knn(as.matrix(rawexpr), k = 10, rowmax = 0.5, colmax = 0.8, maxp = 1500, rng.seed = 362436069)
# impute.knn() takes a matrix as its first argument. The other arguments are set to their default values.
write.table(imputed$data, file = 'imputed_data.txt')
# write.table() saves the data to a file in the current working directory. The file name is given with file = ''. This step is optional.
imputeddata <- imputed$data
# imputed$data is the matrix of imputed data stored in R.
Now type imputeddata in R to see the filled-in data matrix. Are all the NA values gone?
Detailed documentation for the impute package: http://bioconductor.fhcrc.org/packages/release/bioc/html/impute.html
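Rather than scanning the printed matrix by eye, you can count the remaining missing values. The sketch below uses a small made-up matrix (imputeddata_demo, a hypothetical stand-in) so that it runs on its own; in the session above you would apply the same calls to the imputed matrix.

```r
# Stand-in for the imputed matrix: a small, fully filled matrix (values are made up).
imputeddata_demo <- matrix(c(1.2, 3.4, 5.6, 7.8), nrow = 2,
                           dimnames = list(c("g1", "g2"), c("chip1", "chip2")))

anyNA(imputeddata_demo)            # TRUE if at least one NA remains
sum(is.na(imputeddata_demo))       # total number of NA entries
colSums(is.na(imputeddata_demo))   # number of NA entries per chip (column)
```

If the total is zero, every missing value has been filled.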
All data files: http://files.cnblogs.com/emanlee/R_bioconductor_genechip_data_process.zip
From: http://azaleasays.com/tag/r/