Analysis of Gene Chip Data Using R and Bioconductor (2): Missing Value Imputation

Source: Internet
Author: User

The data used in the following analysis can be downloaded from http://dl.getdropbox.com/u/308058/blog/raw_data_3_replicates.txt. The data come from a study of gene transfer in butterflies covering 20 individuals: 10 long-established local individuals (old) and 10 newly migrated individuals (new). Old and new individuals were randomly paired and labeled with dyes of different colors (555 and … nm, respectively). In addition, each gene is spotted three times on each chip, so the data set contains three replicates on ten dual-channel chips. The values are raw signal intensities that have not been normalized.

When you open the data you will see a lot of "NA" entries. I replaced the blank (missing) values with NA so that the missing values can be filled in with R.
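As a quick illustration of how R treats those NA entries, here is a minimal sketch that reads a tiny made-up excerpt and counts the missing cells. The three genes, the column names and the numbers are hypothetical and are not taken from raw_data_3_replicates.txt.

```r
# Hypothetical three-gene excerpt standing in for the real data file
txt <- "ID chip1 chip2 chip3
g1 120.5 NA 98.1
g2 NA NA 310.0
g3 45.2 51.7 49.9"

# read.table() treats the string "NA" as a missing value by default
demo <- read.table(text = txt, header = TRUE)

# Total number of missing cells in the expression columns
n_missing <- sum(is.na(demo[, -1]))

# Number of missing cells per gene (row)
missing_per_gene <- rowSums(is.na(demo[, -1]))
names(missing_per_gene) <- demo$ID

print(n_missing)
print(missing_per_gene)
```

The same two calls, sum(is.na(...)) and rowSums(is.na(...)), can be applied to the real table once it has been read in.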

Generally speaking, there are three common methods for filling in missing values:

A. Fill with the average expression value of the gene. If there are several replicate chips, you can take the average across the chips; for time-series chips, you can use interpolation. This method is simple and widely used, but it does not perform as well as the following two methods.

B. Filling based on SVD (singular value decomposition): put simply, this method fills in missing values using a small number of basic patterns that describe the gene expression profiles.

C. Filling based on KNN (k-nearest neighbors): this method looks for other genes whose expression profiles are similar to that of the gene with the missing value, and fills in the missing value from the expression values of those genes, weighted by expression profile similarity. KNN performs best of the three methods, so the missing values in this data set are filled in with KNN.
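To make methods A and B more concrete, here is a minimal base R sketch on a made-up matrix. The function names fill_row_mean and fill_svd are hypothetical, and the SVD part is a simplified low-rank iteration written for illustration; it is not the exact SVD procedure evaluated in the literature.

```r
# Toy expression matrix: rows are genes, columns are chips (made-up numbers)
expr <- matrix(c(1.0, 2.0, NA, 4.0,
                 5.0, NA, 7.0, 8.0,
                 2.0, 4.0, 6.0, 8.0),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("g1", "g2", "g3"), paste0("chip", 1:4)))

# Method A: replace each NA with the mean of the observed values of the same gene
fill_row_mean <- function(m) {
  row_means <- rowMeans(m, na.rm = TRUE)
  idx <- which(is.na(m), arr.ind = TRUE)  # (row, col) positions of missing cells
  m[idx] <- row_means[idx[, "row"]]
  m
}

# Method B (simplified): start from the row-mean fill, then repeatedly replace
# the missing cells with their values in a rank-k SVD reconstruction
fill_svd <- function(m, k = 1, n_iter = 50) {
  miss <- is.na(m)
  cur <- fill_row_mean(m)
  for (i in seq_len(n_iter)) {
    s <- svd(cur)
    low_rank <- s$u[, 1:k, drop = FALSE] %*%
      diag(s$d[1:k], k, k) %*%
      t(s$v[, 1:k, drop = FALSE])
    cur[miss] <- low_rank[miss]  # observed cells are never overwritten
  }
  cur
}

filled_mean <- fill_row_mean(expr)
filled_svd <- fill_svd(expr, k = 1)
print(filled_mean)
print(filled_svd)
```

In both functions only the missing cells are changed; the choice of k and of the number of iterations is an assumption made for this toy example.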

For a comparison of the three methods above, this paper gives a clear description: Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., and Altman, R. B. (2001), Missing value estimation methods for DNA microarrays, Bioinformatics 17(6): 520-525. It describes KNN as follows:

The KNN-based method selects genes with expression profiles similar to the gene of interest to impute missing values. If we consider gene A that has one missing value in experiment 1, this method would find K other genes, which have a value present in experiment 1, with expression most similar to A in experiments 2-N (where N is the total number of experiments). A weighted average of values in experiment 1 from the K closest genes is then used as an estimate for the missing value in gene A.
In the weighted average, the contribution of each gene is weighted by similarity of its expression to that of gene A.
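
The procedure in the quoted passage can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the matrix is made up, knn_estimate is a hypothetical helper, Euclidean distance is used as the similarity measure, and the inverse-distance weights are my own choice. It is not the code used inside the impute package.

```r
# Toy matrix: rows are genes, columns are experiments (made-up numbers)
expr <- rbind(
  A  = c(NA, 2.0, 3.0, 4.0),
  g1 = c(1.0, 2.1, 3.1, 4.1),
  g2 = c(1.4, 2.0, 2.8, 4.0),
  g3 = c(9.0, 8.0, 7.0, 1.0),
  g4 = c(NA, 2.0, 3.0, 4.0)  # no value in experiment 1, so not a candidate
)

knn_estimate <- function(m, gene, col, k = 2) {
  # Candidate neighbours: other genes with a value present in the target column
  candidates <- setdiff(rownames(m)[!is.na(m[, col])], gene)
  other_cols <- setdiff(seq_len(ncol(m)), col)
  # Euclidean distance to the target gene over the remaining experiments
  d <- sapply(candidates, function(g) {
    sqrt(sum((m[g, other_cols] - m[gene, other_cols])^2, na.rm = TRUE))
  })
  nearest <- names(sort(d))[seq_len(min(k, length(d)))]
  # Closer genes get larger weights (inverse distance, an assumed weighting)
  w <- 1 / (d[nearest] + 1e-6)
  sum(w * m[nearest, col]) / sum(w)
}

estimate <- knn_estimate(expr, gene = "A", col = 1, k = 2)
print(estimate)
```

Here g1 and g2 are the two nearest neighbours of gene A, so the estimate falls between their experiment 1 values and sits closer to that of g1, the more similar gene.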

 

Now let's analyze the data.

First, install R (http://www.r-project.org/).

Then download and install the impute package:

> source("http://bioconductor.org/biocLite.R")

> biocLite("impute")

> # On current Bioconductor releases, biocLite() has been replaced by BiocManager::install("impute")

impute is an R package dedicated to filling in missing values with KNN.

Set the current working directory (on Windows, through the R menu bar: File -> Change working directory...; on Linux, with the setwd() function).

Enter the following code in the R console:

library(impute)
# Load the impute package
raw <- read.table('raw_data_3_replicates.txt', header = TRUE)

rawexpr <- raw[, -1]
# Remove the first column, which holds the IDs
if (exists(".Random.seed")) rm(.Random.seed)

# Required. Without this line an error occurs. I do not know the reason, so any advice is welcome
imputed <- impute.knn(as.matrix(rawexpr), k = 10, rowmax = 0.5, colmax = 0.8, maxp = 1500, rng.seed = 362436069)
# impute.knn() takes a matrix as its first argument. The other arguments are left at their default values.
write.table(imputed$data, file = 'imputed_data.txt')

# write.table() saves the data to a file in the current working directory. The file name is given with file = ''. This step is optional.
imputeddata <- imputed$data
# imputed$data is the matrix of imputed data stored in R.
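
The rowmax and colmax arguments are limits on the fraction of missing values per row and per column. According to the impute.knn() documentation, rows with more than rowmax missing are filled with the per-sample mean instead of KNN, and the function stops with an error if any column has more than colmax missing. Before calling impute.knn(), it can therefore help to look at those fractions. The sketch below uses a made-up matrix m as a stand-in for as.matrix(rawexpr).

```r
# Made-up matrix standing in for as.matrix(rawexpr)
m <- matrix(c(1, NA, NA, NA,
              2,  3,  4, NA,
              5,  6,  7,  8),
            nrow = 3, byrow = TRUE)

row_missing <- rowMeans(is.na(m))  # fraction of missing values per gene (row)
col_missing <- colMeans(is.na(m))  # fraction of missing values per chip (column)

rows_over <- which(row_missing > 0.5)  # rows above the rowmax = 0.5 limit
cols_over <- which(col_missing > 0.8)  # columns above the colmax = 0.8 limit

print(row_missing)
print(col_missing)
```

In this toy matrix only the first row exceeds the rowmax limit, and no column exceeds the colmax limit.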

Now enter imputeddata in R to see the filled data matrix. Have all the NA values disappeared?

Detailed documentation for the impute package is available at http://bioconductor.fhcrc.org/packages/release/bioc/html/impute.html
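
Rather than checking by eye, the result can also be verified programmatically. The following is a minimal sketch with a hypothetical helper, check_imputation, and two small made-up matrices standing in for the data before and after filling.

```r
# Hypothetical helper: compare a matrix before and after imputation
check_imputation <- function(before, after) {
  observed <- !is.na(before)
  list(
    remaining_na = sum(is.na(after)),  # should be 0
    observed_unchanged = isTRUE(all.equal(before[observed], after[observed]))
  )
}

# Made-up stand-ins for the raw matrix and the filled matrix
before <- matrix(c(1, NA, 3, 4, 5, NA), nrow = 2)
after  <- matrix(c(1, 2.5, 3, 4, 5, 4.5), nrow = 2)

result <- check_imputation(before, after)
print(result)
```

With the real data, the call would be check_imputation(as.matrix(rawexpr), imputeddata).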

All data files: http://files.cnblogs.com/emanlee/R_bioconductor_genechip_data_process.zip

From: http://azaleasays.com/tag/r/
