Gene Chip Data Analysis Using R and Bioconductor (IV): Chip Normalization

Previous article: Gene Chip Data Analysis Using R and Bioconductor (III): Computing the Median

The purpose of normalization is to make measurements from different samples, groups, or experimental conditions comparable with each other by eliminating non-experimental differences between them. Non-experimental differences may come from the samples themselves, sample preparation, hybridization, and signal processing.

There are many normalization methods, and they differ between oligonucleotide chips (single channel, represented by Affymetrix) and cDNA chips (dual channel, red and green dyes). The following discussion is about dual-channel chips, although some of it may also apply to single-channel chips.

Normalization is usually divided into "bulk" normalization and control-based normalization. The former assumes that only a small fraction of genes change expression between conditions, so all genes are used as the reference for normalization. The latter uses control genes whose expression is considered constant as the reference. Neither assumption always holds, so the choice has to be made case by case.

"Bulk" normalization is subdivided into many methods, the simplest of which is global normalization. This method assumes that the red signal intensity is proportional to the green signal intensity, that is, R = kG (R: red signal intensity, G: green signal intensity, k: a constant). Normalization therefore shifts the base-2 log ratio by a constant c = log2(k):

log2(R/G) → log2(R/G) - c = log2(R/G) - log2(k) = log2[R/(kG)]

c is usually calculated by dividing the total red signal intensity by the total green signal intensity and taking the base-2 logarithm, that is, c = log2[sum(R) / sum(G)].
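
The global shift described above can be sketched in a few lines of R. The intensities here are made-up toy values, not the article's data, and the variable names (c_global, M_norm) are only illustrative.

```r
# Hypothetical toy intensities for one chip (not the article's data)
R <- c(200, 400, 800, 1600)   # red channel
G <- c(100, 200, 400, 800)    # green channel

M <- log2(R / G)                     # raw log ratios
c_global <- log2(sum(R) / sum(G))    # c = log2[sum(R) / sum(G)]
M_norm <- M - c_global               # log ratios after global normalization

# Equivalent form: log2[R / (k * G)] with k = 2^c
k <- 2^c_global
M_norm_alt <- log2(R / (k * G))
```

Both M_norm and M_norm_alt should agree up to floating-point error, which is just the identity log2(R/G) - log2(k) = log2[R/(kG)] evaluated numerically.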

Intensity-dependent normalization is generally better than global normalization, because the assumption behind global normalization rarely holds exactly. In practice the log ratio depends on the signal intensity, so the constant c is replaced by a function of intensity: log2(R/G) → log2(R/G) - c(A). Here A = 1/2 * log2(R * G) is the average log intensity of the two channels.

For our data set, mediandata, a plot shows the relationship between log ratio and signal intensity directly:

 

The vertical axis is M = log2(R/G) = log2(new/old), and the horizontal axis is A = 1/2 * log2(R * G) = 1/2 * log2(new * old).
The blue curve is the lowess (locally weighted scatterplot smoothing) regression fit. (Note: 5 rows of the raw data contain 0 values, which give M or A values of Inf or -Inf that lowess cannot handle. These 5 rows were therefore deleted by hand; the processed mediandata was made available for download in the original post. Alternatively, you can compute M and A from the raw data and delete the rows of mediandata that correspond to Inf values.)
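
The alternative mentioned in the note, dropping rows whose M or A is not finite instead of deleting them by hand, could look roughly like this. The data frame toy is a hypothetical stand-in for mediandata (ID column already removed, columns alternating new and old); it is not the article's data.

```r
# Hypothetical stand-in for mediandata: new/old column pairs for two chips
toy <- data.frame(new1 = c(120, 0, 300, 80),
                  old1 = c(100, 50, 0, 90),
                  new2 = c(110, 60, 310, 0),
                  old2 = c(105, 55, 290, 85))

new_cols <- seq(1, ncol(toy), by = 2)   # odd columns: new
old_cols <- seq(2, ncol(toy), by = 2)   # even columns: old

new_mat <- as.matrix(toy[, new_cols])
old_mat <- as.matrix(toy[, old_cols])

M <- log2(new_mat / old_mat)            # Inf, -Inf or NaN where a 0 is involved
A <- 0.5 * log2(new_mat * old_mat)

# Keep a row only if every M and A value in it is finite
keep <- rowSums(!is.finite(M) | !is.finite(A)) == 0
toy_clean <- toy[keep, ]
```

In this toy example only the first row survives, because each of the other rows contains a 0 in at least one channel.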

R code for the plot:

mediandata <- read.table('mediandata.txt', header = TRUE)
mediandata <- mediandata[, -1]  # remove the ID column (the first column)
# MA holds one column pair per chip: odd columns = A, even columns = M
MA <- matrix(data = NA, nrow = dim(mediandata)[1], ncol = dim(mediandata)[2], byrow = TRUE, dimnames = NULL)
new <- 0
old <- 0
for (i in 1:dim(mediandata)[1]) {
  for (j in 1:(dim(mediandata)[2] / 2)) {
    new <- mediandata[i, 2 * j - 1]
    old <- mediandata[i, 2 * j]
    MA[i, 2 * j] <- log(new / old) / log(2)              # M = log2(new / old)
    MA[i, 2 * j - 1] <- 1/2 * log(new * old) / log(2)    # A = 1/2 * log2(new * old)
  }
}
plot(MA[, 1], MA[, 2], xlab = 'A', ylab = 'M')
lines(lowess(MA[, 1], MA[, 2], f = 0.2, iter = 2), lwd = 2, col = 'blue')

# This draws only the MA plot of the two hybridized samples on the first chip. Use MA[, 3], MA[, 4] to draw the plot for the 2nd chip.

We can see that the log ratio of the raw data depends on the log intensity, so intensity-dependent normalization is required.
R's lowess function returns a list with components $x and $y. $x holds the A values sorted in increasing order, and $y holds the fitted value ~M for each of them. The normalized log ratio is M' = M - ~M = M - $y($x), where $y($x) denotes the fitted value at the A value of the same spot.
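
As a minimal sketch of M' = M - ~M, the following uses simulated A and M values (an assumption for illustration, not the article's mediandata). Because lowess sorts $x, the fitted values are put back into the original row order with order() before subtracting; this is one possible way to do the matching, not necessarily the one used in the original post.

```r
set.seed(1)  # hypothetical simulated data
A <- runif(200, 6, 14)
M <- 0.3 * (A - 10) + rnorm(200, sd = 0.2)  # log ratio with an intensity trend

fit <- lowess(A, M, f = 0.2, iter = 2)

# fit$x is sort(A), so fit$y[k] belongs to the row with the k-th smallest A.
# Map the fitted values back to the original row order.
M_fit <- numeric(length(M))
M_fit[order(A)] <- fit$y

M_norm <- M - M_fit   # M' = M - ~M
```

After the subtraction, M_norm should show much less dependence on A than M did, which can be checked with cor(A, M_norm) or by plotting.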

Normalization yields the log ratios of the 10 chips, that is, 10 columns of data. For the subsequent analysis, however, it is more convenient to have 20 columns of data: 10 new and 10 old intensity values. How can we get them?

The answer is simple. Assume that the new intensity is unchanged by normalization and adjust only the old intensity. Since M' = log2(new/old') must equal log2(new/old) - ~M, we get old' = old * 2^(~M), where ~M is the lowess fitted value $y($x) for that spot. Note that the fitted value is the exponent of 2.
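
The relation old' = old * 2^(~M) can be checked numerically. The intensities and fitted values below are hypothetical numbers chosen only for the check.

```r
new <- c(500, 1000, 2000)     # hypothetical new intensities
old <- c(400, 1100, 1500)     # hypothetical old intensities
M_fit <- c(0.2, -0.1, 0.3)    # hypothetical lowess fitted values (~M)

old_adj <- old * 2^M_fit      # old' = old * 2^(~M), new is left unchanged

M_norm_direct <- log2(new / old) - M_fit      # M' = M - ~M
M_norm_from_intensity <- log2(new / old_adj)  # M' recomputed from new and old'
all.equal(M_norm_direct, M_norm_from_intensity)
```

The two ways of computing M' agree, so storing new and old' loses nothing compared with storing the normalized log ratio.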

The following is the R code:
normed <- matrix(data = NA, nrow = dim(MA)[1], ncol = dim(MA)[2], byrow = TRUE, dimnames = NULL)  # new: odd columns; old: even columns
for (j in 1:(dim(MA)[2] / 2)) {
  out_lowess <- lowess(MA[, 2 * j - 1], MA[, 2 * j], f = 0.2, iter = 2)
  # for the first chip: A = MA[, 1], M = MA[, 2]
  loc_lowess <- cbind(out_lowess$x, out_lowess$y)  # column 1: sorted A, column 2: fitted M
  for (i in 1:dim(MA)[1]) {
    normed[i, 2 * j - 1] <- mediandata[i, 2 * j - 1]  # normalized new' intensity (unchanged)
    # look up the fitted value whose A equals this row's A; [1] takes the first match if A values are tied
    normed[i, 2 * j] <- mediandata[i, 2 * j] * 2^(loc_lowess[, 2][loc_lowess[, 1] == MA[i, 2 * j - 1]][1])  # normalized old' intensity
  }
}

Take a look at the effect:

# MA values of the first chip after normalization
MAnormed <- matrix(data = NA, nrow = dim(MA)[1], ncol = 2, byrow = TRUE, dimnames = NULL)
MAnormed[, 2] <- log(normed[, 1] / normed[, 2]) / log(2)          # M = log2(new / old')
MAnormed[, 1] <- 1/2 * log(normed[, 1] * normed[, 2]) / log(2)    # A = 1/2 * log2(new * old')
plot(MAnormed[, 1], MAnormed[, 2], xlab = 'A', ylab = 'M')
lines(lowess(MAnormed[, 1], MAnormed[, 2], f = 0.2, iter = 2), lwd = 2, col = 'blue')

The lowess regression curve is now essentially a straight line, indicating that the normalized log ratio no longer depends on the intensity value.

The effects of normalization can be viewed from another perspective:

 

# Density of the first chip's intensities before and after normalization
plot(density(normed[, 1]), type = 'l', col = 'red', xlab = 'intensity')
lines(density(normed[, 2]), col = 'green')
lines(density(mediandata[, 1]), col = 'blue')
lines(density(mediandata[, 2]), col = 'black')
text(2.2, c(0.09, 0.11, 0.13, 0.15), labels = c('before normalization black', 'before normalization blue', 'after normalization green', 'after normalization red'), col = c('black', 'blue', 'green', 'red'))

For information on chip normalization, refer to the following materials:

Dual-channel chip:

Yang Y. H., Dudoit S., Luu P., Speed T. P. (2001) Normalization for cDNA microarray data, SPIE BiOS 2001, San Jose, CA.

Yang Y. H., Dudoit S., Luu P., Lin D. M., Peng V., Ngai J., Speed T. P. (2002) Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation, Nucleic Acids Research 30(4).

Single-channel oligonucleotide chip:

Bolstad B. M., Irizarry R. A., Astrand M., Speed T. P. (2003) A comparison of normalization methods for high density oligonucleotide array data based on bias and variance, Bioinformatics 19(2): 185-193.

From: http://azaleasays.com/tag/r/
