Transferred from: http://www.gogoqq.com/ASPX/8390905/JournalContent/1303140588.aspx
I have been studying this algorithm for nearly half a year. I am writing it down to give myself an account of it; it should also be the last post on this blog before the G test.
The Chinese name of Weighted Gene Co-expression Network Analysis has been translated as "weighted correlation network", which does not feel quite right to me; the English name is more direct. The topic was originally one that Shanhao took over from Old Wang. The paper looked interesting, so I found it and chewed through it slowly, and by now I have some grasp of it. The method was proposed by a professor at UCLA, who classifies it as systems biology research. Personally I think that, since the analysis still works only at the level of DNA chips, it does not yet reach the systems level, but the method itself contains some incisive ideas. I will first introduce the basic idea of the method, then use the sample data published online to work through an implementation of the algorithm, and explain some of the key problems I ran into while reading.
Weighted Gene Co-expression Network Analysis (hereinafter WGCNA) is an algorithm for mining module information from chip data. In this method a module is defined as a group of genes with similar expression profiles: if certain genes always change in similar ways during a physiological process or across different tissues, we have reason to think they are functionally related and can define them as a module. This looks somewhat like the result of ordinary clustering, but the difference is that WGCNA's clustering criterion is biologically meaningful, unlike conventional clustering criteria such as geometric distances between data points, so its results are more reliable. Once the gene modules are defined, they can be used for a lot of further work, such as associating modules with traits (used as the example later), modeling metabolic pathways, building gene interaction networks, and even eQTL analysis (which is really handy, provided the project has the money to run that many chips). What I personally gained most from it, though, was a deeper appreciation of why organisms choose a scale-free network topology for regulation (discussed below).
The data WGCNA analyzes are chip data, and many chips at that. For example, to study apoptosis, the method requires the experimenter to provide chip data for cells at each stage of apoptosis, so that the expression changes of all genes in the cell can be followed over the physiological process.
In the co-expression network, each gene, characterized by its expression at specific times or locations, is treated as a point (node). Put simply, a gene whose expression is measured on the chips is a node in the network. If we run 80 chips with 8,000 genes on each, the results of the experiment can be represented by an 80×8000 matrix. To obtain the relationships between genes, we compute the correlation coefficient of every pair of genes (the paper uses the Pearson coefficient). This yields an 8000×8000 real symmetric matrix $S$, where $s_{ij}$ is the Pearson coefficient of the $i$-th and $j$-th genes, that is, the similarity of the two genes' expression profiles.
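The similarity matrix described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the expression values are random synthetic numbers, and 200 genes are used instead of 8,000 to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
n_chips, n_genes = 80, 200   # the text uses 8,000 genes; 200 keeps the demo small

# Synthetic expression matrix (assumption): rows are chips, columns are genes.
expr = rng.normal(size=(n_chips, n_genes))

# np.corrcoef treats rows as variables by default; rowvar=False makes
# each column (gene) a variable, giving a genes x genes Pearson matrix.
S = np.corrcoef(expr, rowvar=False)

print(S.shape)
```

The result is real and symmetric with ones on the diagonal, matching the description of $S$ in the text.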
The next step is the first key point of the method. To decide whether the expression profiles of two genes are similar, one has to specify a threshold by hand: two genes are considered similar only when their Pearson coefficient reaches the threshold (say 0.8), and dissimilar otherwise. For this purpose an adjacency matrix is defined. After $S$ is processed in this way, the resulting adjacency matrix is clearly a 0/1 matrix (importantly, the elements on its main diagonal are set to 0). This analysis has an obvious limitation: there is no reason to think that two genes with a Pearson coefficient of 0.8 differ significantly from two genes with a coefficient of 0.79, yet the algorithm treats the two pairs differently. WGCNA avoids the problem by using a soft threshold. The idea is to assign the elements of the adjacency matrix continuous values through a weight function (hence the name weighted network); commonly used weight functions include the sigmoid function and the power function.
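The contrast between the two thresholds can be made concrete with a small sketch. The function names, the use of the absolute correlation, and the exponent `beta=6` are assumptions chosen for illustration; the power form $a_{ij} = |s_{ij}|^{\beta}$ follows the paper cited in the references.

```python
import numpy as np

def hard_adjacency(S, tau=0.8):
    """0/1 adjacency: 1 where |s_ij| reaches the threshold tau."""
    A = (np.abs(S) >= tau).astype(float)
    np.fill_diagonal(A, 0.0)   # main diagonal set to 0, as in the text
    return A

def soft_adjacency(S, beta=6):
    """Weighted adjacency via the power function a_ij = |s_ij|**beta."""
    A = np.abs(S) ** beta
    np.fill_diagonal(A, 0.0)   # same diagonal convention (assumption)
    return A

# Toy similarity matrix: gene 0 correlates 0.80 with gene 1, 0.79 with gene 2.
S = np.array([[1.00, 0.80, 0.79],
              [0.80, 1.00, 0.10],
              [0.79, 0.10, 1.00]])

hard = hard_adjacency(S)
soft = soft_adjacency(S)
print(hard[0, 1], hard[0, 2])   # the hard threshold separates the two pairs
print(soft[0, 1], soft[0, 2])   # the soft threshold keeps them close
```

With the hard threshold the 0.80 pair is connected and the 0.79 pair is not, while the power function gives the two pairs nearly equal weights.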
Under the power function $a_{ij} = |s_{ij}|^{\beta}$, if the similarity factorizes as $s_{ij} = s_i s_j$, then $a_{ij} = |s_i|^{\beta} |s_j|^{\beta}$ is factorizable too, which is easy to prove. The advantage of this property is simpler computation: when a real symmetric matrix factorizes, a single vector is enough to represent it, which reduces memory use in practice. Identifying modules directly from the adjacency matrix would be somewhat simplistic. To make full use of the chip information, the authors propose computing another matrix, the topological overlap matrix (TOM), to measure the relatedness of two genes. The idea is that the relationship between two genes is not only their direct expression similarity: if gene A interacts with gene C through gene B, that indirect contribution is also incorporated into the TOM value of the pair A, C, so that the similarity of their expression profiles is described more accurately.
Defining the elements of the TOM in this way is ingenious, and it matches what we want to achieve:

$$w_{ij} = \frac{l_{ij} + a_{ij}}{\min(k_i, k_j) + 1 - a_{ij}}, \qquad l_{ij} = \sum_{u} a_{iu} a_{uj}, \qquad k_i = \sum_{u} a_{iu}$$

In the numerator, $l_{ij}$ takes, for every gene $u$, the product of the adjacency of gene $i$ with $u$ and the adjacency of $u$ with gene $j$, and sums these products; $a_{ij}$ is the direct association between gene $i$ and gene $j$. The denominator guarantees that $w_{ij}$ always lies between 0 and 1, which can be checked in an extreme case: when all elements of the adjacency matrix except the main diagonal are 1, then for $n$ genes $l_{ij} = n - 2$ and $k_i = n - 1$, so $w_{ij} = (n - 1)/(n - 1) = 1$. All this is hard to follow in words and much easier once the sums are written out.
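The topological overlap measure can be written compactly with matrix operations. This is a minimal sketch, assuming a symmetric adjacency matrix with values in [0, 1] and a zero diagonal; the function name and the convention of setting the diagonal of the result to 1 are choices made for this example.

```python
import numpy as np

def topological_overlap(A):
    """TOM: w_ij = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    A = np.asarray(A, dtype=float)
    L = A @ A                        # l_ij = sum_u a_iu * a_uj
    k = A.sum(axis=1)                # connectivity k_i = sum_u a_iu
    k_min = np.minimum.outer(k, k)   # min(k_i, k_j) for every pair
    W = (L + A) / (k_min + 1.0 - A)
    np.fill_diagonal(W, 1.0)         # convention assumed here: w_ii = 1
    return W

# Extreme case: every off-diagonal adjacency equals 1.
n = 5
A_full = np.ones((n, n))
np.fill_diagonal(A_full, 0.0)
W_full = topological_overlap(A_full)
print(W_full)
```

In the extreme case every entry of the result equals 1, since $l_{ij} = n - 2$ and $k_i = n - 1$ for all genes.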
Note that WGCNA considers only first-order gene associations. Higher-order associations could be expressed in a similar way, but this is not necessarily worthwhile. First, chip data are noisy, so extracting more information may not give better results. Second, computing higher-order associations increases the complexity of the algorithm considerably, to the point that even a high-end server may not be able to cope.
To facilitate the subsequent module identification, a dissimilarity matrix also has to be defined. Following previous research, it is defined as:

$$d_{ij} = 1 - w_{ij}$$

where $d_{ij}$ is an element of the dissimilarity matrix; the equation simply subtracts $w_{ij}$ from 1. In the paper $d_{ij}$ carries a superscript, which reflects an empirical finding: clustering with this form of the dissimilarity gives more distinct gene modules. With the dissimilarity matrix in hand, the remaining work is clustering. The paper uses hierarchical clustering; the merits of the various clustering methods are outside the scope of this post. Once the cluster analysis is done, the identification of the modules is complete.
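As an illustration of this step, the sketch below turns a TOM into a dissimilarity matrix and runs hierarchical clustering with SciPy. The 6-gene TOM is hypothetical, and both the average-linkage method and the cut into a fixed number of clusters are assumptions made for the example, not details taken from this post.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical TOM for 6 genes forming two obvious groups (0-2 and 3-5).
W = np.full((6, 6), 0.05)
W[:3, :3] = 0.9
W[3:, 3:] = 0.9
np.fill_diagonal(W, 1.0)

D = 1.0 - W                                   # d_ij = 1 - w_ij
condensed = squareform(D, checks=False)       # upper triangle as a flat vector
Z = linkage(condensed, method="average")      # hierarchical clustering (assumed linkage)
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 modules

print(labels)
```

Genes 0 to 2 receive one label and genes 3 to 5 the other. In a real analysis the tree would be cut by inspecting the dendrogram rather than by fixing the number of modules in advance.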
Having gone through the whole analysis, it is worth taking a closer look at some of its details.
The first is the choice of the weight function's parameter. The power function has one parameter, the exponent $\beta$, and its value is bound to affect the result of module identification.
To select an appropriate parameter value, we need to look again at the structure of the gene interaction network. The mathematical name for a network is a graph, and in graph theory every node has an important attribute: its degree, the number of edges attached to that node. Without thinking much about it, one might assume that the networks commonly met in life are random networks, in which the degrees of the nodes are all roughly the same. However, the second kind in the figure, the scale-free network, is the more stable choice. In a scale-free network a few nodes have degrees far above the average; these nodes are called hubs. The few hubs connect to the other nodes and so hold the whole network together. In such a network, the degree and the number of nodes having that degree follow a power-law relationship. This gives us a theoretical basis for finding the best parameter.

A short digression seems worthwhile here. As long as we are willing to abstract, scale-free networks are everywhere in life: social networks, gene and protein interactions, computer networks, and even the spread of sexually transmitted diseases have this structure. There are evolutionary reasons why organisms choose scale-free rather than random networks. In a scale-free network a few key genes carry out the main functions, which makes the network very robust: as long as the hubs remain intact, the basic activities of the whole living system are not affected by a given stimulus, whereas the damage to a random network under an external stimulus is directly proportional to the intensity of the stimulus.
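The power-law relationship between degree and frequency can be checked numerically by regressing the log frequency on the log degree; a good linear fit with a negative slope suggests scale-free behaviour. The sketch below is one simple way to do this. The function name, the equal-width binning, and the synthetic heavy-tailed degrees are all assumptions made for illustration.

```python
import numpy as np

def scale_free_fit(k, n_bins=10):
    """Return (slope, R^2) of the regression log10 p(k) ~ log10 k."""
    k = np.asarray(k, dtype=float)
    counts, edges = np.histogram(k, bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    keep = (counts > 0) & (centers > 0)          # drop empty bins before taking logs
    x = np.log10(centers[keep])
    y = np.log10(counts[keep] / counts.sum())    # log frequency of each degree bin
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    r2 = 1.0 - resid.var() / y.var()             # coefficient of determination
    return slope, r2

# Synthetic heavy-tailed "degrees" (assumption), standing in for k_i = sum_u a_iu.
rng = np.random.default_rng(1)
k = rng.pareto(2.0, size=5000) + 1.0
slope, r2 = scale_free_fit(k)
print(f"slope={slope:.2f}, R^2={r2:.2f}")
```

For a weighted network, the degrees passed in would be the row sums of the adjacency matrix obtained with a given value of the weight function's parameter.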
Random network (a) and scale-free network (b). In the scale-free network, the larger hubs are highlighted. Image source: http://en.wikipedia.org/wiki/File:Scale-free_network_sample.png

With this theoretical basis, we can try a series of parameter values for the weight function, find the value whose network best fits the degree frequency distribution of a scale-free network, and use it for the follow-up analysis. In practice, however, the search involves a trade-off between maximizing the $R^2$ of the scale-free topology regression and keeping enough connections in the network. One could set this up as an optimization model, but the authors did not pursue it far enough to give an objective criterion. So although they established a scale-free topology criterion, the analysis still has a large subjective component.

Post-module analysis

Once the modules are established, a characteristic gene needs to be defined for each module, one that represents the module with an acceptable loss of information, to make it easier to relate the module to other data such as trait information. A great benefit of doing so is that it simplifies the calculations, so results come quickly even when the data volume is very large. In the subsequent analysis, the authors also compare the clustering coefficients of networks built with hard and soft thresholds, and the effect of the two thresholds on network connectivity. That analysis is meant to show the advantages of the soft threshold over the hard threshold; it is not discussed further here because it involves deeper graph theory and the construction of the modules.

References: Bin Zhang, Steve Horvath, A General Framework for Weighted Gene Co-expression Network Analysis, Statistical Applications in Genetics and Molecular Biology, Departments of Human Genetics and Biostatistics, University of California at Los Angeles, 2005, Volume 4, Issue 1, Article 17.
WGCNA Algorithm Research Notes