Autoencoder: an automatic encoder in machine learning. This article uses a denoising autoencoder, known as a Denoise Autoencoder (DAE), to remove dropout noise from scRNA-seq data, for which it is a very well-suited model.
The paper appeared as a preprint in 2018 and was published in Nature Communications (NC), which speaks to the quality of both the method and the article.
Fundamentals of DAE:
Measurement noise from dropout events moves data points away from the data manifold (black line). The autoencoder is trained to denoise the data by mapping corrupted data points back onto the data manifold (green arrows). The solid blue points represent the corrupted points, i.e. the observed points, and the hollow blue dots represent data points without noise. "Manifold" here refers to the low-dimensional structure studied in manifold learning, which is also a very common machine learning approach that we will introduce in the future.
The article describes how DCA takes the count distribution, overdispersion and sparsity of the data into account using a zero-inflated negative binomial (ZINB) noise model (a further introduction to the ZINB model follows below, since I am quite fond of it), and how nonlinear gene-gene interactions and gene-dispersion interactions are captured.
Input counts, mean, dispersion and dropout probabilities are denoted as x, μ, θ and π, respectively. A typical autoencoder compresses high-dimensional data into lower dimensions in order to constrain the model and extract features that summarize the data well in the bottleneck layer.
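As a minimal sketch of this compression idea (the layer sizes and random weights here are illustrative assumptions, not the actual DCA architecture), a single forward pass through an autoencoder bottleneck might look like:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes, bottleneck = 100, 1000, 32

# Random weights stand in for trained parameters (illustration only).
W_enc = rng.normal(0, 0.01, (n_genes, bottleneck))
W_dec = rng.normal(0, 0.01, (bottleneck, n_genes))

X = rng.poisson(2.0, (n_cells, n_genes)).astype(float)

code = np.maximum(X @ W_enc, 0.0)  # encoder: compress to the bottleneck
recon = code @ W_dec               # decoder: map back to the input space

print(code.shape, recon.shape)     # (100, 32) (100, 1000)
```

The bottleneck (32 dimensions here) forces the network to summarize each 1000-gene expression profile compactly before reconstructing it.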
Extension in machine learning:
To test whether a ZINB loss function was necessary, the authors compared DCA to a classical autoencoder with a mean squared error (MSE) loss function using log-transformed count data. The MSE-based autoencoder is unable to recover the cell types, indicating that the specialized ZINB loss function is necessary for scRNA-seq data.
The figures compare four cases: data without dropout, data with dropout, DCA with its ZINB loss function, and a classical autoencoder with an MSE loss on log-transformed data. The main purpose is to highlight the superiority of the new zero-inflated negative binomial loss, which clearly outperforms the classical loss; this is confirmed by classification, dimensionality reduction and cluster analysis in all four cases. The paper demonstrates the method's superiority mostly through figures, with few tables.
Method
ZINB is parameterized with the mean and dispersion parameters of the negative binomial component (μ and θ) and the mixture coefficient that represents the weight of the point mass at zero (π). Here π is the weight of the dropout component in the gene expression counts, i.e. it estimates the dropout probability.
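In this notation, the ZINB density is a mixture of a point mass at zero and a negative binomial component (reconstructed here from the definitions above):

```latex
\mathrm{NB}(x;\mu,\theta) =
\frac{\Gamma(x+\theta)}{\Gamma(\theta)\,x!}
\left(\frac{\theta}{\theta+\mu}\right)^{\theta}
\left(\frac{\mu}{\theta+\mu}\right)^{x}

\mathrm{ZINB}(x;\pi,\mu,\theta) =
\pi\,\delta_{0}(x) + (1-\pi)\,\mathrm{NB}(x;\mu,\theta)
```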
First, the gene expression matrix is adjusted for library size, log-transformed and Z-score normalized.
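A sketch of this preprocessing step in NumPy (the size-factor definition used here, each cell's library size divided by the median library size, is one common convention and an assumption on my part):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.poisson(5.0, (20, 5)).astype(float)  # cells x genes count matrix

# 1. Library-size adjustment: divide each cell by its size factor.
sf = X.sum(axis=1)
sf = sf / np.median(sf)
X_adj = X / sf[:, None]

# 2. Log transform (log1p avoids log(0) on zero counts).
X_log = np.log1p(X_adj)

# 3. Z-score normalization per gene.
X_bar = (X_log - X_log.mean(axis=0)) / X_log.std(axis=0)

print(np.abs(X_bar.mean(axis=0)).max())  # ~0: each gene is centered
print(X_bar.std(axis=0))                 # ~1 for every gene
```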
Then the normalized matrix X̄ is used to compute E, B and D, the activation matrices of the encoding layer, the bottleneck layer and the decoding layer, respectively.
The first three steps compute these matrices, applying a ReLU activation in the encoding and decoding layers.
The next three steps define the ZINB parameters from the decoded matrix, where D is the activation matrix of the decoding layer.
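Putting the six steps together, here is a one-pass sketch with random, untrained weights. The layer sizes and weight names (W_e, W_b, W_d, W_mu, W_theta, W_pi) are illustrative assumptions, but the activations (ReLU in the hidden layers, exp for mean and dispersion, sigmoid for dropout) follow the description above:

```python
import numpy as np

rng = np.random.default_rng(2)
n_cells, n_genes, hidden, bottleneck = 50, 200, 64, 32

X_bar = rng.normal(size=(n_cells, n_genes))  # normalized input matrix
sf = rng.uniform(0.5, 2.0, size=n_cells)     # per-cell size factors

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W_e = rng.normal(0, 0.1, (n_genes, hidden))
W_b = rng.normal(0, 0.1, (hidden, bottleneck))
W_d = rng.normal(0, 0.1, (bottleneck, hidden))
W_mu = rng.normal(0, 0.1, (hidden, n_genes))
W_theta = rng.normal(0, 0.1, (hidden, n_genes))
W_pi = rng.normal(0, 0.1, (hidden, n_genes))

# Steps 1-3: encoder -> bottleneck -> decoder activations.
E = relu(X_bar @ W_e)
B = E @ W_b
D = relu(B @ W_d)

# Steps 4-6: ZINB parameter heads read off the decoder output D.
M = sf[:, None] * np.exp(D @ W_mu)  # mean, rescaled by size factors, > 0
Theta = np.exp(D @ W_theta)         # dispersion, > 0
Pi = sigmoid(D @ W_pi)              # dropout probability, in (0, 1)
```

The exp and sigmoid output activations guarantee that every estimated mean and dispersion is positive and every dropout probability lies in (0, 1), so the ZINB likelihood is always well defined.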
The loss function is the negative log-likelihood of the zero-inflated negative binomial distribution:
From the parameters obtained in this series of equations, the parameters of the loss function are estimated; the estimation uses the ZINB distribution, where M̂ represents the mean matrix scaled by the size factors. The formula is as follows:
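In the notation above, the size-factor-scaled mean can be written as follows (a reconstruction from the definitions above; the weight matrix W_μ is my notation):

```latex
\hat{M} = \operatorname{diag}(s_i)\,\exp\!\left(D\,W_{\mu}\right)
```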
At the same time, when dropout rates vary, the estimation changes: a ridge prior is placed on the dropout (zero-inflation) parameters. The formula is as follows:
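A hedged reconstruction of this regularized loss, consistent with the terms described in the text (λ denotes the ridge coefficient):

```latex
L = \mathrm{NLL}_{\mathrm{ZINB}}\!\left(X;\hat{M},\hat{\Theta},\hat{\Pi}\right)
  + \lambda\,\lVert \hat{\Pi} \rVert^{2}
```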
where the NLL_ZINB function represents the negative log-likelihood of the ZINB distribution, and ||Π̂||² is the squared L2 norm of the dropout matrix Π̂. Minimizing the ZINB negative log-likelihood makes it feasible to estimate Π̂, M̂ and Θ̂ from X.
In short, the advantage of this formulation is that it yields good parameter estimates across different levels of gene dispersion (the alternative option estimates a scalar dispersion parameter per gene).
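As a sketch, the ZINB negative log-likelihood for a single count can be computed directly from the mixture density (a plain Python implementation for illustration, not the authors' code):

```python
import math

def zinb_nll(x, mu, theta, pi):
    """Negative log-likelihood of one count x under ZINB(pi, mu, theta)."""
    # Log-pmf of the negative binomial component.
    nb_log = (math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
              + theta * math.log(theta / (theta + mu))
              + x * math.log(mu / (theta + mu)))
    if x == 0:
        # A zero can come from the point mass (prob pi) or from the NB itself.
        return -math.log(pi + (1.0 - pi) * math.exp(nb_log))
    # Nonzero counts can only come from the NB component.
    return -(math.log(1.0 - pi) + nb_log)

# With pi = 0 the ZINB reduces to a plain negative binomial.
print(round(zinb_nll(2, mu=3.0, theta=2.0, pi=0.0), 4))  # 1.7556
print(round(zinb_nll(0, mu=1.0, theta=1.0, pi=0.5), 4))  # 0.2877
```

In DCA this quantity is summed over all entries of the count matrix, with μ, θ and π supplied per gene and per cell by the three output heads of the network.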
====================================
Many existing single-cell denoising and imputation tools are available, including scImpute, MAGIC, DCA, SAVER, BEIR, etc.
Interested readers can look them up to broaden the scope of their knowledge.
Paper: Single-cell RNA-seq denoising using a deep count autoencoder