This article introduces t-SNE, an effective method for dimensionality reduction.
In the age of big data, data is not only exploding in volume; it is also becoming more complex and higher-dimensional. For a large image, for example, the dimension of the data is on the order of the number of pixels, anywhere from thousands to millions. A computer can handle datasets of any dimension, but human perception is confined to three-dimensional space, and computers still need us (thankfully), so we need methods to visualize high-dimensional data effectively. Real-world datasets often turn out to have a much lower intrinsic dimension. Imagine taking photos with a camera: each picture can be viewed as a point in a 16,000,000-dimensional space (assuming a 16-megapixel camera), yet the photos are distributed approximately in a three-dimensional space. This low-dimensional space is embedded in the high-dimensional space in a complex, non-linear way. The structure is hidden in the data and can only be recovered with suitable mathematical methods.
This is the subject of manifold learning, also called non-linear dimensionality reduction, a branch of machine learning (more specifically, unsupervised learning). This article focuses on a popular dimensionality reduction method, t-SNE, introduced by Laurens van der Maaten and Geoffrey Hinton (original article). The algorithm has been applied successfully to many real datasets. Here, following the original idea, we apply it to the MNIST handwritten digits dataset using Python and scikit-learn.
Visualizing MNIST
First, import some libraries.
import numpy as np
from numpy import linalg
from numpy.linalg import norm
from scipy.spatial.distance import squareform, pdist
# Import scikit-learn.
import sklearn
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale
from sklearn.metrics.pairwise import pairwise_distances
# These private helpers follow the module layout of older scikit-learn
# releases; later releases renamed the module to sklearn.manifold._t_sne.
from sklearn.manifold.t_sne import (_joint_probabilities,
                                    _kl_divergence)
from sklearn.utils.extmath import _ravel
# Random state.
RS = 20150101
# Use Matplotlib for graphics.
import matplotlib.pyplot as plt
import matplotlib.patheffects as patheffects
import matplotlib
# IPython/Jupyter magic: render figures inline.
%matplotlib inline
# Use seaborn for nicer plots.
import seaborn as sns
sns.set_style('darkgrid')
sns.set_palette('muted')
sns.set_context("notebook", font_scale=1.5,
                rc={"lines.linewidth": 2.5})
# Create animations with Matplotlib and MoviePy.
from moviepy.video.io.bindings import mplfig_to_npimage
import moviepy.editor as mpy
Load the handwritten digits dataset: 1,797 images in total, each 8x8 pixels.
digits = load_digits()
digits.data.shape
print(digits['DESCR'])
Data Set Characteristics:
    :Number of Instances: 5620
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
# Show the first ten digits in a 2x5 grid.
nrows, ncols = 2, 5
plt.figure(figsize=(6, 3))
plt.gray()
for i in range(ncols * nrows):
    ax = plt.subplot(nrows, ncols, i + 1)
    ax.matshow(digits.images[i, ...])
    plt.xticks([]); plt.yticks([])
    plt.title(digits.target[i])
plt.savefig('images/digits-generated.png', dpi=150)
Now run the t-SNE algorithm:
# Reorder the data points by digit class (ten classes, 0 to 9).
X = np.vstack([digits.data[digits.target == i]
               for i in range(10)])
y = np.hstack([digits.target[digits.target == i]
               for i in range(10)])
digits_proj = TSNE(random_state=RS).fit_transform(X)
Here is a function that displays the transformed dataset:
def scatter(x, colors):
    # Choose a color palette with seaborn, one color per digit.
    palette = np.array(sns.color_palette("hls", 10))
    # Create a scatter plot.
    f = plt.figure(figsize=(8, 8))
    ax = plt.subplot(aspect='equal')
    sc = ax.scatter(x[:, 0], x[:, 1], lw=0, s=40,
                    c=palette[colors.astype(int)])
    # Upper limits assumed symmetric to the lower ones.
    plt.xlim(-25, 25)
    plt.ylim(-25, 25)
    ax.axis('off')
    ax.axis('tight')
    # Add a label for each digit.
    txts = []
    for i in range(10):
        # Position of each label: the median of the points in that class.
        xtext, ytext = np.median(x[colors == i, :], axis=0)
        txt = ax.text(xtext, ytext, str(i), fontsize=24)
        txt.set_path_effects([
            patheffects.Stroke(linewidth=5, foreground="w"),
            patheffects.Normal()])
        txts.append(txt)
    return f, ax, sc, txts
Results:
scatter(digits_proj, y)
plt.savefig('images/digits_tsne-generated.png', dpi=120)
Points of different colors represent different digits, and you can see that images of the same digit are clearly grouped into separate clusters.
Mathematical Framework
The following describes how the algorithm works. First, a few definitions:
The data points x_i live in the original data space R^d, where d = 64 is the dimension of the data space. Each point represents one image from the handwritten digits dataset. There are n = 1797 points in total.
The map points y_i live in the map space R^2, which holds our final representation of the data. There is a bijection between data points and map points: each map point represents one of the original images.
How do we choose the positions of the map points? If two data points are close together, we want the two corresponding map points to be close as well. Let |x_i − x_j| denote the Euclidean distance between two data points and |y_i − y_j| the distance between the corresponding map points. First, define the conditional similarity between two data points:
$$p_{j|i} = \frac{\exp\left(-|x_i - x_j|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-|x_i - x_k|^2 / 2\sigma_i^2\right)}$$
This formula measures how close x_j is to x_i under a Gaussian distribution centered on x_i with variance σ_i^2. The original paper explains in detail how this variance is computed, so it is not repeated here.
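As a minimal sketch of this definition, the following NumPy function computes the conditional similarities p_{j|i} for a tiny dataset. The function name `conditional_similarities`, the toy points, and the hand-picked values of σ_i are assumptions made for illustration; in t-SNE proper each σ_i is derived from a target perplexity, which is not shown here.

```python
import numpy as np

def conditional_similarities(X, sigmas):
    """Return P with P[i, j] = p_{j|i}, given one bandwidth sigma_i per point."""
    X = np.asarray(X, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    # Squared Euclidean distances between all pairs of points.
    diff = X[:, None, :] - X[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    # Gaussian kernel centered on x_i, using that point's own variance.
    logits = -sq_dist / (2.0 * sigmas[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)  # a point is not its own neighbor
    # Subtract each row's maximum before exponentiating, for numerical stability.
    logits -= logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    # Normalize each row so that the p_{j|i} sum to 1 over j.
    return P / P.sum(axis=1, keepdims=True)

# Three points on a line: the first two are close, the third is far away.
X_toy = np.array([[0.0], [1.0], [5.0]])
P_cond = conditional_similarities(X_toy, sigmas=[1.0, 1.0, 2.0])
print(P_cond)
```

Each row of `P_cond` sums to 1, and because every point uses its own σ_i the matrix is generally not symmetric.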
Now define the similarity as a symmetrized version of the conditional similarity:
$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$
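The symmetrization can be sketched in a few lines, assuming the standard t-SNE definition p_ij = (p_{j|i} + p_{i|j}) / (2n). The helper name `joint_from_conditional` and the hand-written conditional matrix are hypothetical and serve only as an example.

```python
import numpy as np

def joint_from_conditional(P_cond):
    """Symmetrize p_{j|i} into p_ij = (p_{j|i} + p_{i|j}) / (2 n)."""
    P_cond = np.asarray(P_cond, dtype=float)
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * n)

# Hypothetical conditional similarities for three points (each row sums to 1).
P_cond = np.array([[0.0, 0.9, 0.1],
                   [0.7, 0.0, 0.3],
                   [0.2, 0.8, 0.0]])
P_joint = joint_from_conditional(P_cond)
print(P_joint)
```

The result is symmetric and its entries sum to 1, so it can be read as a probability distribution over pairs of points.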
This gives us a similarity matrix for the original images. What does this matrix look like?
Similarity Matrix
The following function computes the similarity matrix with a constant σ:
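def _joint_probabilities_constant_sigma(D, sigma):
    # Note: D already contains squared distances and sigma**2 multiplies the
    # exponent, so sigma acts here as a tuning constant rather than as a true
    # Gaussian standard deviation.
    P = np.exp(-D**2/2 * sigma**2)
    # Normalize each row so that the p_{j|i} sum to 1 over j.
    P /= np.sum(P, axis=1, keepdims=True)
    return P
# Pairwise (squared) distances between all data points.
D = pairwise_distances(X, squared=True)
# Similarity with constant sigma.
P_constant = _joint_probabilities_constant_sigma(D, .002)
# Similarity with variable sigma. A perplexity of 30 is assumed here
# (scikit-learn's default for t-SNE).
P_binary = _joint_probabilities(D, 30., False)
# The output of this function needs to be reshaped to a square matrix.
P_binary_s = squareform(P_binary)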
We can now display the distance matrix of the data points, together with the two similarity matrices:
# Figure size and title font size are assumed values.
plt.figure(figsize=(12, 4))
pal = sns.light_palette("blue", as_cmap=True)
plt.subplot(131)
plt.imshow(D[::10, ::10], interpolation='none', cmap=pal)
plt.axis('off')
plt.title("Distance matrix", fontdict={'fontsize': 16})
plt.subplot(132)
plt.imshow(P_constant[::10, ::10], interpolation='none', cmap=pal)
plt.axis('off')
plt.title(r"$p_{j|i}$ (constant $\sigma$)", fontdict={'fontsize': 16})
plt.subplot(133)
plt.imshow(P_binary_s[::10, ::10], interpolation='none', cmap=pal)
plt.axis('off')
plt.title(r"$p_{j|i}$ (variable $\sigma$)", fontdict={'fontsize': 16})
plt.savefig('images/similarity-generated.png', dpi=120)
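Next, define the similarity matrix for the map points:
$$q_{ij} = \frac{\left(1 + |y_i - y_j|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + |y_k - y_l|^2\right)^{-1}}$$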
The goal is to make q_ij as close as possible to p_ij, so that the distances between map points faithfully reflect the distances between data points.
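The map-space similarities can be sketched in the same spirit. This example assumes the heavy-tailed kernel 1 / (1 + |y_i − y_j|^2) from the t-SNE paper, normalized over all pairs of distinct points; the function name `map_similarities` and the toy coordinates are made up for illustration.

```python
import numpy as np

def map_similarities(Y):
    """Return Q with Q[i, j] = q_ij for map points Y (one row per point)."""
    Y = np.asarray(Y, dtype=float)
    diff = Y[:, None, :] - Y[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    # Heavy-tailed kernel: 1 / (1 + squared distance).
    kernel = 1.0 / (1.0 + sq_dist)
    np.fill_diagonal(kernel, 0.0)  # exclude self-pairs
    # Normalize over all pairs so that the entries sum to 1.
    return kernel / kernel.sum()

# Two nearby map points and one distant map point.
Y_toy = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 3.0]])
Q = map_similarities(Y_toy)
print(Q)
```

Nearby map points receive a larger q_ij than distant ones, and the matrix is symmetric by construction.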
Physical Analogy
If two map points are far apart while their data points are close, they attract each other; if two map points are closer together than their data points, they repel each other. The final mapping is obtained when these forces reach equilibrium. The following illustration shows this behavior:
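In code, the attraction and repulsion can be sketched for a single pair of points. The sketch assumes the force on map point y_i is the negative of that pair's contribution to the t-SNE gradient, −4 (p_ij − q_ij) (y_i − y_j) / (1 + |y_i − y_j|^2); the function name `pair_force` and the numerical values are hypothetical.

```python
import numpy as np

def pair_force(y_i, y_j, p_ij, q_ij):
    """Force exerted on map point y_i by map point y_j."""
    y_i = np.asarray(y_i, dtype=float)
    y_j = np.asarray(y_j, dtype=float)
    delta = y_i - y_j
    # Negative gradient contribution of the pair (i, j).
    return -4.0 * (p_ij - q_ij) * delta / (1.0 + delta @ delta)

y_i, y_j = np.array([0.0, 0.0]), np.array([2.0, 0.0])
# Data points more similar than map points (p_ij > q_ij): attraction.
attract = pair_force(y_i, y_j, p_ij=0.30, q_ij=0.10)
# Map points more similar than data points (p_ij < q_ij): repulsion.
repel = pair_force(y_i, y_j, p_ij=0.05, q_ij=0.10)
print(attract, repel)
```

In the first case the force on y_i points toward y_j, in the second case away from it; the sign of p_ij − q_ij alone decides which.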
Algorithm
The physical analogy above derives from the mathematical algorithm, which minimizes the Kullback-Leibler divergence between the two distributions p_ij and q_ij:
$$KL(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
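This measures the distance between the two similarity matrices. It is minimized with gradient descent, using the gradient:
$$\frac{\partial\, KL(P \| Q)}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij})\, g\left(|y_i - y_j|\right) u_{ij}, \quad g(z) = \frac{z}{1 + z^2}$$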
Here u_ij is the unit vector pointing from y_j to y_i. The gradient expresses the sum of all spring forces acting on map point i.
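To make the objective and its gradient concrete, here is a small self-contained sketch that computes both and checks the analytic gradient against finite differences. It assumes the standard t-SNE forms KL(P||Q) = Σ p_ij log(p_ij / q_ij) and ∂KL/∂y_i = 4 Σ_j (p_ij − q_ij)(y_i − y_j) / (1 + |y_i − y_j|^2), which equals the g(|y_i − y_j|) u_ij form with g(z) = z / (1 + z^2). All function names and the random toy data are assumptions for illustration, not scikit-learn's internal code.

```python
import numpy as np

def map_similarities(Y):
    """Return q_ij, the unnormalized kernel, and the differences y_i - y_j."""
    diff = Y[:, None, :] - Y[None, :, :]
    kernel = 1.0 / (1.0 + np.sum(diff ** 2, axis=-1))
    np.fill_diagonal(kernel, 0.0)
    return kernel / kernel.sum(), kernel, diff

def kl_divergence(P, Y):
    Q, _, _ = map_similarities(Y)
    mask = P > 0  # terms with p_ij = 0 contribute nothing
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

def kl_gradient(P, Y):
    Q, kernel, diff = map_similarities(Y)
    # weights[i, j] = (p_ij - q_ij) / (1 + |y_i - y_j|^2)
    weights = (P - Q) * kernel
    return 4.0 * np.sum(weights[:, :, None] * diff, axis=1)

def numerical_gradient(P, Y, eps=1e-6):
    """Central finite differences, used only to check kl_gradient."""
    grad = np.zeros_like(Y)
    for idx in np.ndindex(*Y.shape):
        step = np.zeros_like(Y)
        step[idx] = eps
        grad[idx] = (kl_divergence(P, Y + step)
                     - kl_divergence(P, Y - step)) / (2 * eps)
    return grad

# Toy problem: a random symmetric P that sums to 1, and random map points.
rng = np.random.default_rng(0)
n = 5
A = rng.random((n, n))
P = A + A.T
np.fill_diagonal(P, 0.0)
P /= P.sum()
Y = rng.normal(size=(n, 2))

analytic = kl_gradient(P, Y)
numeric = numerical_gradient(P, Y)
print(np.max(np.abs(analytic - numeric)))
```

The printed value is the largest discrepancy between the analytic and the finite-difference gradient. The same two functions could drive a plain gradient descent loop, although t-SNE implementations add momentum and adaptive gains on top.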
# This list will contain the positions of the map points at every iteration.
positions = []
def _gradient_descent(objective, p0, it, n_iter, n_iter_without_progress=30,
                      momentum=0.5, learning_rate=1000.0, min_gain=0.01,
                      min_grad_norm=1e-7, min_error_diff=1e-7, verbose=0,
                      args=[]):
    # The documentation of this function can be found in scikit-learn's code.
    p = p0.copy().ravel()
    update = np.zeros_like(p)
    gains = np.ones_like(p)
    error = np.finfo(float).max
    best_error = np.finfo(float).max
    best_iter = 0
    for i in range(it, n_iter):
        # Save the current position.
        positions.append(p.copy())
        new_error, grad = objective(p, *args)
        error_diff = np.abs(new_error - error)
        error = new_error
        grad_norm = linalg.norm(grad)
        # Stop when there is no progress or the changes become negligible.
        if error < best_error:
            best_error = error
            best_iter = i
        elif i - best_iter > n_iter_without_progress:
            break
        if min_grad_norm >= grad_norm:
            break
        if min_error_diff >= error_diff:
            break
        # Adapt the per-coordinate gains, then take a momentum step.
        inc = update * grad >= 0.0
        dec = np.invert(inc)
        gains[inc] += 0.05
        gains[dec] *= 0.95
        # Clip in place; without out=gains the result would be discarded.
        np.clip(gains, min_gain, np.inf, out=gains)
        grad *= gains
        update = momentum * update - learning_rate * grad
        p += update
    return p, error, i
# Replace scikit-learn's internal routine with the recording version above.
sklearn.manifold.t_sne._gradient_descent = _gradient_descent