**MNIST Data Set**
The official website of the MNIST dataset is Yann LeCun's website. Here we provide a copy of Python source code that automatically downloads and installs this dataset. You can download this code and import it into your project with the following lines, or simply copy and paste it directly into your code file.

```python
import input_data
mnist = input_data.read_data_sets("mnist_data/", one_hot=True)
```

The downloaded dataset comes in two parts: 60,000 training examples (mnist.train) and 10,000 test examples (mnist.test). This split is important: a machine learning project must set aside a separate test set that is never used for training but only to evaluate the model's performance, which makes it easier to check that the model generalizes to other data.

As mentioned earlier, each MNIST data unit has two parts: an image of a handwritten digit and a corresponding label. We will call the images "xs" and the labels "ys". Both the training set and the test set contain xs and ys; for example, the training images are mnist.train.images and the training labels are mnist.train.labels.

Each image is 28 pixels by 28 pixels. We can represent it as an array of numbers:

We can flatten this array into a vector of 28 × 28 = 784 numbers. How we flatten the array (the order of the numbers) doesn't matter, as long as every image is flattened the same way. From this perspective, each MNIST image is a point in a 784-dimensional vector space, with a very rich structure (a reminder: visualizing such data is computationally intensive).
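As a small sketch (using NumPy; this helper is not part of the original tutorial's code), flattening a 28 × 28 pixel array into a 784-dimensional vector looks like this:

```python
import numpy as np

# A toy 28x28 "image" of pixel intensities in [0, 1]
image = np.random.rand(28, 28)

# Flatten it row by row into a 784-dimensional vector.
# Any fixed order works, as long as it is applied
# consistently to every image in the dataset.
flat = image.reshape(784)

print(flat.shape)  # (784,)
```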

Flattening an image's array of numbers throws away information about its two-dimensional structure. That is obviously not ideal, and the best computer vision methods do mine and exploit this structure, which we'll cover in a follow-up tutorial. But in this tutorial we ignore it: the simple mathematical model introduced here, softmax regression, does not take advantage of this structural information.

Therefore, in the MNIST training dataset, mnist.train.images is a tensor of shape [60000, 784]: the first dimension indexes the images and the second dimension indexes the pixels within each image. Each element of this tensor is the intensity of a particular pixel in a particular image, a value between 0 and 1.

The corresponding MNIST labels are numbers from 0 to 9, describing which digit a given image shows. For this tutorial, we represent the labels as "one-hot vectors". A one-hot vector is 0 in every dimension except for a 1 in a single dimension. So in this tutorial, the digit n is represented as a 10-dimensional vector that is 1 only in the nth dimension (counting from 0). For example, the label 0 is represented as [1,0,0,0,0,0,0,0,0,0]. Consequently, mnist.train.labels is a [60000, 10] matrix of numbers.
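A minimal sketch of one-hot encoding (in NumPy; the `one_hot` helper name is our own, not from the tutorial's code):

```python
import numpy as np

def one_hot(n, num_classes=10):
    """Return a vector that is 1 in dimension n and 0 elsewhere."""
    v = np.zeros(num_classes)
    v[n] = 1.0
    return v

print(one_hot(0))  # [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
print(one_hot(3))  # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
```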

**Softmax Regression Introduction**

We know that each MNIST image shows a digit from 0 to 9. We want to get the probability that a given image shows each digit. For example, our model might look at a picture of a nine and be 80% sure it's a nine, but give a 5% chance of it being an eight (because both 8 and 9 have a loop in the upper half), and then assign lower probabilities to the other digits.

This is a classic case for a softmax regression model. Softmax can be used to assign probabilities to an object being one of several different things. Even later, when we train more sophisticated models, the final step will still be a softmax layer to assign probabilities.

Softmax regression has two steps:

**First step:**

To tally up the evidence that a given image belongs to a particular digit class, we take a weighted sum of the image's pixel intensities. If a pixel's intensity is strong evidence that the image does not belong to the class, the corresponding weight is negative; conversely, if the pixel's intensity is evidence in favor of the image belonging to the class, the weight is positive.

The following image shows the weights a model has learned for each pixel, for each digit class. Red represents negative weights and blue represents positive weights.

We also need to add an extra bias term, because the input tends to carry some class-independent variation. So the evidence that a given input image x belongs to digit class i can be expressed as:

$$\text{evidence}_i = \sum_j W_{i,j}\, x_j + b_i$$

where $W_{i,j}$ is the weight and $b_i$ is the bias for class $i$, and $j$ is an index over the pixels of the input image $x$ for the summation. We can then use the softmax function to convert the evidence into probabilities $y$:

$$y = \text{softmax}(\text{evidence})$$

Here softmax serves as an "activation" or "link" function, shaping the output of our linear function into the form we want: a probability distribution over the 10 digit classes. So, given an image, softmax converts the tallies of evidence into a probability for each digit. The softmax function can be defined as:

$$\text{softmax}(x) = \text{normalize}(\exp(x))$$

If you expand the right-hand side of the equation, you get:

$$\text{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$$

But it is often more helpful to think of softmax the first way: exponentiating its inputs and then normalizing them. The exponentiation means that one more unit of evidence increases the weight given to a hypothesis multiplicatively; conversely, one less unit of evidence means a hypothesis gets a fraction of its earlier weight. No hypothesis ever has zero or negative weight. Softmax then normalizes these weights so that they sum to 1, forming a valid probability distribution. (More information about the softmax function can be found in this section of Michael Nielsen's book, which has an interactive visualization of softmax.)
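The two equivalent views of softmax above can be sketched in a few lines of NumPy (a standalone illustration, not the tutorial's own code; subtracting the maximum is a common numerical-stability trick that leaves the result unchanged because softmax is shift-invariant):

```python
import numpy as np

def softmax(x):
    # Exponentiate, then normalize so the outputs sum to 1.
    # Subtracting max(x) avoids overflow without changing the result.
    e = np.exp(x - np.max(x))
    return e / e.sum()

evidence = np.array([2.0, 1.0, 0.1])
probs = softmax(evidence)
print(probs)        # larger evidence -> larger probability
print(probs.sum())  # the probabilities sum to 1
```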

The softmax regression model can be pictured as in the following diagram: we take a weighted sum of the inputs xs, add a bias, and then feed the result into the softmax function:

If we write that out as equations, we get:

We can also "vectorize" this procedure, turning it into a matrix multiplication plus a vector addition. This helps computational efficiency (and is also a useful way to think about it).

More compactly, it can be written as:

$$y = \text{softmax}(Wx + b)$$
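The vectorized model can be sketched directly in NumPy (a toy illustration with untrained, zero-initialized parameters, not the tutorial's TensorFlow code; with all weights and biases zero, the evidence is identical for every class, so softmax returns the uniform distribution):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Shapes matching the tutorial: 784 pixels in, 10 digit classes out
W = np.zeros((10, 784))   # weights (untrained here, for illustration)
b = np.zeros(10)          # biases
x = np.random.rand(784)   # a flattened input image

y = softmax(W @ x + b)    # y = softmax(Wx + b)
print(y)  # a probability distribution over the 10 digit classes
```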

**Implementing the Regression Model**

To do efficient numerical computation in Python, we typically use libraries like NumPy that implement expensive operations such as matrix multiplication outside Python, in code written in other languages. Unfortunately, switching back to Python for every operation still carries a lot of overhead. This overhead is even worse if you want to run computations on a GPU.