Original page: Visualizing Parts of Convolutional Neural Networks using Keras and Cats

Translation: Convolutional Neural Networks in Practice (Visualization): Using Keras to Identify Cats

It is well known that convolutional neural networks (CNNs, or convnets) have been the source of many major breakthroughs in deep learning in the last few years, but they are rather unintuitive for most people to reason about. I've always wanted to break a convnet into its parts and see what an image looks like after each stage, and in this post I do just that!

**CNNs at a higher level**

First off, what are convnets good at? Convnets are used primarily to look for patterns in an image. You do this by convolving over an image and looking for patterns. In the first few layers, a CNN identifies lines and corners, but it can then pass these patterns down through the network and start recognizing more complex features as it gets deeper. This makes CNNs really good at identifying objects in images.

**What's a CNN?**

A CNN is a neural network that typically contains several types of layers, including **convolutional layers**, **pooling layers**, and **activation layers**.

**Convolutional Layers**

To understand what a CNN is, you first need to understand how convolutions work. Imagine you have an image represented as a 5x5 matrix of values, and you slide a 3x3 window around that image. At each position the 3x3 window visits, you multiply its values element-wise by the values of the image currently covered by the window and sum the products. This produces a single number representing all the values in that window of the image. Here's a pretty gif for clarity:

As you can see, each item in the feature matrix corresponds to a section of the image. Note that the values of the kernel matrix are the red numbers in the corner of the gif.

The "window" that moves over the image is called a **kernel**. Kernels are typically square, and 3x3 is a fairly common kernel size for small-ish images. The distance the window moves each time is called the **stride**. Also of note, images are sometimes padded with zeros around the perimeter when performing convolutions, which dampens the values of the convolutions around the edges of the image (the idea being that the center of a photo typically matters more).
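Kernel size, stride, and padding together determine how much the output shrinks. A small helper (standard convolution arithmetic, not from the original post) makes the relationship concrete:

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output length along one dimension of a convolution:
    floor((n - kernel + 2 * padding) / stride) + 1."""
    return (n - kernel + 2 * padding) // stride + 1

# A 5x5 image with a 3x3 kernel shrinks to 3x3...
size_unpadded = conv_output_size(5, 3)           # 3
# ...but padding with one ring of zeros preserves the 5x5 size.
size_padded = conv_output_size(5, 3, padding=1)  # 5
```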

The goal of a convolutional layer is **filtering**. As we move over an image, we effectively check for patterns in each section of it. This works because of **filters**: stacks of weights represented as a vector, which are multiplied by the values output by the convolution. When training on images, these weights change, and when it's time to evaluate an image, they return high values if the network thinks it is seeing a pattern it has seen before. The combinations of high weights from various filters let the network predict the content of an image. This is why, in CNN architecture diagrams, the convolution step is represented by a box rather than a rectangle; the third dimension represents the filters.
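To see why the filters form that third dimension, here is a minimal sketch (hypothetical hand-picked kernels, plain Python) that applies a small bank of 3x3 filters to one image and stacks the resulting feature maps:

```python
def convolve2d(image, kernel):
    """Stride-1, unpadded 2-D convolution (multiply-and-sum per window)."""
    k = len(kernel)
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(k) for dj in range(k))
             for j in range(len(image[0]) - k + 1)]
            for i in range(len(image) - k + 1)]

# Three hypothetical filters (a trained network learns these weights).
filters = [
    [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],  # horizontal-edge detector
    [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],  # vertical-edge detector
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],     # simple blur / sum
]

image = [[(r + c) % 7 for c in range(6)] for r in range(6)]

# Each filter yields its own feature map; stacking them gives the
# depth axis drawn as a box in CNN architecture diagrams.
feature_stack = [convolve2d(image, f) for f in filters]
```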

Things to note:

- The output of the convolution is smaller (in width and height) than the original image.
- A linear function is applied between the kernel and the image window under the kernel.
- The weights in the filters are learned by seeing lots of images.

**Pooling Layers**

Pooling works very much like convolving: we take a **kernel** and move it over the image. The only difference is that the function applied to the kernel and the image window isn't linear.

**Max pooling** and **average pooling** are the most common pooling functions. Max pooling takes the largest value from the window of the image currently covered by the kernel, while average pooling takes the average of all values in the window.
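Both can be sketched in a few lines (a minimal illustration using a 2x2 window with stride 2, a common choice):

```python
def pool2d(image, size=2, reduce=max):
    """Non-overlapping pooling: apply `reduce` to each size x size window."""
    return [[reduce(image[i + di][j + dj]
                    for di in range(size) for dj in range(size))
             for j in range(0, len(image[0]) - size + 1, size)]
            for i in range(0, len(image) - size + 1, size)]

image = [[1, 2, 5, 6],
         [3, 4, 7, 8],
         [9, 10, 13, 14],
         [11, 12, 15, 16]]

max_pooled = pool2d(image, reduce=max)                   # [[4, 8], [12, 16]]
avg_pooled = pool2d(image, reduce=lambda w: sum(w) / 4)  # [[2.5, 6.5], [10.5, 14.5]]
```

Note that a 4x4 input becomes 2x2: pooling is what shrinks the image so aggressively later in the post.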

**Activation Layers**

Activation layers work exactly as in other neural networks: a value is passed through a function that squashes it into a range. Here are a bunch of common ones:

The most-used activation function in CNNs is the ReLU (Rectified Linear Unit). There are a bunch of reasons people like ReLUs, but a big one is that they are really cheap to compute: if the number is negative, output zero; otherwise, output the number. Being cheap makes networks faster to train.
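The "zero if negative, else the number" rule is one line (a trivial sketch for completeness):

```python
def relu(x):
    """Rectified Linear Unit: clamp negative inputs to zero."""
    return max(0, x)

# Applied element-wise to a feature map, ReLU zeroes out the negatives.
feature_map = [[-2.0, 0.5], [3.0, -0.1]]
activated = [[relu(v) for v in row] for row in feature_map]  # [[0, 0.5], [3.0, 0]]
```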

**Recap**

There are three main types of layers in CNNs: **convolutional**, **pooling**, and **activation**.

- **Convolutional layers** multiply kernel values by the image window and optimize the kernel weights over time using gradient descent.
- **Pooling layers** describe a window of an image with a single value: the max or the average of that window.
- **Activation layers** squash the values into a range, typically [0,1] or [-1,1].

**What does a CNN look like?**

Before we get into what a CNN looks like, a little background. The first successful applications of convnets were by Yann LeCun in the 90s, when he created something called LeNet that could be used to read handwritten numbers. Since then, computing advancements and powerful GPUs have allowed researchers to be more ambitious. In 2010 the Stanford Vision Lab released ImageNet, a data set of 14 million images with labels detailing their contents. It has become one of the world's standards for comparing CNN models, and the current best models successfully detect the objects in 94+% of the images. Every so often someone beats the all-time high score on ImageNet, and it's a pretty big deal. In 2014 it was GoogLeNet and VGGNet; before that it was ZF Net. The first viable example of a CNN applied to ImageNet was AlexNet in 2012; before that, researchers attempted to use traditional computer vision techniques, but AlexNet outperformed everything else up to that point by ~15%.

Anyway, let's take a look at LeNet:

This diagram doesn't show the activation functions. The architecture is:

Input image → conv layer → ReLU → max pooling → conv layer → ReLU → max pooling → hidden layer → softmax (activation) → output layer

**On to the cats!**
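To make the pipeline concrete, here is a sketch tracking how the spatial size shrinks through the convolution and pooling stages, assuming LeNet-style hyperparameters (32x32 input, 5x5 unpadded convolutions, 2x2 max pooling); the numbers are illustrative, not taken from the original post:

```python
def conv_size(n, kernel):
    """Unpadded, stride-1 convolution along one dimension."""
    return n - kernel + 1

def pool_size(n, window):
    """Non-overlapping pooling along one dimension."""
    return n // window

size = 32                  # 32x32 input image
size = conv_size(size, 5)  # conv 5x5 -> 28x28
size = pool_size(size, 2)  # pool 2x2 -> 14x14
size = conv_size(size, 5)  # conv 5x5 -> 10x10
size = pool_size(size, 2)  # pool 2x2 -> 5x5
# The 5x5 feature maps are then flattened into the fully connected layers.
```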

Here's an image of a cat:

Our picture of the cat has a height of 320px, a width of 400px, and 3 channels of color (RGB).

**Convolutional Layer**

So what does it look like after one layer of convolution?

Here is the cat with a kernel size of 3x3 and 3 filters (if we had more than 3 filters we couldn't plot a 2D image of the cat; higher-dimensional cats are notoriously tricky to deal with).

As you can see, the cat is really noisy because all of our weights are randomly initialized and we haven't trained the network. Oh, and they're all on top of each other, so even if there were detail in each layer we wouldn't be able to see it. But we can make out areas of the cat that are the same color, like the eyes and the background. What happens if we increase the kernel size to 10x10?

As we can see, we lose some detail because the kernel is too big. Also note that the image is slightly smaller because of the larger kernel, since math governs these things.

What happens if we squish it down a bit so we can see the color channels better?

Much better! Now we can see some of the things our filter is seeing. It looks like red really likes the black bits of the nose and eyes, and blue is digging the light grey that outlines the cat. We can start to see how the layer captures some of the more important details in the photo.

If we increase the kernel size, it's far more obvious now that we get less detail, though the image is also smaller than the others.

**Add an Activation Layer**

We get rid of a lot of the not-blue-ness by adding a ReLU.

**Adding a Pooling Layer**

We add a pooling layer (getting rid of the activation just makes it a bit easier to show):

As expected, the cat is blockier, but we can go even blockier!

Notice how the image is now about a third the size of the original.

**Activation and Max Pooling**

**LeNet Cats**

What do the cats look like if we put them through the convolutional and pooling sections of LeNet?

**Conclusion**

Convnets are powerful due to their ability to extract the core features of an image and use those features to identify images that contain features like them. Even with our own two-layer CNN, we can start to see that the network is paying a lot of attention to regions like the whiskers, nose, and eyes of the cat. These are the types of features that would allow a CNN to differentiate a cat from a bird, for example.

CNNs are remarkably powerful, and while these visualizations aren't perfect, I hope they can help people like myself, who are still learning, to reason about convnets a little better.

All code is on GitHub: https://github.com/erikreppel/visualizing_cnns

Follow me on Twitter, I'm @programmer (yes, seriously).

**Further Resources**

Andrej Karpathy's CS231n

*A guide to convolution arithmetic for deep learning* by Vincent Dumoulin and Francesco Visin