Many people now think that neural networks resemble the mechanisms of the human brain. Perhaps some of the brain's mechanisms are similar, but the brain must be a more complex system: it does not run very fast, yet it can make sense of the world. Intuitively, then, the brain should be a knowledge base plus a fast index plus a cascade of recognition algorithms, where the cascading exists to guarantee speed.
But we do not really have to imitate the structure of the human brain, because artificial intelligence should eventually surpass the brain in every respect by a hundredfold, just as we learned to fly not by copying an eagle's wings but through aerodynamics.
I think the most important mechanism of the human brain is meta-reasoning: the smallest set of reasoning abilities from which more accurate and more powerful reasoning can be derived. Memory storage and perceptual recognition systems are, of course, peripherals. Why is Sherlock Holmes so smart? Three things, really. First, experience: his knowledge base stores a great deal, and he knows what he knows and what he does not. Second, the index over that knowledge base is fast and complete: from a single observation he can quickly associate to the principle behind it. Third, his senses are sharp. That is what makes Sherlock Holmes.
Having thought about all this, when I saw a paper that blends decision forests with convolutional neural networks, I felt it was getting closer to the right direction.
This post is a set of notes on that paper, which is work from Microsoft Research; the reference is given at the end.
Two completely different models
A decision forest and a convolutional neural network are completely different kinds of model. A convolutional neural network performs dense layer-to-layer computation: it is a layered structure that can be viewed as a graph. A decision forest, on the other hand, works by having each node decide which child node to send the data to; that is, it routes data through a tree-shaped hierarchy.
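To make the contrast concrete, here is a minimal sketch in NumPy (entirely hypothetical, not from the paper): a dense CNN-style layer lets every input unit contribute to every output unit, while a decision-tree node only decides where to send the whole sample.

```python
import numpy as np

def dense_layer(x, W, b):
    """CNN-style computation: every input unit contributes to every output unit."""
    return np.maximum(0.0, W @ x + b)          # ReLU(Wx + b)

def tree_route(x, split_feature, threshold):
    """Decision-tree node: the sample itself is routed to exactly one child."""
    return "left" if x[split_feature] < threshold else "right"

x = np.random.randn(8)
W, b = np.random.randn(4, 8), np.zeros(4)
print(dense_layer(x, W, b))                           # dense, layer-to-layer arithmetic
print(tree_route(x, split_feature=2, threshold=0.0))  # pure data routing
```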
However, some neural networks can also be thought of in terms of trees, such as GoogLeNet, the dual-GPU version of AlexNet, and even DeepID, whose hidden layer is connected to multiple earlier layers.
A Variant of the CNN
Because of ReLU, a significant portion of the hidden-layer activations in a CNN are 0. So when the correlation between layers is measured, the picture looks like the following.
On the left is the original network; in the middle is the correlation between layer 1 and layer 2 (note that correlation is not the same thing as the weight parameters). The whiter a dot, the closer the correlation is to 0. Rearranging the rows and columns gives the picture on the right: only a few fairly structured rectangular blocks are strongly correlated.
The weakly correlated connections can then be deleted, which prunes the network. After pruning, the network structure looks like the following.
As you can see, a sparsely connected CNN also has a data-routing effect: the data is filtered and sent to different nodes.
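As a rough sketch of what measuring that correlation and pruning could look like, suppose we have collected the ReLU activations of two adjacent layers over a batch of samples (the random activations and the 0.1 threshold below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for ReLU activations of two adjacent layers over 1000 samples;
# many entries are exactly 0 because of the ReLU.
acts1 = np.maximum(0.0, rng.standard_normal((1000, 16)))   # layer 1: 16 units
acts2 = np.maximum(0.0, rng.standard_normal((1000, 8)))    # layer 2: 8 units

# Correlation between every unit in layer 1 and every unit in layer 2.
corr = np.corrcoef(acts1.T, acts2.T)[:16, 16:]             # shape (16, 8)

# Keep only strongly correlated connections; pruning the rest is what yields
# the sparse, block-structured network described above.
mask = np.abs(corr) > 0.1
print("fraction of connections kept:", mask.mean())
```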
Unifying the Representations
As the title suggests, the paper wants to unify the two models, and the first step is to unify how they are represented.
First look at how CNN networks are represented.
Here P is a linear transformation, and the vertical wavy line after P is a nonlinear transformation; the nonlinearities include sigmoid, ReLU, dropout, and so on.
Then there is the way the tree is represented.
In this diagram, I is the identity matrix and S is a matrix that selects a subset; both I and S can be seen as special cases of a linear transformation. P_R is a routing node that outputs a series of probabilities determining which child node the data is sent to. It can operate in single-best mode (send to one child), multi-way mode, or soft-routing mode (send to all children).
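A minimal NumPy sketch of those pieces as I read them: the routing node P_R is a perceptron followed by a softmax, and the selection matrix S is just slicing (a sparse 0/1 linear transformation). The shapes and names are my assumptions, not the paper's code.

```python
import numpy as np

def routing_node(x, W_r):
    """P_R: a perceptron plus softmax yields one probability per child node."""
    logits = W_r @ x
    p = np.exp(logits - logits.max())
    return p / p.sum()

def select(x, idx):
    """S: selecting a subset of dimensions is a special (sparse 0/1) linear map."""
    return x[idx]

x = np.random.randn(6)
p = routing_node(x, np.random.randn(3, 6))   # probabilities over 3 children
single_best = int(np.argmax(p))              # single-best mode: send to one child
soft = p                                     # soft-routing mode: send to all, weighted by p
print(p, single_best, select(x, [0, 2, 4]))
```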
Computation savings from explicit data routing
Explicit data routing comes in two forms: data transfer inside a decision tree, and the subset selection mentioned above. When the next step is computed only on a subset of the original data, the amount of computation naturally drops correspondingly.
Implicit data-routing
For example, if the filters are split into two groups, the connections between filters can be reduced: instead of 100% × 100%, it becomes 2 × 50% × 50%, i.e. half the computation. It is also much easier to parallelize.
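A back-of-the-envelope check of that 2 × 50% × 50% claim, counting multiplies for one convolutional layer with made-up sizes:

```python
# Multiply count for one conv layer (hypothetical sizes), before and after
# splitting the filters into two groups.
c_in, c_out, k, h, w = 64, 64, 3, 32, 32

full    = c_in * c_out * k * k * h * w                    # every filter sees every channel
grouped = 2 * (c_in // 2) * (c_out // 2) * k * k * h * w  # 2 groups x 50% in x 50% out

print(full, grouped, grouped / full)                      # ratio comes out to 0.5
```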
Backpropagation
Now, putting CNNs and decision trees together: for implicit data routing, backpropagation works directly; for subset selection, an extra parameter s is added and backpropagation also works directly. Backpropagation is a big problem, however, for a single-best tree (which sends data to only one child). So the decision tree has to use soft routing (send data to all children) during training, and switch to single-best routing at test time.
For example, for the following neural network:
The squared-error loss is used, where v* is the ground-truth label.
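A minimal sketch of why soft routing makes the whole thing trainable by backpropagation: a toy model with one router and two leaf predictors, trained with the squared-error loss against v*. The architecture and sizes are made up; only the soft-vs-hard routing idea is from the paper.

```python
import torch
import torch.nn as nn

class SoftRoutedTree(nn.Module):
    """Toy routed model: a router chooses between two leaf predictors.
    Training uses soft routing (a differentiable weighted sum over children);
    at test time only the single best path needs to be evaluated."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.router = nn.Linear(d_in, 2)
        self.leaves = nn.ModuleList([nn.Linear(d_in, d_out) for _ in range(2)])

    def forward(self, x, hard=False):
        p = torch.softmax(self.router(x), dim=-1)                     # routing probabilities
        outs = torch.stack([leaf(x) for leaf in self.leaves], dim=1)  # (B, 2, d_out)
        if hard:                                                      # single-best path (test time)
            idx = p.argmax(dim=-1)
            return outs[torch.arange(x.size(0)), idx]
        return (p.unsqueeze(-1) * outs).sum(dim=1)                    # soft routing (training)

model = SoftRoutedTree(10, 3)
x, v_star = torch.randn(4, 10), torch.randn(4, 3)
loss = ((model(x) - v_star) ** 2).mean()        # squared-error loss against the true label v*
loss.backward()                                 # gradients flow through the router as well
```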
Experiment: a toy experiment on VGG's last layer
The ImageNet 1000-class dataset is used.
The last fully connected layer and the softmax layer of VGG are replaced with a tree structure.
- The routing function is a perceptron, and its output is normalized with a softmax to give probabilities
- Since the data is sent to four child nodes in the diagram and the identity matrix is used, I guess the output vector is simply sliced into pieces. If data is sent to only one child, this greatly reduces the amount of computation; if it is sent to all children, the computation is similar to a fully connected layer. The paper does not explain this (a sketch of my reading follows below).
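Here is a sketch of my reading of that bullet, with hypothetical sizes (a 4096-d VGG feature sliced across 4 children, and the 1000 classes split evenly among them); the paper may well do something different.

```python
import torch
import torch.nn as nn

feat_dim, n_children, n_classes = 4096, 4, 1000   # hypothetical sizes

router   = nn.Linear(feat_dim, n_children)        # perceptron router
children = nn.ModuleList(
    [nn.Linear(feat_dim // n_children, n_classes // n_children) for _ in range(n_children)]
)

x = torch.randn(2, feat_dim)                      # VGG features for 2 images
p = torch.softmax(router(x), dim=-1)              # probability per child
slices = x.chunk(n_children, dim=-1)              # "identity matrix" read as plain slicing
logits = [child(s) for child, s in zip(children, slices)]

best = p.argmax(dim=-1)                           # single-best routing: only one child's slice
print(best, logits[0].shape)                      # would actually need to be evaluated
```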
The results are as follows:
- This shows the tradeoff between test time and accuracy.
- The parameter varied along each curve is the number of routes, i.e. how many child nodes the data is sent to (my guess).
- It can be seen that the accuracy gain is sublinear in the increase in time, so as the branching of the tree increases, more time can be saved while keeping the drop in accuracy within an acceptable range.
Tree-like convolutional neural network
The ImageNet 1000-class dataset is used.
The experiment above only validates the benefit of combining trees and CNNs at the last layer; the effect of using a branching structure in the convolutional layers has not yet been tested. The network structure used this time is as follows:
The network is based on the assumption that each filter should only convolve over some, but not all, of the input feature maps.
Therefore, the network divides the filters into groups, and the filters of each group process only the feature maps of a specific group in the previous layer. It can also be seen that from layer 3 upward, the odd-numbered layers have 2n-2 filter groups, and the even-numbered layers have the same number of filter groups as the odd-numbered layers.
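For reference, grouped filters like this can be expressed with the `groups` argument of PyTorch's `nn.Conv2d`. The channel counts and the group schedule below are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

# Illustrative tree-structured conv stack: filters split into more and more
# groups with depth, so each filter only sees a subset of the feature maps.
net = nn.Sequential(
    nn.Conv2d(3,   64, 3, padding=1),             # layer 1: ordinary dense convolution
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, 3, padding=1, groups=2),   # filters in 2 groups
    nn.ReLU(inplace=True),
    nn.Conv2d(128, 256, 3, padding=1, groups=4),  # filters in 4 groups
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 3, 32, 32)
print(net(x).shape)   # torch.Size([1, 256, 32, 32])
```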
Training parameters:
- Because global pooling after the last convolutional layer significantly reduces the number of parameters with minimal impact on accuracy, global pooling is applied between the last convolutional layer and the fully connected layer.
- The parameter initialization method is the same as in reference [9] of the paper.
- Learning rate decay:
- When the accuracy on the validation set levels off, the learning rate is decayed by a factor of 10; this is done twice.
- This model is trained for twice as many iterations as VGG-11.
- Data preprocessing uses only mirroring and random crops (a rough sketch of this training setup follows below).
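A rough sketch of that training setup in PyTorch; the crop size, optimizer hyperparameters, and epoch count are placeholders, and `ReduceLROnPlateau` is just one way to express the "decay by 10× when validation accuracy levels off" rule.

```python
import torch
import torchvision.transforms as T

# Data preprocessing: only mirroring and random crops (crop size is a placeholder).
train_transform = T.Compose([
    T.RandomHorizontalFlip(),
    T.RandomCrop(224),
    T.ToTensor(),
])

model = torch.nn.Linear(10, 10)                 # stand-in for the actual network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max", factor=0.1)

for epoch in range(90):
    # ... training pass elided ...
    val_accuracy = 0.0                          # would come from the validation set
    scheduler.step(val_accuracy)                # decays the LR by 10x when accuracy plateaus
```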
The results are as follows:
As can be seen from the figure, this network outperforms all the other networks except GoogLeNet.
Experiments on CIFAR-10
- NIN is used as the baseline model; to simplify it, the first layer of the NIN model is changed from 192 filters to 64 filters.
- The model structure is learned automatically through Bayesian search. The objective optimized during the search is:
- Alpha=
- The learned network is as follows:
- For comparison, the NIN network is also optimized with Bayesian search.
I am not yet familiar with Bayesian search; it requires reading the original paper.
The results are as follows, where:
- Diamonds represent CNNs without data routing
- The original NIN is shown in red
- NIN with truncated optimization is shown in pink
- Circles represent CNNs with data routing
- The 300 networks with data routing are shown in gray
- Green indicates the optimal solution.
Combining CNNs
Another form of data routing is combining CNNs: two CNNs are combined, and a router decides how the data flows between them. For example:
Both branches use GoogLeNet, but at test time the upper path does not use oversampling while the lower path uses 10× oversampling.
By oversampling I think they mean image crops: the four corners plus the center, and the four corners plus the center again after mirroring.
It can be seen that in this way the computation can be roughly halved while preserving the accuracy.
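For what it's worth, that guess at 10× oversampling matches what torchvision calls `TenCrop` (four corners plus center, each also mirrored); a small sketch with a placeholder image:

```python
import torch
import torchvision.transforms as T
from PIL import Image

# 10-crop oversampling: four corners + center, plus the horizontal mirror of each.
ten_crop = T.Compose([
    T.Resize(256),
    T.TenCrop(224),                                            # tuple of 10 PIL crops
    T.Lambda(lambda crops: torch.stack([T.ToTensor()(c) for c in crops])),
])

img = Image.new("RGB", (300, 300))                             # placeholder image
batch = ten_crop(img)
print(batch.shape)                                             # torch.Size([10, 3, 224, 224])
# At test time the 10 predictions would be averaged (the lower GoogLeNet branch);
# the upper branch classifies a single crop.
```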
Tips
On the GPU, because of data-transfer time, reducing the total amount of computation does not translate linearly into reduced running time. Grouping the filters, however, also reduces the volume of data transferred, which further improves speed.
There are currently two kinds of parallelism:
- Parallelization of matrix operations (BLAS)
- Data parallelism (mini-batches)
Summary
The paper explores various combinations of trees and CNNs, mainly:
- Turning the hidden layer and the output layer into a tree-like structure
- Grouping the convolution filters, so that the convolutional layers become a tree structure under the assumption that each filter in the next layer only needs to operate on a subset of the channels
- Learning the network structure through Bayesian search (I find this very interesting)
- Treating CNNs as black boxes and combining different CNNs into a tree-like structure
References
[1] Ioannou Y, Robertson D, Zikic D, et al. Decision Forests, Convolutional Networks and the Models in-between. arXiv preprint arXiv:1603.01250, 2016.