Recently at DEF CON 2018, a prestigious event in the global security field, GeekPwn's Las Vegas station held the CAAD CTF invitational, in which six teams of top AI scholars and researchers from home and abroad competed in a capture-the-flag contest built around adversarial-example attacks. The TSAIL team from Tsinghua University, represented by Pang Tianyu and Du Chao, won the competition; other key members included Dong Yinpeng and Wei Xingxing. The team's main research area is machine learning.

Last year, the same team won the championship in all three tracks (targeted attack, non-targeted attack, and defense) of the NIPS adversarial-examples competition, defeating more than 100 teams from Stanford, Johns Hopkins University, and other world-renowned institutions — an important step for the robustness and secure application of AI models.

In this year's CAAD CTF, players had to launch targeted adversarial-example attacks against images assigned to a randomly matched opposing team, while defending against adversarial examples from the other groups. The attacks were completely black-box: teams could not obtain any information about an opponent's model, such as its loss function, architecture, or input-output sample pairs.

Adversarial attacks

An adversarial example is a sample to which an attacker adds imperceptible noise so that a deep learning model makes a prediction error. In the well-known panda image shown above, the attacker adds a tiny noise perturbation; the human eye can barely tell the difference, yet the model misclassifies the image as a gibbon with very high confidence. With the large-scale deployment of machine learning, such errors matter greatly for system security, and the CAAD competition aims to explore how to strengthen system robustness against adversarial examples.

*Ian Goodfellow's 2014 demonstration of adversarial examples; the example was generated by an algorithm called FGSM.*

In general, adversarial attacks can be divided into white-box and black-box attacks, and into targeted and non-targeted attacks. In a white-box attack, the attacker has full access to the attacked model: knowing its architecture and parameters, the attacker crafts adversarial examples that are guaranteed to deceive it. In a black-box attack, the attacker can only observe the inputs and outputs of the attacked model — for example, when attacking a machine learning model through its API — and must construct adversarial examples from the observed input-output pairs.

The CAAD CTF required a targeted attack under an even stricter, fully black-box setting. In a targeted attack, the attacker wants the adversarial example to push the target system toward one specific category; for instance, we might want to build an adversarial example that makes an image recognition system classify an image as "puppy". In the fully black-box setting, the attacker must design an image perturbation that deceives the system without any knowledge of it. In the CAAD CTF, players could access neither each other's model structures and parameters nor the input-output samples of the attacked systems.

Currently the most popular attack methods are gradient-based and iterative methods, and many other advanced attacks build on their main ideas. The core idea is to find a small perturbation that maximizes the change in the loss function, so that adding this tiny perturbation to the original input causes the model to misclassify it. It is straightforward to compute the derivative of the loss function with respect to the input via backpropagation; by following this derivative so as to increase the loss, the attacker finds the optimal perturbation direction and constructs an adversarial example that deceives the deep network.

Take, for example, the Fast Gradient Sign Method (FGSM) proposed by Goodfellow et al. in 2014. Let θ denote the model parameters, x and y the input and label, and J(θ, x, y) the loss function used to train the neural network. We can linearly approximate the loss function in a neighborhood of the current value and obtain the optimal max-norm-constrained perturbation:

η = ε · sign(∇_x J(θ, x, y))

Adding this optimal perturbation to the original "panda" input makes the system misclassify it as "gibbon". FGSM computes gradients quickly by backpropagation and finds the small perturbation η that increases the model's loss. Variants such as the Basic Iterative Method (BIM) apply FGSM repeatedly with a smaller step size, producing better-performing adversarial examples.
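The two attacks above can be sketched in a few lines of NumPy. This is a minimal illustration on a toy linear-softmax classifier, not the CNNs used in the competition; the model, function names, and step sizes are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def input_grad(W, b, x, y):
    # Gradient of the cross-entropy loss w.r.t. the input x for a
    # linear-softmax model p = softmax(W @ x + b):
    #   dJ/dx = W^T (p - one_hot(y))
    p = softmax(W @ x + b)
    p[y] -= 1.0
    return W.T @ p

def fgsm(W, b, x, y, eps):
    # One-step FGSM: x_adv = x + eps * sign(grad_x J)
    return x + eps * np.sign(input_grad(W, b, x, y))

def bim(W, b, x, y, eps, alpha, steps):
    # Basic Iterative Method: repeated small FGSM steps, clipped
    # back into the eps-ball around the original input.
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(input_grad(W, b, x_adv, y))
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv
```

For a deep network, `input_grad` would be replaced by a backpropagated gradient; the sign-step-and-clip structure stays the same.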

Of course, white-box attacks alone would not have much impact; what is alarming is the transferability of adversarial examples, and this is what makes a targeted black-box attack like the CAAD CTF feasible. A transfer attack means that although we do not know the target's model, parameters, or training set, we can train our own models on similar datasets and construct adversarial examples against them; thanks to transferability, those examples are likely to deceive the unknown target model as well.

In 2016, Yanpei Liu and other researchers presented an ensemble-based attack, showing that when an adversarial example can fool an ensemble of several known models, it has a very high probability of fooling unknown models too. The TSAIL team also said the ensemble approach was important in the actual competition: they built adversarial examples against an ensemble of common convolutional neural networks such as Inception v4, ResNet, and DenseNet. Because ensembling significantly improves the transferability of adversarial examples, they could complete the attack without acquiring any information about the target system.
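The ensemble idea can be sketched by averaging input gradients over several local surrogate models before taking an FGSM step. Again this uses toy linear-softmax surrogates as a stand-in for the CNN ensemble; all names here are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ensemble_fgsm(surrogates, x, y, eps):
    # Average the input gradients of several local surrogate models
    # (each a linear-softmax pair (W, b) here), then take one FGSM
    # step; an example that fools the whole ensemble transfers to
    # unseen target models more often.
    grad = np.zeros_like(x)
    for W, b in surrogates:
        p = softmax(W @ x + b)
        p[y] -= 1.0
        grad += W.T @ p          # dJ/dx for this surrogate
    return x + eps * np.sign(grad / len(surrogates))
```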

In addition, the TSAIL team said they strengthened their adversarial attacks with momentum. Adversarial attacks are an important proxy for evaluating robustness before deploying a deep learning model, yet most existing attacks have a low probability of successfully fooling a black-box model. To solve this problem, they proposed a broad class of momentum-based iterative algorithms to enhance attacks. By integrating momentum into the attack iterations, the method obtains a more stable update direction, avoids poor local maxima during iteration, and produces more transferable adversarial examples. To further improve the black-box success rate, they applied the momentum iterative algorithm to an ensemble of models, demonstrating that even robustly trained models with strong defenses were nearly helpless against their black-box attacks.
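The momentum idea behind this can be sketched as follows: the L1-normalized gradient is accumulated into a velocity term, stabilizing the update direction across iterations. A minimal sketch on the same toy linear-softmax model as before; parameter names and defaults are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mi_fgsm(W, b, x, y, eps, mu=1.0, steps=10):
    # Momentum iterative attack: accumulate the L1-normalized
    # gradient into a velocity g, so the update direction stays
    # stable across iterations instead of oscillating.
    alpha = eps / steps
    g = np.zeros_like(x)
    x_adv = x.copy()
    for _ in range(steps):
        p = softmax(W @ x_adv + b)
        p[y] -= 1.0
        grad = W.T @ p                        # dJ/dx for the toy model
        g = mu * g + grad / (np.abs(grad).sum() + 1e-12)
        x_adv = np.clip(x_adv + alpha * np.sign(g), x - eps, x + eps)
    return x_adv
```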

Adversarial defense

The TSAIL team at Tsinghua University has also focused on building models that are more robust to adversarial examples; in the competition, the team had to defend against adversarial examples from other contestants while launching targeted attacks on other models. The TSAIL lab has proposed two defense methods, both of which modify the loss function for better robustness.

In the paper Towards Robust Detection of Adversarial Examples, Pang Tianyu and other researchers show that a DNN classifier can be forced to map all normal samples to nearby locations on a low-dimensional manifold, so that when the model receives an adversarial example it can easily distinguish it from normal samples. They propose a loss function called reverse cross-entropy (RCE) and show that minimizing RCE during training encourages the deep neural network to learn hidden representations that separate adversarial examples from normal samples.

The researchers use a figure to show why mapping normal samples to neighborhoods on a low-dimensional manifold resists adversarial examples. Here non-ME denotes the normalized non-maximal entropy: after removing the most probable prediction, it computes the information entropy of the predicted probabilities of the remaining classes, a metric used to detect adversarial examples in place of the raw softmax confidence. As shown in Figure 1a, relative to the classification boundaries in the last hidden layer of the neural network, non-ME constrains the normal samples to cluster together.

*The three black solid lines in Figure 1a are the decision boundaries of the classifier, and the blue dashed lines are the contours of non-ME = t. Panels b and c are t-SNE visualizations of the last hidden-layer vectors of ResNet-32 models trained on CIFAR-10; b uses the ordinary cross-entropy loss, while c uses RCE.*

As shown above, z_0 is an original normal sample, mapped between the blue contours near the other samples of its class. When no adversarial-example detection metric is used, z_1, which lies near the decision boundary relative to z_0, can be a very successful adversarial example. But when non-ME is used as the detection metric, z_1 is easily filtered out, because it does not lie near the real samples. In that case, a successful adversarial example has to appear at position z_2, where the classification boundary coincides with the boundary of the neighborhood of real samples.
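The non-ME metric described above can be sketched directly from a predicted probability vector. This is an illustrative reading of the metric (function name and the exact normalization are assumptions), not the paper's reference implementation.

```python
import numpy as np

def non_me(probs):
    # Drop the most probable class, renormalize the remaining
    # predicted probabilities, and return their entropy. Under RCE
    # training, normal samples give near-uniform non-maximal
    # probabilities (high non-ME), while adversarial examples tend
    # to score lower and can be thresholded out.
    p = np.asarray(probs, dtype=float)
    rest = np.delete(p, p.argmax())
    rest = rest / rest.sum()
    return float(-np.sum(rest * np.log(rest + 1e-12)))
```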

The researchers say that to achieve this effect in the last hidden layer, the RCE loss function must be used during training. The RCE loss is shown below, where r_y denotes the reverse label vector: its entry for the true label y is set to zero and every other class entry is 1/(L-1). F(x) is the model's predicted distribution, so RCE measures the cross-entropy between the reverse label vector and the prediction:

L_RCE(x, y) = − Σ_i (r_y)_i · log F_i(x)

By minimizing the RCE loss during training, the network encourages the classifier to return high confidence on the correct class and a uniform distribution over the wrong classes. This further drives the classifier to cluster normal samples of the same class on a low-dimensional manifold, i.e. to separate normal samples from adversarial examples in the last hidden layer of the neural network. In addition, the new loss function provably converges and can be trained with ordinary SGD, just like the cross-entropy loss.
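The RCE loss defined above is simple to compute; a minimal sketch (the helper name is an assumption):

```python
import numpy as np

def rce_loss(probs, y):
    # Reverse cross-entropy between the reverse label vector r_y
    # (0 at the true class y, 1/(L-1) everywhere else) and the
    # predicted distribution F(x).
    p = np.asarray(probs, dtype=float)
    r = np.full(len(p), 1.0 / (len(p) - 1))
    r[y] = 0.0
    return float(-np.sum(r * np.log(p + 1e-12)))
```

Note that for a fixed probability on the true class, the loss is smallest when the remaining mass is spread uniformly over the wrong classes, matching the behavior described above.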

In another paper, Max-Mahalanobis Linear Discriminant Analysis Networks, researchers from Tsinghua University's TSAIL team explored a second way to defend against adversarial examples. Specifically, they define a special Gaussian mixture distribution, the Max-Mahalanobis distribution (MMD), and theoretically prove that if the input distribution is an MMD, then linear discriminant analysis (LDA) is extremely robust to adversarial examples.

Based on this finding, they proposed the MM-LDA network. In short, the network maps the complex input data distribution to a latent feature space that follows a Max-Mahalanobis distribution, and uses LDA to make the final prediction. The key to the network, then, is understanding the Max-Mahalanobis distribution and why it can defend against adversarial examples.

As shown above, Max-Mahalanobis distributions for different numbers of classes L are presented, where the μ's are the means of the different Gaussian components, located at the vertices of the figures. The variance of each Gaussian in the MMD is fixed to the unit variance, with no other special requirements. The means μ, however, must satisfy one condition: the distance between the two nearest μ's must be as large as possible, so that the different class distributions are spread as far apart as possible.

When the number of classes L is 3, we want to constrain the last layer of the neural network to map normal samples of class i to the distribution N(z | μ_i, I), where μ_1, μ_2, and μ_3 should be spread out so as to approximate an equilateral triangle. Formally, we maximize the minimal distance between the means, i.e. max{min(d_12, d_13, d_23)}, where d_12 denotes the distance between μ_1 and μ_2.
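Means with this equal-spacing property can be sketched via a regular-simplex construction. The paper derives its own algorithm for generating the optimal means; the generic construction below (names and the norm parameter C are assumptions) is only meant to convey the geometry.

```python
import numpy as np

def mm_means(L, C):
    # Center the standard basis vectors of R^L to get the vertices
    # of a regular (L-1)-simplex, normalize them, and scale each
    # mean to norm sqrt(C). Every pair of means is then equally far
    # apart, which maximizes the minimal pairwise distance for
    # means of a fixed norm.
    v = np.eye(L) - 1.0 / L                  # e_i minus the centroid
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    return np.sqrt(C) * v
```

For L = 3 this yields three means forming an equilateral triangle, exactly the max{min(d_12, d_13, d_23)} configuration described above.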

Because the MMD keeps the class means far apart, the neural network maps each class to a Gaussian distribution far away from the others, so the final prediction can be made quickly by linear discriminant analysis. Overall, the MM-LDA network first uses a deep network to map the input data x to a latent feature representation z, forcing the distribution p(z) to follow an MMD, and then applies LDA on z to make predictions.

To force the last layer of the neural network to follow the Max-Mahalanobis distribution, we need to constrain the conditional distribution of the labels to:

p(y = k | z) = π_k · N(z | μ*_k, I) / Σ_j π_j · N(z | μ*_j, I)

The prior probability π and mean μ* of each class are predefined according to the MMD, and predicting categories with this LDA-style rule rather than the usual softmax layer is exactly what introduces the Max-Mahalanobis distribution. During training, minimizing the cross-entropy between the labels and the model prediction P(y | z(x; θ)) drives z to approximately follow the MMD. In addition, since the whole approach only modifies the loss function, it can be applied directly to different deep models for better robustness.
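The LDA prediction rule above has a simple closed form: under a Gaussian mixture with shared identity covariance, the posterior is a softmax over linear discriminants. A minimal sketch (function name assumed):

```python
import numpy as np

def lda_posterior(z, means, priors):
    # Class posterior under a Gaussian mixture with shared identity
    # covariance: p(y = k | z) is a softmax over the linear
    # discriminants  mu_k . z - ||mu_k||^2 / 2 + log pi_k,
    # since the quadratic term ||z||^2 cancels across classes.
    means = np.asarray(means, dtype=float)
    scores = means @ z - 0.5 * (means ** 2).sum(axis=1) + np.log(priors)
    e = np.exp(scores - scores.max())
    return e / e.sum()
```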

The TSAIL team of Tsinghua University's AI research institute

Beyond these two studies on robust machine learning systems, the lab conducts a great deal of research across AI security. Besides attacks and defenses for image recognition, AI security also covers attacks and defenses on image scene segmentation, video classification, and text and image data; this competition touches only a relatively small part of the field. The lab has also done extensive research on probabilistic machine learning, such as Bayesian machine learning, with results reflected in the open-source library ZhuSuan.

The team has advanced technology in explainable AI, leading in AI decision making, AI understanding, and AI security. In 2017, team members took first place worldwide in all three tracks of the Google-hosted NIPS adversarial attacks and defenses competition; won first prize in the Kaggle Data Science Bowl 2017 (a 500,000 USD bonus); and won first place in the image Chinese captioning track of the Sinovation Ventures AI Challenger 2017. In 2018 they won first place in the ViZDoom robot deathmatch competition. The ZhuSuan Bayesian deep learning platform developed by the group has had a wide influence in the international AI and machine learning communities.

Through adversarial attack and defense, Tsinghua University's TSAIL team has once again won first place in the CAAD attack competition.