Privacy and machine learning


Internet privacy has recently become a matter of public concern. First came Facebook's data scandal, after which Mark Zuckerberg swapped his usual gray T-shirt and jeans for a dark blue suit to attend a congressional hearing. Then Baidu's Li Yanhong claimed that Chinese users are willing to trade privacy for convenience and efficiency, drawing strong resentment from netizens, who pointed out that sacrificing privacy may buy not efficiency but harm: searching for a Putian-affiliated hospital can cost you your health, and big-data price discrimination can cost you money. More recently, mandatory data-collection clauses have driven some users to uninstall apps, while many domestic apps simply refuse to work unless the relevant permissions are granted, which leaves users in a real dilemma. Why do so many applications collect data? The book "A Brief History of the Future" offers an answer: in the future, data is king, and humans may be little more than oversized ants whose purpose is to generate data. With enough data and the right algorithms, a great deal becomes possible, and all of it relates to machine learning, deep learning, and data science. People worry that once their data is collected it will be leaked or misused, so protecting private data is one of the main issues companies must consider. This article analyzes the relationship between privacy and machine learning and introduces the PATE framework, which prevents models from passively revealing users' private data. Let's take a look.

In many machine learning applications, such as machine learning for medical diagnosis, it is desirable to have algorithms that do not memorize users' sensitive information (such as the specific medical histories of individual patients) while still performing their tasks. Differential privacy is a widely recognized privacy-protection model: by adding carefully calibrated noise, it protects the individuals behind released data, so that even an attacker who already knows everything else in the data set still cannot infer the remaining piece of information about a given person. With differential privacy, machine learning algorithms can be designed to train models responsibly on private data. A series of works by the group (Martín Abadi, Úlfar Erlingsson, and others) studies how differential privacy can help machine learning researchers better protect privacy. This article explains how privacy and machine learning can work together productively.

The team's latest work is Private Aggregation of Teacher Ensembles (PATE), published at ICLR 2018. One of its important contributions is that any researcher who knows how to train a supervised machine learning model can use it to obtain differential privacy for machine learning. The PATE framework achieves private learning by carefully coordinating the activities of several different machine learning models. As long as the procedure specified by the PATE framework is followed, the resulting model provides privacy protection.


Why do we need privacy-preserving machine learning algorithms?

Machine learning algorithms work by processing large amounts of data and updating their parameters to encode the relationships in that data. Ideally, we hope these models learn general patterns (such as "patients who smoke are more likely to have heart disease") rather than facts about specific training examples (such as "Jane Smith has heart disease"). Unfortunately, machine learning algorithms do not learn to ignore such details by default. If you use machine learning to solve an important task, such as building a cancer diagnosis model, and then publish that model (for example, as an open-source cancer diagnosis model), you may unintentionally reveal information about the training set; a malicious attacker may be able to extract Jane Smith's private information from the published model. This is where differential privacy comes in.


How to define and protect privacy?

Scientists have proposed many methods to provide privacy protection when analyzing data. A popular approach is to remove private details, or replace them with random values, before analyzing the data; details such as phone numbers and zip codes are typically anonymized this way. However, anonymized data is not always sufficient: when an attacker obtains auxiliary information about individuals in the data set, the privacy provided by such anonymization drops sharply. Defining and protecting privacy is therefore difficult, because it is hard to estimate the range of information an attacker might obtain.

Differential privacy is a model for privacy protection. Its basic idea is to randomize part of a mechanism's behavior in order to provide privacy. The intuition behind introducing randomness into a learning algorithm is to make it difficult to tell, from the trained model, how any individual's data influenced the algorithm's behavior.

The version of differential privacy used in this article requires that the probability of learning any particular set of parameters stays roughly the same whether or not the training set contains a given example. The reasoning is that if a single patient's data (say, Jane Smith's) does not affect the outcome of learning, then her data cannot have been memorized by the model and her privacy is protected. The bound on how much these probabilities may differ is called the privacy budget, and a smaller privacy budget corresponds to stronger privacy protection.
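
To make the role of randomness and the privacy budget concrete, here is a minimal sketch (not part of PATE) of the classic Laplace mechanism applied to a counting query; the function name and the toy records are purely illustrative:

import numpy as np

def laplace_count(records, predicate, epsilon):
    """Differentially private count of records matching `predicate`.

    Adding or removing one record changes a count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1/epsilon yields epsilon-differential privacy
    for this query; a smaller epsilon (privacy budget) means more noise."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Toy data: how many patients in the data set have heart disease?
patients = [{"name": "Jane Smith", "heart_disease": True},
            {"name": "John Doe", "heart_disease": False}]
print(laplace_count(patients, lambda r: r["heart_disease"], epsilon=0.5))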

As the figure above illustrates, differential privacy is achieved when an attacker cannot distinguish the answers produced by a randomized algorithm trained on the data of three users from the answers produced by the same algorithm trained on the data of only two of those users.

What is the intuition behind PATE?

The PATE method provides privacy for machine learning based on a simple intuition: if two different classifiers, trained on two data sets that share no training examples, agree on the label of a new input, then that decision reveals no information about any individual training example. Because the label predicted by each model on its own may still leak some of the private information contained in its training data, the output of each model cannot be published separately. For example, suppose Jane Smith's data contributes to the training data of only one of the two models, that model predicts that a patient whose record is very similar to Jane Smith's has cancer, and the other model predicts the opposite; the disagreement would reveal Jane Smith's private information. This example also shows why adding randomness to the algorithm is necessary for it to provide any meaningful privacy guarantee.

How does PATE work?

In PATE, the private data set is first partitioned into disjoint subsets, as shown in the figure below. If Jane Smith's record is in the private data set, it appears in exactly one partition. A machine learning model, referred to as a teacher model, is trained on each partition. How the teacher models are trained is unconstrained, which is one of PATE's main advantages: all teachers solve the same machine learning task, but they are trained independently. In particular, only one teacher has ever analyzed Jane Smith's record.
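
A minimal, purely illustrative sketch of this partitioning step, assuming numpy arrays for the private data and scikit-learn logistic-regression teachers (PATE itself places no constraint on how the teachers are trained):

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_teachers(X, y, nb_teachers=5, seed=0):
    """Split (X, y) into disjoint partitions and train one teacher per partition."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))                  # shuffle once, then cut into disjoint blocks
    partitions = np.array_split(order, nb_teachers)
    teachers = []
    for idx in partitions:                           # each record (e.g. Jane Smith's) lands in exactly one partition
        teachers.append(LogisticRegression(max_iter=1000).fit(X[idx], y[idx]))
    return teachers

# Toy private data, used only to make the sketch runnable.
X = np.random.default_rng(1).normal(size=(500, 10))
y = (X[:, 0] > 0).astype(int)
teachers = train_teachers(X, y, nb_teachers=5)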

How does this ensemble of independently trained teacher models guarantee privacy? In PATE, noise is added when the teachers' predictions are aggregated into a consensus. To label an input, the number of teachers voting for each class is counted, and random noise from a Laplacian or Gaussian distribution is added to perturb these counts. When two classes receive the same number of votes, the noise effectively picks one of them at random. On the other hand, when most teachers agree on the same class, adding noise does not change which class receives the most votes. Perturbing the vote counts with Laplacian (or similar) noise is therefore what protects privacy.
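
A minimal sketch of this noisy aggregation, reusing the illustrative scikit-learn teachers from the previous sketch; the Laplace inverse-scale gamma is an arbitrary placeholder, not a recommended value:

import numpy as np

def noisy_aggregate(teachers, x, nb_classes=2, gamma=1.0, seed=None):
    """Return the noisy-max label for a single input x (a 1-D feature vector)."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(nb_classes)
    for teacher in teachers:
        votes[int(teacher.predict(x.reshape(1, -1))[0])] += 1                    # count each teacher's vote
    votes = votes + rng.laplace(loc=0.0, scale=1.0 / gamma, size=nb_classes)     # perturb the counts
    return int(np.argmax(votes))                                                 # the noise also breaks ties at random

label = noisy_aggregate(teachers, X[0])    # label one example with the ensemble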

The aggregation process can be illustrated with a two-class medical diagnosis task. Suppose Jane Smith has cancer, and consider the output of the aggregation mechanism. The red model in the figure below is the only teacher trained on Jane Smith's data, so it may well predict that a record similar to Jane's belongs to a cancer patient. As the figure shows, two teachers vote "cancer" and the remaining two vote "healthy". Adding random noise to the vote counts prevents the aggregated result from reflecting the vote of any individual teacher, which protects privacy.
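
Continuing the sketch above with this four-teacher example, a tied vote of two against two shows how the noise alone decides the outcome (the noise scale here is again an arbitrary placeholder):

import numpy as np

votes = np.array([2.0, 2.0])                            # index 0 = "healthy", index 1 = "cancer"
noisy = votes + np.random.laplace(0.0, 10.0, size=2)    # noise that is large relative to the 4 votes
print(["healthy", "cancer"][int(np.argmax(noisy))])     # either label is roughly equally likely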

This framework has two limitations. First, each prediction made by the aggregation mechanism increases the total privacy budget spent. Second, the ensemble of teacher models cannot be published or open-sourced, because an attacker could inspect the published model parameters to learn about the private training data. This is why a student model is created.

The student model is trained in a privacy-preserving way by transferring the knowledge acquired by the teachers. The student selects inputs from a set of unlabeled public data and submits them to the teacher ensemble to obtain labels through the noisy aggregation mechanism; the student is then trained on this labeled data.
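
Continuing the illustrative sketches above, a minimal student step might look like the following; the pool of public inputs and the choice of a logistic-regression student are assumptions of this sketch, not part of the reference implementation:

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_student(teachers, public_X, nb_classes=2, gamma=1.0):
    # Every call to noisy_aggregate spends privacy budget, so the number of
    # labeled public examples should be kept small.
    labels = np.array([noisy_aggregate(teachers, x, nb_classes, gamma) for x in public_X])
    return LogisticRegression(max_iter=1000).fit(public_X, labels)   # only the student is ever published

public_X = np.random.default_rng(2).normal(size=(100, 10))   # toy unlabeled public data
student = train_student(teachers, public_X)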

PATE makes privacy and learning work in synergy

Both the privacy guarantee and the correctness of the labels produced by the aggregation mechanism stem from the consensus reached among the teachers. When most teachers vote for the same class, adding noise does not change which class receives the most votes, and in that case the aggregation mechanism provides a very strong privacy guarantee.

Differential privacy also aligns well with the goals of machine learning. For example, memorizing a particular training example, such as Jane Smith's record, during training is not only a violation of privacy; it is also a form of overfitting that reduces the model's ability to generalize.

This observation prompted an improvement that makes the aggregation mechanism selective: the teachers only answer some of the students' queries. When a query arrives, the mechanism first checks whether the level of consensus among the teachers is high enough. If the top vote count exceeds a set threshold, the student's query is answered; otherwise it is rejected. The threshold check is itself randomized to keep the selection process private. The figure below shows the process:
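
A minimal sketch of this selective ("confident") aggregation, in the spirit of the Confident GNMax mechanism; the Gaussian noise scales and the threshold below are illustrative placeholders rather than values from the paper:

import numpy as np

def confident_aggregate(votes, threshold, sigma_check=1.0, sigma_answer=1.0, seed=None):
    """votes: per-class vote counts from the teachers for one student query.

    Returns a label only when a noisy check of the top vote count clears the
    threshold; otherwise returns None and the query is rejected."""
    rng = np.random.default_rng(seed)
    if votes.max() + rng.normal(0.0, sigma_check) < threshold:                 # randomized consensus check
        return None
    noisy_votes = votes + rng.normal(0.0, sigma_answer, size=votes.shape)      # Gaussian-noised answer
    return int(np.argmax(noisy_votes))

print(confident_aggregate(np.array([9.0, 1.0]), threshold=7.0))   # strong consensus: usually answered
print(confident_aggregate(np.array([5.0, 5.0]), threshold=7.0))   # weak consensus: usually rejected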

In the experiments, the privacy budget is spent on two things: selecting which queries to answer and answering them. Because there is a high degree of consensus among the teachers, the budget spent on deciding whether to answer a query is very small. The figure below shows the performance of the original mechanism (Simple GNMax) and the optimized mechanism (Confident GNMax) as a function of the number of student queries; the data-dependent analysis uses the moments accountant and Rényi Differential Privacy (RDP).

How do machine learning researchers use PATE to improve models?

Two factors mainly affect the strength of the privacy protection provided:


  1. Consensus among teachers: the stronger the consensus, the smaller the privacy budget spent on outputting each label.

  2. Number of student queries: each time a student requests a label, the budget the teachers spend generating that label is added to the total privacy cost, so the student should be trained with as few teacher queries as possible to strengthen the privacy protection.

Strengthening the consensus among teachers means training a larger number of teacher models, each on a smaller partition of the data. Improving the accuracy and generalization ability of these models helps improve the consensus.

The PATE framework is open source and available in the TensorFlow models repository. Use the following commands to clone the code and set up the environment:


cd

git clone https://github.com/tensorflow/models

cd models

export PYTHONPATH=$(pwd):$PYTHONPATH

cd research/differential_privacy/multiple_teachers


The first step is to train the teacher models. The following commands train 250 teachers on the MNIST handwritten digits data set:


python train_teachers.py --nb_teachers=250 --teacher_id=0 --dataset=mnist

python train_teachers.py --nb_teachers=250 --teacher_id=1 --dataset=mnist

...

python train_teachers.py --nb_teachers=250 --teacher_id=248 --dataset=mnist

python train_teachers.py --nb_teachers=250 --teacher_id=249 --dataset=mnist


After training, load these teacher models and apply the aggregation mechanism to supervise the training of the student model:


python train_student.py --nb_teachers=250 --dataset=mnist --stdnt_share=1000 --lap_scale=20 --save_labels=True


Here Laplacian noise of scale 1/20 (--lap_scale=20) is introduced in the aggregation mechanism, and after training the teachers' vote counts are saved to the file /tmp/mnist_250_student_clean_votes_lap_20.npy.

Run the analysis script to compute the differential-privacy bound obtained for the student model, setting noise_eps to 2/lap_scale:


python analysis.py --counts_file=/tmp/mnist_250_student_clean_votes_lap_20.npy --max_examples=1000 --delta=1e-5 --noise_eps=0.1 --input_is_counts


The above settings reproduce the PATE framework; interested readers can find all of the code on GitHub.


Conclusion

In machine learning applications, privacy should be regarded as an ally, not an enemy. As the technology improves, differential privacy is likely to become an effective regularization tool that produces better-behaved models. And with the PATE framework, machine learning researchers can protect their training data even without any expertise in privacy protection.
