I spent a day on the LIME paper (http://arxiv.org/pdf/1602.04938v1.pdf): a careful reading of the paper and the code, plus some experiments. I now broadly understand the authors' design ideas.
Background:
When we build a model, we often worry that it is not stable enough. Is it affected by sample bias? Will it overfit when p >> n? To check stability we usually run cross-validation and see whether the variance of the evaluation metrics is large. However, if the data was biased from the start because of how it was sampled, the model differs from the real situation and cross-validation cannot reveal it, because every fold shares the same bias. p >> n causes similar problems, especially in text mining. When there are not many features, and particularly for models such as logistic regression, we usually print the model weights and check whether they agree with human experience. The LIME paper gives a text classification case: predicting whether a piece of text is about atheism or Christianity. The classifier predicts that the text is about atheism, but the features that drive the decision do not agree with human experience, so the model is not convincing. When those features are removed, the prediction is reversed. We can also significantly reduce the model's performance by manually constructing text made up of those features and adding it to the prediction experiment.
LIME explanation principle:
LIME stands for Local Interpretable Model-agnostic Explanations. Its purpose is to explain the behavior of a model on the sample being predicted, in a way that is understandable to humans and independent of the model, without looking inside the model. The authors propose a local, not global, method: randomly sample around each predicted sample to produce a set of neighboring samples. In the paper's illustration, the red "X" is the sample being predicted, and the surrounding "*" and round points are the sampled neighbors.
The sampling mechanism is to randomly remove several of the features of the original sample. For example, for the text a = "My girlfriend very much likes to see the wonderful Flower said", the generated samples can be "I very much like to see the wonderful Flower said", "My girlfriend see the wonderful Flower said", and so on. Each generated sample gets a weight relative to the original sample, computed as W = exp(-d^2 / theta^2), where d is the distance between the two and theta is the kernel width. For text, cosine distance can be used to measure the distance between samples.
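To make the weight formula concrete, here is a minimal sketch, assuming the samples are already encoded as 0/1 vectors. The function names and the default theta = 0.25 are my own choices for illustration, not taken from the LIME code.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two non-zero vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

def kernel_weight(d, theta=0.25):
    """W = exp(-d^2 / theta^2); theta=0.25 is an assumed kernel width."""
    d = np.asarray(d, dtype=float)
    return np.exp(-(d ** 2) / theta ** 2)

original = [1, 1, 1, 1, 1]   # all five words present
near = [1, 0, 1, 1, 1]       # one word removed
far = [1, 0, 0, 0, 1]        # three words removed

for name, sample in [("near", near), ("far", far)]:
    d = cosine_distance(original, sample)
    print(name, "distance:", round(float(d), 4), "weight:", round(float(kernel_weight(d)), 4))
```

The more words are removed, the larger the distance and the smaller the weight, so samples close to the original dominate the later regression.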
The __data_labels_distances function in lime_text.py handles this step for text classification. Its main job is to generate the neighboring samples of the predicted sample, together with their weights and the probabilities that the current classifier predicts for them. Each sampled sample is represented as a bag of words, for example [0,1,0,0,1]. Note that at this point the sampled features are not high-dimensional: the maximum length is only the length of the predicted sample.
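The step described above can be sketched roughly as follows. This is not the code from lime_text.py: the function name sample_neighborhood, its parameters, and the toy classifier are hypothetical, and for simplicity the sketch indexes word positions, not vocabulary entries. It assumes the text has at least two words.

```python
import numpy as np

def sample_neighborhood(text, classifier_fn, num_samples=500, seed=0):
    """Return (binary samples, classifier probabilities, cosine distances).

    classifier_fn is assumed to map a list of strings to one probability each.
    """
    rng = np.random.default_rng(seed)
    words = text.split()
    n = len(words)
    # Row 0 is the original sample with every word present.
    data = np.ones((num_samples, n), dtype=int)
    texts = [text]
    for i in range(1, num_samples):
        # Remove between 1 and n-1 words so that no sample is empty.
        n_remove = rng.integers(1, n)
        removed = rng.choice(n, size=n_remove, replace=False)
        data[i, removed] = 0
        texts.append(" ".join(w for w, keep in zip(words, data[i]) if keep))
    probs = np.asarray(classifier_fn(texts), dtype=float)
    # Cosine distance between each binary row and the all-ones original row.
    norms = np.linalg.norm(data, axis=1) * np.sqrt(n)
    distances = 1.0 - data.sum(axis=1) / norms
    return data, probs, distances

def toy_classifier(texts):
    # Hypothetical stand-in for a real model: high probability
    # whenever the word "girlfriend" is present.
    return [0.9 if "girlfriend" in t.split() else 0.1 for t in texts]

data, probs, distances = sample_neighborhood(
    "my girlfriend likes to watch the show", toy_classifier, num_samples=5)
print(data)
print(probs)
print(distances.round(3))
```

Each row of data has as many columns as the original text has words, which is why the sampled representation stays low-dimensional.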
Now we have the sampled samples, their weights, and their predicted probabilities. What do we do next? Remember that the goal is to explain how the classifier works on this predicted sample. Put simply: which features does the classifier rely on for this sample? We can fix a value K in advance and look only at the top K features (with too many, people cannot take them in). Since this is a feature selection problem, we can fit a weighted regression model on the sampled samples. Before fitting the final regression, we first select the K important features. How? There are three options: take the features with the largest weights in a trained regression model; use forward search, adding at each step the feature that maximizes the R^2 of the regression model; or use the lasso_path method. Note that the samples do not all carry the same weight. The feature selection code in the LIME source shows the details.
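As a rough illustration of the forward search option, here is a sketch that greedily adds the feature giving the largest weighted R^2. It uses plain weighted least squares in NumPy and is not the LIME implementation; the function names are hypothetical, and it assumes the target y is not constant.

```python
import numpy as np

def weighted_r2(X, y, w):
    """R^2 of a weighted least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    sw = np.sqrt(w)
    # Scaling rows by sqrt(w) turns weighted least squares into ordinary least squares.
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    resid = y - A @ coef
    y_mean = np.average(y, weights=w)
    ss_res = np.sum(w * resid ** 2)
    ss_tot = np.sum(w * (y - y_mean) ** 2)
    return 1.0 - ss_res / ss_tot

def forward_selection(X, y, w, k):
    """Greedily pick up to k column indices of X, best weighted R^2 first."""
    selected = []
    for _ in range(min(k, X.shape[1])):
        best_j, best_score = None, -np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            score = weighted_r2(X[:, selected + [j]], y, w)
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

# Synthetic example: only columns 2 and 4 influence the target.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6)).astype(float)
y = 3.0 * X[:, 2] - 2.0 * X[:, 4] + 0.01 * rng.standard_normal(200)
w = rng.uniform(0.1, 1.0, size=200)
print(forward_selection(X, y, w, k=2))
```

Forward search costs on the order of K times the number of features in regression fits, which is acceptable here only because the sampled representation is as short as the text being explained.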
After selecting the K features, we fit a weighted regression model on the sampled samples restricted to those K features. The K features and their weights in the fitted regression model are the explanation of the classifier's prediction for this sample. In the LIME source this is done by the explain_instance_with_data function.
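The final step can be sketched as a weighted ridge regression solved in closed form. This is only an approximation of the idea, not the body of the function in the LIME source; the function name, the alpha parameter, and the feature names are assumptions made for the example.

```python
import numpy as np

def explain_with_weighted_ridge(X, y, w, feature_ids, names, alpha=1.0):
    """Fit a weighted ridge model on the selected columns.

    Returns ([(feature name, coefficient), ...] sorted by |coefficient|, intercept).
    """
    Xk = X[:, feature_ids].astype(float)
    # Centre with weighted means so that the intercept is not penalised.
    x_mean = np.average(Xk, axis=0, weights=w)
    y_mean = np.average(y, weights=w)
    Xc = Xk - x_mean
    yc = y - y_mean
    # Closed form: (Xc^T W Xc + alpha I)^-1 Xc^T W yc
    XtW = Xc.T * w
    coef = np.linalg.solve(XtW @ Xc + alpha * np.eye(len(feature_ids)), XtW @ yc)
    intercept = y_mean - x_mean @ coef
    explanation = sorted(
        zip([names[j] for j in feature_ids], coef.tolist()),
        key=lambda pair: abs(pair[1]),
        reverse=True,
    )
    return explanation, float(intercept)

# Synthetic example: classifier probability driven by two of five words.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 5)).astype(float)
y = 0.2 + 0.4 * X[:, 1] - 0.3 * X[:, 3]
w = rng.uniform(0.1, 1.0, size=300)
names = ["w0", "w1", "w2", "w3", "w4"]
explanation, intercept = explain_with_weighted_ridge(X, y, w, [1, 3], names, alpha=1e-6)
print(explanation, round(intercept, 3))
```

A positive coefficient means the presence of the word pushes the prediction toward the explained class, a negative one pushes it away.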
In general terms, the method above corresponds to the algorithm description in the paper.
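For reference, the objective behind that description, as far as I can reconstruct it from the paper, is written below. Here f is the classifier being explained, g is the interpretable (linear) model, z is a sampled sample, z' is its binary representation, pi_x is the proximity weight (the W above, with theta written as sigma), and Omega penalizes the complexity of g, which is where the limit of K features comes in.

```latex
\xi(x) = \operatorname*{argmin}_{g \in G} \; \mathcal{L}(f, g, \pi_x) + \Omega(g)

\mathcal{L}(f, g, \pi_x) = \sum_{z, z' \in Z} \pi_x(z) \, \bigl( f(z) - g(z') \bigr)^2,
\qquad
\pi_x(z) = \exp\!\left( -\frac{D(x, z)^2}{\sigma^2} \right)
```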
Summary: The discussion above revolves around text classification and is based on the bag-of-words representation of text. Embedding-based text representations are also feasible: the word removal mechanism is the same, except that each sampled sample has to be converted to vector form before its classification probability is computed.
The method can also be extended to many other areas, such as credit risk control. When our model predicts that a behavior is risky, we need to explain to our analysts and customer service staff why it is risky and which features of the behavior the model picked up on. For such cases the neighborhood sampling mechanism around the predicted sample may need to be redesigned. In many scenarios the features are not discrete or binary but continuous, and tree models such as random forests are particularly well suited to continuous variables. How should we sample in this situation? One simple method is to discretize the continuous features and one-hot encode them, so that the sampling mechanism is the same as the one LIME uses for text classification models. Another is to set the sampled features to 0 exactly as is done for text, regardless of whether they are continuous.
On the whole, LIME's approach to model explanation is relatively simple, while the paper's description is somewhat complex (why describe something so simple in such a complicated way?). The paper analyzes the effectiveness of LIME mostly from an experimental point of view, with little theoretical analysis, which is not very reassuring (it makes one wonder what pitfalls the method has). After all, the experiments are based on particular samples: is the method still effective in more complex scenarios? Most experiments are on text and image scenarios: does it work in other areas?
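The discretization idea mentioned in the summary can be sketched as follows: bin a continuous feature by quantiles and one-hot encode the bin, so that each bin becomes a binary feature that can be switched off like a word. The function name and the choice of four quantile bins are assumptions for illustration only.

```python
import numpy as np

def discretize_one_hot(values, n_bins=4):
    """Quantile-bin a continuous feature and one-hot encode the bin index."""
    values = np.asarray(values, dtype=float)
    # Interior cut points, e.g. the 25%, 50% and 75% quantiles for 4 bins.
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    # side="right" puts a value equal to an edge into the upper bin.
    bin_ids = np.searchsorted(edges, values, side="right")
    one_hot = np.eye(n_bins, dtype=int)[bin_ids]
    return bin_ids, one_hot, edges

amounts = [1, 2, 3, 4, 5, 6, 7, 8]   # made-up values of a continuous feature
bin_ids, one_hot, edges = discretize_one_hot(amounts, n_bins=4)
print(edges)
print(bin_ids)
print(one_hot)
```

After this encoding, a sample has exactly one active bin per original feature, and a neighbor can be generated by zeroing that bin, just as a word is removed from a text.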
An open question: why can the feature weights of a weighted regression fitted on the samples drawn around the predicted sample represent how the original model behaves on that sample?
LIME: Are the model's predictions trustworthy?