Reprint ︱ Case study: feature selection based on the greedy algorithm (R language implementation)
A greedy algorithm is one that, while solving a problem, always makes the choice that looks best at the moment. In other words, instead of considering global optimality, it commits to a locally optimal solution at each step. A greedy algorithm does not yield the globally optimal solution for every problem; the key lies in the choice of greedy strategy, which must be free of aftereffects: a future state depends only on the current state, not on how the process arrived at it.
Algorithm design:

Initialize the target value of the problem.

while (the constraint that realizes the optimization goal holds) {
    Obtain a feasible solution from the solution space using the filtering (greedy) strategy.
}

Combine all feasible solutions into the target solution.
options(warn = -1)
require(magrittr)
require(dplyr)
require(glmnet)
require(pROC)  # provides auc()

# Greedy algorithm
greedyAlgorithm = function(dataSet) {
  # Based on logistic regression, use AUC as the evaluation metric and a
  # greedy algorithm to screen features.
  #
  # Args:
  #   dataSet: a data frame that contains a feature "label"
  #
  # Returns:
  #   A vector of the selected features
  features = data.frame(name = colnames(dataSet)) %>%
    dplyr::filter(name != "label")  # all features of dataSet except "label"
  features = as.vector(features$name)
  featureSelect = c("label")   # init the vector of selected features
  scoreBefore = data.frame()   # stores the (feature, aucValue) pair kept at the end of each iteration
  # keep adding features while the best AUC is still improving
  while ((nrow(scoreBefore) < 2 ||
          scoreBefore[nrow(scoreBefore), 2] > scoreBefore[nrow(scoreBefore) - 1, 2]) &&
         nrow(scoreBefore) < length(features)) {
    score = data.frame()
    for (feature in features) {
      if (length(intersect(feature, featureSelect)) == 0) {
        trainData = dataSet[, append(featureSelect, feature)]
        model = glm(label ~ ., family = "binomial", data = trainData, epsilon = 1e-10)
        prediction = predict(model, trainData)
        aucValue = auc(trainData$label, prediction)
        score = rbind(score, data.frame(feature = feature, aucValue = aucValue))
      }
    }
    featureSelect = unique(append(featureSelect,
                                  as.character(score[which.max(score$aucValue), 1])))
    scoreBefore = rbind(scoreBefore, score[which.max(score$aucValue), ])
  }
  # drop the last feature, which failed the improvement condition of the iteration
  featureSelect = head(featureSelect, length(featureSelect) - 1)
  return(featureSelect[-1])  # return the selected features except "label"
}
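To see the function in action, one can run it on a small synthetic data set. This is a minimal sketch (the data and variable names are made up for illustration), assuming the packages above are attached:

set.seed(42)
# toy data: x1 and x2 carry signal, x3 is pure noise
toy = data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
toy$label = ifelse(toy$x1 + 0.5 * toy$x2 + rnorm(200) > 0, 1, 0)
greedyAlgorithm(dataSet = toy)  # informative features should be picked first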
The KS value characterizes a model's ability to distinguish positive from negative examples: the larger the value, the better the model separates the two classes. As a rule of thumb, KS > 0.3 is taken to indicate that the model discriminates reasonably well.
KS Value calculation method:
Sort all samples by predicted score from low to high and divide them into n groups. For each group, compute the number of good samples, the number of bad samples, the cumulative good-sample count, the cumulative bad-sample count, the cumulative good-sample rate, the cumulative bad-sample rate, and the difference between the two rates. Here a group's good (bad) sample count is the number of good (bad) samples falling in that group; the cumulative count is the running total through that group; the cumulative rate is the cumulative count divided by the total number of good (bad) samples; and the difference is the cumulative bad-sample rate minus the cumulative good-sample rate. The KS indicator is the maximum of the absolute value of this difference.
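Equivalently, writing $\mathrm{Bad}_k$ and $\mathrm{Good}_k$ for the cumulative bad- and good-sample counts through group $k$, and $B$ and $G$ for the total numbers of bad and good samples:

$$\mathrm{KS} = \max_{k}\left|\frac{\mathrm{Bad}_k}{B} - \frac{\mathrm{Good}_k}{G}\right|$$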
# ksvalue
ksvalue = function(prediction, n) {
  # Compute the KS curve of a model
  #
  # Args:
  #   prediction: a vector of model predictions
  #   n: the number of groups
  #
  # Returns:
  #   A vector of the absolute differences between the cumulative
  #   bad-sample rate and the cumulative good-sample rate
  dataResult = sort(prediction, decreasing = T)
  a = c()  # expected bad-sample count per group (sum of scores)
  b = c()  # expected good-sample count per group
  c = c()
  a[1] = 0
  b[1] = 0
  c[1] = 0
  if (length(dataResult) %% n == 0) {
    cut = length(dataResult) / n
    for (i in 2:(n + 1)) {
      a[i] = sum(dataResult[(cut * (i - 2) + 1):(cut * (i - 1))])
      b[i] = length(dataResult[(cut * (i - 2) + 1):(cut * (i - 1))]) - a[i]
    }
  } else {
    cut = round(length(dataResult) / n)
    for (i in 2:n) {
      a[i] = sum(dataResult[(cut * (i - 2) + 1):(cut * (i - 1))])
      b[i] = length(dataResult[(cut * (i - 2) + 1):(cut * (i - 1))]) - a[i]
    }
    # the last group takes whatever remains after the first n - 1 groups
    a[n + 1] = sum(dataResult[(cut * (n - 1) + 1):length(dataResult)])
    b[n + 1] = length(dataResult[(cut * (n - 1) + 1):length(dataResult)]) - a[n + 1]
  }
  c = abs(cumsum(a) / sum(a) - cumsum(b) / sum(b))
  return(c)
}
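As a quick check, the function can be applied to simulated scores; a hedged sketch, with invented inputs:

set.seed(1)
pred = runif(1000)            # stand-in model scores in [0, 1]
ks_curve = ksvalue(pred, 10)  # difference curve over 10 groups
max(ks_curve)                 # the KS statistic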
require(caret)
require(pROC)

data = read.csv("/data/workspace/rworkspace/data_test.csv", encoding = "UTF-8")
data %<>% mutate(label = ifelse(target > 30, 1, 0))
data = data[, -1]
# impute missing values with the column median
data = data.frame(apply(data, 2, function(x) ifelse(is.na(x), median(x, na.rm = T), x)))

# remove near-constant variables
# feature1 = nearZeroVar(data)
# data = data[, -feature1]

# remove highly correlated variables
# dataCor = cor(data)
# highCor = findCorrelation(dataCor, 0.8)
# data = data[, -highCor]

# feature selection using the greedy algorithm
# feature = greedyAlgorithm(dataSet = data)
# with a large amount of data the algorithm is time-consuming, so when generating
# this HTML the previously selected feature set is loaded directly
load("/data/workspace/rworkspace/featureselect.rdata")

set.seed(521)
ind = base::sample(2, nrow(data), replace = T, prob = c(0.7, 0.3))
trainData = data[ind == 1, ]
testData = data[ind == 2, ]
model = cv.glmnet(as.matrix(trainData[, feature]), trainData[, "label"],
                  family = "binomial", type.measure = "auc", alpha = 0,
                  lambda.min.ratio = 0.0001)
prediction = predict(model, as.matrix(testData[, feature]),
                     s = "lambda.min", type = "response")

# compute the KS value
ksvalue = ksvalue(prediction, 10)
par(mfrow = c(2, 1))
plot(density(ksvalue), type = 'l', main = "ksvalue plot",
     xlab = "cutpoint", ylab = "density_ks")
ks_value = max(ksvalue)
text(.2, 1.0, paste("ksvalue =", ks_value))
roc(testData$label, as.vector(prediction), auc = T, plot = T, print.auc = T)
##
## Call:
## roc.default(response = testData$label, predictor = as.vector(prediction), auc = T, plot = T, print.auc = T)
##
## Data: as.vector(prediction) in 5130 controls (testData$label 0) < 429 cases (testData$label 1).
## Area under the curve: 0.7385
par(mfrow=c(1,1))
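The classical definition of KS compares the empirical score distributions of the two classes directly, without grouping; since the grouped ksvalue above works from expected counts (sums of scores), the two numbers need not coincide exactly. A minimal sketch, assuming prediction and testData from the run above:

# KS as the largest gap between the score ECDFs of the two classes
scores_bad  = as.vector(prediction)[testData$label == 1]
scores_good = as.vector(prediction)[testData$label == 0]
cuts = sort(unique(as.vector(prediction)))
max(abs(ecdf(scores_bad)(cuts) - ecdf(scores_good)(cuts)))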