Reprint ︱ Case Study on feature selection based on greedy algorithm

Source: Internet
Author: User



A greedy algorithm (also called a greedy method) always makes the choice that looks best at the current step of solving a problem. In other words, it does not pursue the global optimum; instead it commits to a solution that is locally optimal in some sense. Greedy algorithms do not yield the globally optimal solution for every problem; the key lies in choosing the greedy strategy, and the chosen strategy must be free of after-effects: the process leading up to a state must not influence later states, so each decision depends only on the current state.

Algorithm design:

    1. Initialize the target value of the problem.

    2. while (the constraints of the optimization goal are not yet satisfied) {

      Obtain a feasible solution from the solution space using the greedy (filtering) strategy.

      }

    3. Combine all feasible solutions into the target solution.

options(warn = -1)
require(magrittr)
require(dplyr)
require(glmnet)

# Greedy algorithm
greedyAlgorithm = function(dataSet) {
  # Based on logistic regression, with AUC as the evaluation metric,
  # select features using a greedy algorithm.
  #
  # Args:
  #   dataSet: a data frame that contains a column named "label"
  #
  # Returns:
  #   A vector of selected features
  features = data.frame(name = colnames(dataSet)) %>%
    dplyr::filter(name != "label")   # all features of dataSet except "label"
  features = as.vector(features$name)
  featureSelect = c("label")         # init the vector of selected features
  scoreBefore = data.frame()         # stores the best (feature, aucValue) pair of each iteration
  while ((nrow(scoreBefore) < 2 ||
          scoreBefore[nrow(scoreBefore), 2] > scoreBefore[nrow(scoreBefore) - 1, 2]) &&
         nrow(scoreBefore) < length(features)) {
    score = data.frame()
    for (feature in features) {
      if (length(intersect(feature, featureSelect)) == 0) {
        trainData = dataSet[, append(featureSelect, feature)]
        model = glm(label ~ ., family = "binomial", data = trainData, epsilon = 1e-10)
        prediction = predict(model, trainData)
        aucValue = auc(trainData$label, prediction)   # auc() is assumed to come from the pROC package loaded later in this article
        score = rbind(score, data.frame(feature = feature, aucValue = aucValue))
      }
    }
    featureSelect = unique(append(featureSelect, as.character(score[which.max(score$aucValue), 1])))
    scoreBefore = rbind(scoreBefore, score[which.max(score$aucValue), ])
  }
  featureSelect = head(featureSelect, length(featureSelect) - 1)  # drop the last feature, which failed the stopping condition
  return(featureSelect[-1])  # return the selected features, excluding "label"
}
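
As a quick illustration, the sketch below calls greedyAlgorithm() on a small simulated data frame. The data frame, column names, and seed are made up for this example, and pROC is loaded explicitly here because greedyAlgorithm() is assumed to rely on its auc() function:

set.seed(1)
require(pROC)
toy = data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
toy$label = ifelse(toy$x1 + 0.5 * toy$x2 + rnorm(200) > 0, 1, 0)  # label depends mainly on x1 and x2
greedyAlgorithm(toy)  # typically returns the informative features, e.g. c("x1", "x2")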

The KS value characterizes the model's ability to separate positive and negative examples: the larger the value, the better the model discriminates. In general, KS > 0.3 is taken to indicate that the model discriminates reasonably well.

KS Value calculation method:

Sort all samples by predicted score from low to high and divide them into n groups. For each group, compute the number of good samples, the number of bad samples, the cumulative number of good samples, the cumulative number of bad samples, the cumulative good-sample rate, the cumulative bad-sample rate, and their difference. Here, the good (bad) sample count of a group is the number of good (bad) samples in that group; the cumulative good (bad) sample count is the number of good (bad) samples accumulated up to and including that group; the cumulative good (bad) sample rate is the cumulative good (bad) sample count divided by the total number of good (bad) samples; and the difference is the cumulative bad-sample rate minus the cumulative good-sample rate. The KS statistic is the maximum of the absolute value of this difference.
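
Written compactly (a restatement of the procedure above, with Good_j and Bad_j denoting the good and bad sample counts of group j out of the n groups):

KS = \max_{1 \le i \le n} \left| \frac{\sum_{j=1}^{i} \mathrm{Bad}_j}{\sum_{j=1}^{n} \mathrm{Bad}_j} - \frac{\sum_{j=1}^{i} \mathrm{Good}_j}{\sum_{j=1}^{n} \mathrm{Good}_j} \right|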

# KS value
ksValue = function(prediction, n) {
  # Compute the KS curve of a model
  #
  # Args:
  #   prediction: a vector of model predictions
  #   n: the number of groups
  #
  # Returns:
  #   A vector of the absolute differences between the cumulative bad-sample rate
  #   and the cumulative good-sample rate
  dataResult = sort(prediction, decreasing = T)
  a = c()   # per-group (expected) good-sample count, taken as the sum of predicted scores
  b = c()   # per-group (expected) bad-sample count
  c = c()
  a[1] = 0
  b[1] = 0
  c[1] = 0
  if (length(dataResult) %% n == 0) {
    cut = length(dataResult) / n
    for (i in 2:(n + 1)) {
      a[i] = sum(dataResult[(cut * (i - 2) + 1):(cut * (i - 1))])
      b[i] = length(dataResult[(cut * (i - 2) + 1):(cut * (i - 1))]) - a[i]
    }
  } else {
    cut = round(length(dataResult) / n)
    for (i in 2:n) {
      a[i] = sum(dataResult[(cut * (i - 2) + 1):(cut * (i - 1))])
      b[i] = length(dataResult[(cut * (i - 2) + 1):(cut * (i - 1))]) - a[i]
    }
    # the last group takes whatever remains after the first n - 1 equal-sized groups
    a[n + 1] = sum(dataResult[(cut * (n - 1) + 1):length(dataResult)])
    b[n + 1] = length(dataResult[(cut * (n - 1) + 1):length(dataResult)]) - a[n + 1]
  }
  c = abs(cumsum(a) / sum(a) - cumsum(b) / sum(b))
  return(c)
}
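
A minimal usage sketch (the score vector below is simulated, not taken from the article's data set; the group count of 10 matches the call made later in the article):

set.seed(42)
scores = runif(1000)            # hypothetical predicted probabilities
ks_curve = ksValue(scores, 10)  # per-group |cumulative bad rate - cumulative good rate|
max(ks_curve)                   # the KS statistic is the maximum of the curve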
require(caret)
require(pROC)
data = read.csv("/data/workspace/rworkspace/data_test.csv", encoding = "UTF-8")
data %<>% mutate(label = ifelse(target > 30, 1, 0))
data = data[, -1]
data = data.frame(apply(data, 2, function(x) ifelse(is.na(x), median(x, na.rm = T), x)))
# Remove near-constant variables
# feature1 = nearZeroVar(data)
# data = data[, -feature1]
# Drop highly correlated variables
# dataCor = cor(data)
# highCor = findCorrelation(dataCor, 0.8)
# data = data[, -highCor]
# Feature selection using the greedy algorithm
# feature = greedyAlgorithm(dataSet = data)
load("/data/workspace/rworkspace/featureselect.rdata")  # the algorithm is time-consuming on a large data set, so the previously selected features are loaded directly when generating the HTML
set.seed(521)
ind = base::sample(2, nrow(data), replace = T, prob = c(0.7, 0.3))
trainData = data[ind == 1, ]
testData = data[ind == 2, ]
model = cv.glmnet(as.matrix(trainData[, feature]), trainData[, "label"],
                  family = "binomial", type.measure = "auc", alpha = 0,
                  lambda.min.ratio = 0.0001)
prediction = predict(model, as.matrix(testData[, feature]), s = "lambda.min", type = "response")
# Compute the KS value
ksvalue = ksValue(prediction, 10)
par(mfrow = c(2, 1))
plot(density(ksvalue), type = 'l', main = "ksValue plot", xlab = "cutpoint", ylab = "density_ks")
ks_value = max(ksvalue)
text(.2, 1.0, paste("ksValue =", ks_value))
roc(testData$label, as.vector(prediction), auc = T, plot = T, print.auc = T)

## 
## Call:
## roc.default(response = testData$label, predictor = as.vector(prediction),     auc = T, plot = T, print.auc = T)
## 
## Data: as.vector(prediction) in 5130 controls (testData$label 0) < 429 cases (testData$label 1).
## Area under the curve: 0.7385
par(mfrow=c(1,1))

 

