Reprint ︱ Case Study on feature selection based on greedy algorithm

Source: Internet
Author: User



A greedy algorithm (also called a greedy method) always makes the choice that looks best at the current step of solving a problem. In other words, it does not pursue the global optimum; instead it commits to a solution that is locally optimal in some sense. Greedy algorithms do not yield the globally optimal solution for every problem; the key lies in choosing the greedy strategy, and the chosen strategy must be free of after-effects: the process leading up to a state must not influence later states, so each decision depends only on the current state.

Algorithm design:

    1. Initialize the target value of the problem.

    2. while (the constraints of the optimization goal are not yet satisfied) {

      Obtain a feasible solution from the solution space using the greedy (filtering) strategy.

      }

    3. Combine all feasible solutions into the target solution.

options(warn = -1)
require(magrittr)
require(dplyr)
require(glmnet)

# Greedy algorithm
greedyAlgorithm = function(dataSet) {
  # Based on logistic regression, with AUC as the evaluation metric,
  # select features using a greedy algorithm.
  #
  # Args:
  #   dataSet: a data frame that contains a column named "label"
  #
  # Returns:
  #   A vector of selected features
  features = data.frame(name = colnames(dataSet)) %>%
    dplyr::filter(name != "label")   # all features of dataSet except "label"
  features = as.vector(features$name)
  featureSelect = c("label")         # init the vector of selected features
  scoreBefore = data.frame()         # stores the best (feature, aucValue) pair of each iteration
  while ((nrow(scoreBefore) < 2 ||
          scoreBefore[nrow(scoreBefore), 2] > scoreBefore[nrow(scoreBefore) - 1, 2]) &&
         nrow(scoreBefore) < length(features)) {
    score = data.frame()
    for (feature in features) {
      if (length(intersect(feature, featureSelect)) == 0) {
        trainData = dataSet[, append(featureSelect, feature)]
        model = glm(label ~ ., family = "binomial", data = trainData, epsilon = 1e-10)
        prediction = predict(model, trainData)
        aucValue = auc(trainData$label, prediction)   # auc() is assumed to come from the pROC package loaded later in this article
        score = rbind(score, data.frame(feature = feature, aucValue = aucValue))
      }
    }
    featureSelect = unique(append(featureSelect, as.character(score[which.max(score$aucValue), 1])))
    scoreBefore = rbind(scoreBefore, score[which.max(score$aucValue), ])
  }
  featureSelect = head(featureSelect, length(featureSelect) - 1)  # drop the last feature, which failed the stopping condition
  return(featureSelect[-1])  # return the selected features, excluding "label"
}
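
As a quick illustration, the sketch below calls greedyAlgorithm() on a small simulated data frame. The data frame, column names, and seed are made up for this example, and pROC is loaded explicitly here because greedyAlgorithm() is assumed to rely on its auc() function:

set.seed(1)
require(pROC)
toy = data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
toy$label = ifelse(toy$x1 + 0.5 * toy$x2 + rnorm(200) > 0, 1, 0)  # label depends mainly on x1 and x2
greedyAlgorithm(toy)  # typically returns the informative features, e.g. c("x1", "x2")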

The KS value characterizes the model's ability to separate positive and negative examples: the larger the value, the better the model discriminates. In general, KS > 0.3 is taken to indicate that the model discriminates reasonably well.

KS Value calculation method:

Sort all samples by predicted score from low to high and divide them into n groups. For each group, compute the number of good samples, the number of bad samples, the cumulative number of good samples, the cumulative number of bad samples, the cumulative good-sample rate, the cumulative bad-sample rate, and their difference. Here, the good (bad) sample count of a group is the number of good (bad) samples in that group; the cumulative good (bad) sample count is the number of good (bad) samples accumulated up to and including that group; the cumulative good (bad) sample rate is the cumulative good (bad) sample count divided by the total number of good (bad) samples; and the difference is the cumulative bad-sample rate minus the cumulative good-sample rate. The KS statistic is the maximum of the absolute value of this difference.
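
Written compactly (a restatement of the procedure above, with Good_j and Bad_j denoting the good and bad sample counts of group j out of the n groups):

KS = \max_{1 \le i \le n} \left| \frac{\sum_{j=1}^{i} \mathrm{Bad}_j}{\sum_{j=1}^{n} \mathrm{Bad}_j} - \frac{\sum_{j=1}^{i} \mathrm{Good}_j}{\sum_{j=1}^{n} \mathrm{Good}_j} \right|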

# KS value
ksValue = function(prediction, n) {
  # Compute the KS curve of a model
  #
  # Args:
  #   prediction: a vector of model predictions
  #   n: the number of groups
  #
  # Returns:
  #   A vector of the absolute differences between the cumulative bad-sample rate
  #   and the cumulative good-sample rate
  dataResult = sort(prediction, decreasing = T)
  a = c()   # per-group (expected) good-sample count, taken as the sum of predicted scores
  b = c()   # per-group (expected) bad-sample count
  c = c()
  a[1] = 0
  b[1] = 0
  c[1] = 0
  if (length(dataResult) %% n == 0) {
    cut = length(dataResult) / n
    for (i in 2:(n + 1)) {
      a[i] = sum(dataResult[(cut * (i - 2) + 1):(cut * (i - 1))])
      b[i] = length(dataResult[(cut * (i - 2) + 1):(cut * (i - 1))]) - a[i]
    }
  } else {
    cut = round(length(dataResult) / n)
    for (i in 2:n) {
      a[i] = sum(dataResult[(cut * (i - 2) + 1):(cut * (i - 1))])
      b[i] = length(dataResult[(cut * (i - 2) + 1):(cut * (i - 1))]) - a[i]
    }
    # the last group takes whatever remains after the first n - 1 equal-sized groups
    a[n + 1] = sum(dataResult[(cut * (n - 1) + 1):length(dataResult)])
    b[n + 1] = length(dataResult[(cut * (n - 1) + 1):length(dataResult)]) - a[n + 1]
  }
  c = abs(cumsum(a) / sum(a) - cumsum(b) / sum(b))
  return(c)
}
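
A minimal usage sketch (the score vector below is simulated, not taken from the article's data set; the group count of 10 matches the call made later in the article):

set.seed(42)
scores = runif(1000)            # hypothetical predicted probabilities
ks_curve = ksValue(scores, 10)  # per-group |cumulative bad rate - cumulative good rate|
max(ks_curve)                   # the KS statistic is the maximum of the curve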
require(caret)
require(pROC)
data = read.csv("/data/workspace/rworkspace/data_test.csv", encoding = "UTF-8")
data %<>% mutate(label = ifelse(target > 30, 1, 0))
data = data[, -1]
data = data.frame(apply(data, 2, function(x) ifelse(is.na(x), median(x, na.rm = T), x)))
# Remove near-constant variables
# feature1 = nearZeroVar(data)
# data = data[, -feature1]
# Drop highly correlated variables
# dataCor = cor(data)
# highCor = findCorrelation(dataCor, 0.8)
# data = data[, -highCor]
# Feature selection using the greedy algorithm
# feature = greedyAlgorithm(dataSet = data)
load("/data/workspace/rworkspace/featureselect.rdata")  # the algorithm is time-consuming on a large data set, so the previously selected features are loaded directly when generating the HTML
set.seed(521)
ind = base::sample(2, nrow(data), replace = T, prob = c(0.7, 0.3))
trainData = data[ind == 1, ]
testData = data[ind == 2, ]
model = cv.glmnet(as.matrix(trainData[, feature]), trainData[, "label"],
                  family = "binomial", type.measure = "auc", alpha = 0,
                  lambda.min.ratio = 0.0001)
prediction = predict(model, as.matrix(testData[, feature]), s = "lambda.min", type = "response")
# Compute the KS value
ksvalue = ksValue(prediction, 10)
par(mfrow = c(2, 1))
plot(density(ksvalue), type = 'l', main = "ksValue plot", xlab = "cutpoint", ylab = "density_ks")
ks_value = max(ksvalue)
text(.2, 1.0, paste("ksValue =", ks_value))
roc(testData$label, as.vector(prediction), auc = T, plot = T, print.auc = T)

## 
## Call:
## roc.default(response = testData$label, predictor = as.vector(prediction),     auc = T, plot = T, print.auc = T)
## 
## Data: as.vector(prediction) in 5130 controls (testData$label 0) < 429 cases (testData$label 1).
## Area under the curve: 0.7385
par(mfrow=c(1,1))

 

