Feature selection is a process of extracting valuable features that have significant influence on the dependent variable. This is still an active field of research. In this post I compare a few feature selection algorithms: traditional GLM with regularization, the computationally demanding Boruta and the entropy-based filter from the FSelectorRcpp package (free of Java/Weka dependencies). Check out the comparison on a Venn diagram carried out on data from the RTCGA factory of R data packages.
I would like to thank Magda Sobiczewska and pbiecek for inspiration for this comparison. I had a chance to use Boruta and FSelectorRcpp in action; glmnet is here to improve the Venn diagram.
RTCGA data
Data used for the comparison come from RTCGA (http://rtcga.github.io/RTCGA/) and present genes' expressions (RNASeq) from the sequenced human genome. Datasets with RNASeq are available via the RTCGA.rnaseq data package and were originally provided by The Cancer Genome Atlas. It's a great set of over a thousand features (1 gene expression = 1 continuous feature) that might have influence on various aspects of human survival. Let's use the data for breast cancer (Breast invasive carcinoma / BRCA), where we'll try to find valuable genes that have impact on the dependent variable denoting whether a sample of the collected readings came from tumor or normal, healthy tissue.
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("RTCGA.rnaseq")
library(RTCGA.rnaseq)
BRCA.rnaseq$bcr_patient_barcode <- substr(BRCA.rnaseq$bcr_patient_barcode, 14, 14)
The dependent variable, bcr_patient_barcode, is the TCGA barcode from which we receive information on whether a sample of the collected readings came from tumor or normal, healthy tissue (the 14th character in the code).
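As a quick sanity check, a minimal sketch assuming BRCA.rnaseq has been transformed as above (in TCGA sample-type coding, tumor sample types are 01-09 and normal types are 10-19, so the 14th character is "0" for tumor and "1" for normal tissue):
# count samples per class of the new dependent variable
# ("0" = tumor, "1" = normal tissue in TCGA sample-type codes)
table(BRCA.rnaseq$bcr_patient_barcode)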
Check out another RTCGA use case: TCGA and The Curse of BigData.
glmnet
Logistic regression, a model from the generalized linear models (GLM) family and a first-attempt model for class prediction, can be extended with a regularization net to provide prediction and variable selection at the same time. We can assume that non-valuable features will appear with a coefficient equal to zero in the final model with the best regularization parameter. A broader explanation can be found in the vignette of the glmnet package. Below is the code I use to extract valuable features, with the extra help of cross-validation and parallel computing.
library(doMC)
registerDoMC(cores = 6)
library(glmnet)
# fit the model with 10-fold cross-validation (the default)
cv.glmnet(x = as.matrix(BRCA.rnaseq[, -1]),
          y = factor(BRCA.rnaseq[, 1]),
          family = "binomial",
          type.measure = "class",
          parallel = TRUE) -> cvfit
# extract names of features that have
# non-zero coefficients
names(which(coef(cvfit, s = "lambda.min")[, 1] != 0))[-1] -> glmnet.features
# first name is the intercept
Function coef extracts coefficients for the fitted model. Argument s specifies for which regularization parameter we would like to extract them; I used lambda.min, the parameter for which the cross-validated misclassification error is minimal. You can also try lambda.1se.
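For comparison, a minimal sketch of the same extraction with the more conservative lambda.1se (assuming cvfit fitted as above; the variable name glmnet.features.1se is mine):
# lambda.1se is the largest lambda within one standard error of the
# minimum CV error, usually yielding a sparser set of selected features
names(which(coef(cvfit, s = "lambda.1se")[, 1] != 0))[-1] -> glmnet.features.1se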
plot(cvfit)
A discussion about standardization for LASSO can be found here. I normally don't do this, since I work with streaming data, for which checking assumptions, model diagnostics and standardization are problematic and still a rapidly developing field of research.
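If you do want to control it explicitly, cv.glmnet exposes a standardize argument (TRUE by default; coefficients are reported back on the original scale); a hedged sketch:
# refit with explicit control over internal predictor standardization
cv.glmnet(x = as.matrix(BRCA.rnaseq[, -1]),
          y = factor(BRCA.rnaseq[, 1]),
          family = "binomial",
          type.measure = "class",
          standardize = TRUE) -> cvfit.std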
Venn Diagram Comparison of Boruta, FSelectorRcpp and glmnet Algorithms