"Kaggle" using random forest classification algorithm to solve biologial response problem


Kaggle, let's get going.

Kaggle competitions rely on machines for automated processing, so machine learning is almost a must-have skill. Getting started with Kaggle does not require deep machine learning expertise; a basic understanding of the common methods is enough: for a given problem, you should be able to recognize whether it is a classification problem or a regression problem, and understand why a machine can compute a classification result from the matrix you feed it.
In fact, sometimes what matters is simply being willing to take that first step; once you take it, there is a real difference between doing and not doing.
The hacker way is to learn through constant experimentation, so with machine learning, not practicing amounts to having done nothing at all.

Biological Response Competition Description

Predicting biological responses from the chemical properties of molecules.
The competition provides data in CSV format. Each row corresponds to a molecule: the first column records the actual biological response (class 0 or class 1), and the remaining columns are molecular features (such as size, shape, and elemental composition) obtained from molecular descriptors; the descriptors have already been normalized.
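To get a feel for the data, a quick peek with NumPy is enough (a minimal sketch, assuming the competition's train.csv has been downloaded into a Data/ folder, as in the scripts below):

    import numpy as np

    # first column is the 0/1 activity label, the rest are normalized molecular descriptors
    dataset = np.genfromtxt(open('Data/train.csv', 'r'), delimiter=',', dtype='f8')[1:]
    print("rows, columns: " + str(dataset.shape))
    print("class labels: " + str(np.unique(dataset[:, 0])))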

First Submission

The competition is a binary classification problem whose data has already been extracted and selected, which makes preprocessing easier. Although the competition is over, you can still submit a solution and see how you compare with the world's best data scientists.
Here I use the random forest algorithm to train and predict. Although random forest is a fairly sophisticated classifier, the sklearn library makes it easy to use.
We do not need to know the mathematical principles behind these techniques here; running experiments is enough to understand how the algorithm or tool works. Below is the program to run; the resulting file can then be submitted to Kaggle.

    from sklearn.ensemble import RandomForestClassifier
    from numpy import genfromtxt, savetxt

    def main():
        # create the training & test sets, skipping the header row with [1:]
        dataset = genfromtxt(open('Data/train.csv', 'r'), delimiter=',', dtype='f8')[1:]
        target = [x[0] for x in dataset]
        train = [x[1:] for x in dataset]
        test = genfromtxt(open('Data/test.csv', 'r'), delimiter=',', dtype='f8')[1:]

        # create and train the random forest
        # multi-core CPUs can use: rf = RandomForestClassifier(n_estimators=100, n_jobs=2)
        rf = RandomForestClassifier(n_estimators=100)
        rf.fit(train, target)

        # column 1 of predict_proba is the probability of class 1
        predicted_probs = [[index + 1, x[1]] for index, x in enumerate(rf.predict_proba(test))]

        savetxt('Data/submission.csv', predicted_probs, delimiter=',', fmt='%d,%f',
                header='MoleculeId,PredictedProbability', comments='')

    if __name__ == "__main__":
        main()
Evaluation and Cross-Validation

What if we wanted to use gradient tree boosting instead of the random forest algorithm, or a simpler linear model?
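Swapping in a different estimator is essentially a one-line change, since sklearn classifiers share the same fit/predict_proba interface. A minimal sketch (the parameters shown are only illustrative):

    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression

    # either of these could replace the random forest in the script above
    clf = GradientBoostingClassifier(n_estimators=100)   # gradient tree boosting
    # clf = LogisticRegression()                         # a simpler linear model
    # clf.fit(train, target) and clf.predict_proba(test) then work exactly as before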
In this process, it is easy to import a method from sklearn and generate a submission file, but how to compare the models' performance becomes the critical issue. Submitting to Kaggle after every adjustment to the model is impractical. So we will do two things in turn:

Define an evaluation function
Cross-validation

You always need an evaluation function to determine how well your model performs. Ideally, this evaluation function should measure performance in the same way as Kaggle's evaluation metric. For this competition, the evaluation metric is the log-loss function.
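For a binary problem with true labels $y_i \in \{0, 1\}$ and predicted probabilities $p_i$, the log-loss is

    \mathrm{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]

which is exactly what the function below computes, after clipping the predictions away from 0 and 1 so the logarithm stays finite.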

    import scipy as sp

    def llfun(act, pred):
        epsilon = 1e-15
        # clip predictions away from 0 and 1 so the log is finite
        pred = sp.maximum(epsilon, pred)
        pred = sp.minimum(1 - epsilon, pred)
        ll = sum(act * sp.log(pred) + sp.subtract(1, act) * sp.log(sp.subtract(1, pred)))
        ll = ll * -1.0 / len(act)
        return ll
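A quick sanity check with made-up values: confident correct predictions give a small loss, while confident wrong predictions are heavily penalized.

    print(llfun([1, 0], [0.9, 0.1]))   # ~0.105
    print(llfun([1, 0], [0.5, 0.5]))   # ~0.693
    print(llfun([1, 0], [0.1, 0.9]))   # ~2.303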

Finally, we need data to test our model against. For our first submission, Kaggle compared our predictions with the actual outcomes using the log-loss function; but without the test labels, how can we evaluate the model locally?
Cross-validation is a workaround.

Cross-validation is a simple technique that holds out part of the training data for testing. sklearn provides built-in tools for generating cross-validation sets.
In the following code, 5 cross-validation folds are built; in each fold, 20% of the training data is held out to test the algorithm's results.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn import cross_validation
    import logloss   # the llfun defined above, saved as logloss.py
    import numpy as np

    def main():
        # read in data, parse into training and target sets
        dataset = np.genfromtxt(open('Data/train.csv', 'r'), delimiter=',', dtype='f8')[1:]
        target = np.array([x[0] for x in dataset])
        train = np.array([x[1:] for x in dataset])

        # in this case we'll use a random forest, but this could be any classifier
        cfr = RandomForestClassifier(n_estimators=100)

        # simple K-fold cross validation, 5 folds
        cv = cross_validation.KFold(len(train), k=5, indices=False)

        # iterate through the training and test cross validation segments and
        # run the classifier on each one, aggregating the results into a list
        results = []
        for traincv, testcv in cv:
            probas = cfr.fit(train[traincv], target[traincv]).predict_proba(train[testcv])
            results.append(logloss.llfun(target[testcv], [x[1] for x in probas]))

        # print out the mean of the cross-validated results
        print "Results: " + str(np.array(results).mean())

    if __name__ == "__main__":
        main()
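Note that this uses the old cross_validation module and Python 2 print syntax. On newer versions of scikit-learn (0.18 and later) the same experiment can be run through model_selection; a minimal sketch, assuming the same Data/train.csv layout:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    import numpy as np

    dataset = np.genfromtxt(open('Data/train.csv', 'r'), delimiter=',', dtype='f8')[1:]
    target = dataset[:, 0]
    train = dataset[:, 1:]

    cfr = RandomForestClassifier(n_estimators=100)
    # 'neg_log_loss' is the negated log-loss, so flip the sign to compare with the score above
    scores = cross_val_score(cfr, train, target, cv=5, scoring='neg_log_loss')
    print("Results: " + str(-scores.mean()))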

It is worth mentioning that your cross-validation results may not match Kaggle's score for you, because:

The random component of random forests makes each run's result slightly different (the seed can be fixed; see the sketch after this list);
The actual test data may deviate from the training data; especially when the data set is small, the training data may not reflect the overall distribution;
Different implementations of the validation method can also make the results differ.
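To make successive runs repeatable, the forest's random seed can be fixed (a minimal sketch; random_state is a standard scikit-learn parameter and the value 0 is arbitrary):

    from sklearn.ensemble import RandomForestClassifier

    # with a fixed random_state, repeated runs build the same forest,
    # so any remaining variation comes from the data split itself
    cfr = RandomForestClassifier(n_estimators=100, random_state=0)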

Appendix: Installing a Python scientific computing environment on Linux

Scientific computing in Python mainly requires the following packages: NumPy, SciPy, scikit-learn, and Matplotlib.

Installing SciPy

You can download the Python package management tool pip to do the installation, but installing SciPy this way runs into a few problems.
SciPy needs the support of LAPACK and BLAS.
These two math libraries are required by a great deal of scientific computing software on Linux.
LAPACK, short for Linear Algebra PACKage, is a set of routines written in Fortran for numerical computation. It provides a rich set of tool functions for problems such as solving systems of simultaneous linear equations, least-squares solutions of linear systems, eigenvalue and eigenvector computation, Householder transformations for computing the QR decomposition of a matrix, and singular value decomposition.
BLAS, short for Basic Linear Algebra Subprograms, is a base library of linear algebra routines containing a large number of ready-made implementations of linear algebra operations.
Steps for installing LAPACK and BLAS:

  1. Download the lapack.tgz package and unpack it locally
  2. Enter the lapack folder
  3. Copy make.inc.example to make.inc; this file is the build configuration file
  4. Compile BLAS with the make blaslib command
  5. Compile LAPACK with the make lapacklib command
  6. You end up with two library files: librefblas.a and liblapack.a
    Note: these are Fortran libraries and need gfortran; I installed the Fortran compiler with sudo apt-get install gfortran-4.8 and sudo apt-get install gfortran

I am on Linux Mint, where sudo apt-get install liblapack-dev also installs these libraries easily, saving the hassle of compiling.
Even after trying this several times, pip install scipy still failed; the error message said that the Python.h header file could not be found.
In the end I found the simplest way to install SciPy: sudo apt-get install python-scipy installs it in one step, because the package manager resolves the dependencies and installs whatever additional packages are needed, sparing you from installing them one by one.
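Once SciPy is installed, a minimal check that it can reach LAPACK/BLAS is to solve a small linear system through scipy.linalg (the numbers are only illustrative):

    import numpy as np
    from scipy import linalg

    # scipy.linalg.solve calls into LAPACK under the hood
    a = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([9.0, 8.0])
    print(linalg.solve(a, b))   # expected roughly [ 2.  3.]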

Installing Matplotlib

Installing Matplotlib on Linux is rather more troublesome than installing it on Windows. I first downloaded the source package to install it.
FreeType and libpng are required when running python setup.py install, so I installed libpng-dev and libfreetype6-dev with apt-get.

"Error trying to exec ' cc1plus ': execvp:no such file or directory" problem, the solution is sudo apt-get install build-essential.

In the end, however, I adjusted the software sources and installed it with sudo apt-get install python-matplotlib. Testing after the installation showed that python-tk also needed to be installed, so I installed that with apt-get as well, and everything then worked normally.
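A minimal check that Matplotlib works end to end is to draw a simple plot:

    import numpy as np
    import matplotlib.pyplot as plt

    # if the Tk backend is in use and python-tk is missing, this is where it fails
    x = np.linspace(0, 2 * np.pi, 100)
    plt.plot(x, np.sin(x))
    plt.title("matplotlib sanity check")
    plt.show()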

Eclipse Development Environment

After downloading the free version of Eclipse, unzip it into the /usr/local/ directory.
Then create a desktop shortcut: sudo vim /usr/share/applications/eclipse.desktop
and paste in the following text:

    [Desktop Entry]
    Name=Eclipse
    Comment=Eclipse SDK
    Encoding=UTF-8
    Exec=/usr/local/eclipse/eclipse
    Icon=/usr/local/eclipse/icon.xpm
    Terminal=false
    Type=Application
    Categories=Application;Development;

Copy the file to the desktop, and you can then open Eclipse from the desktop shortcut.
After opening Eclipse, install PyDev, configure it, and it is ready to use.

Please credit the author, Jason Ding, and the source when reposting.
GitCafe blog homepage (http://jasonding1354.gitcafe.io/)
GitHub blog homepage (http://jasonding1354.github.io/)
CSDN blog (http://blog.csdn.net/jasonding1354)
Jianshu homepage (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)
Search for jasonding1354 on Baidu to find my blog homepage

"Kaggle" using random forest classification algorithm to solve biologial response problem

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.