Predicting a Biological Response
Predict the biological response of molecules from their chemical properties.
The objective of the competition is to build as good a model as possible, so that, as optimally as this data allows, molecular information can be related to an actual biological response.
The data is provided in comma-separated values (CSV) format. Each row in the data set represents a molecule. The first column contains experimental data describing the actual biological response: the molecule either elicited the response (1) or did not (0). The remaining columns are molecular descriptors (D1 through D1776), calculated properties that capture some characteristics of the molecule, for example its size, shape, or elemental constitution. The descriptor matrix has been normalized.
In brief: each line of the CSV file represents a molecule. The first column records the actual biological response, either a response (1) or no response (0). Columns 2 through 1777 are the molecule's attributes, such as its size, shape, or elemental composition.
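Before modeling, it helps to confirm the layout described above. The sketch below stands in for the real train.csv with a tiny in-memory CSV; the column names ('Activity', D1...) mirror the competition file but are an assumption here, and the real descriptor matrix has 1776 columns rather than three.

```python
import io
import pandas as pd

# Tiny stand-in for train.csv: header row, then one molecule per line.
# Assumed layout: 'Activity' (0/1 response) followed by descriptors D1..D1776.
csv_text = "Activity,D1,D2,D3\n1,0.1,0.5,0.0\n0,0.9,0.2,0.3\n1,0.4,0.4,0.1\n"
train = pd.read_csv(io.StringIO(csv_text))

labels = train.iloc[:, 0]     # first column: biological response (0 or 1)
features = train.iloc[:, 1:]  # remaining columns: molecular descriptors
print(train.shape)
print(labels.value_counts().to_dict())
```

On the real file the same two slices give you the target vector and the normalized descriptor matrix directly.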
The competition has long since ended, but you can still submit up to five results to see how your score would rank. You only need to submit a result file in CSV format.
Since the labels are 0 and 1, this is a binary classification problem.
For a binary classification problem with this many attributes, logistic regression is a natural first attempt.
The following Python code uses logistic regression for prediction:
#!/usr/bin/env python
# coding: utf-8
'''
Created on August 1
@author: zhaohf
'''
from sklearn.linear_model import LogisticRegression
from numpy import genfromtxt, savetxt

def main():
    # Skip the header row; first column is the response, the rest are descriptors.
    dataset = genfromtxt(open('../Data/train.csv', 'r'), delimiter=',', dtype='f8')[1:]
    test = genfromtxt(open('../Data/test.csv', 'r'), delimiter=',', dtype='f8')[1:]
    target = [x[0] for x in dataset]
    train = [x[1:] for x in dataset]
    lr = LogisticRegression()
    lr.fit(train, target)
    # Column 1 of predict_proba is the probability of class 1.
    predicted_probs = [[index + 1, x[1]] for index, x in enumerate(lr.predict_proba(test))]
    savetxt('../Submissions/lr_benchmark.csv', predicted_probs, delimiter=',',
            fmt='%d,%f', header='MoleculeId,PredictedProbability', comments='')

if __name__ == '__main__':
    main()
Scored by the competition's log-loss metric, this submission earns a public score of 0.59425, a rather poor result that ranks hundreds of places down the leaderboard.
Next, let's try an SVM. The code is very similar.
#!/usr/bin/env python
# coding: utf-8
'''
Created on August 1
@author: zhaohf
'''
from sklearn import svm
from numpy import genfromtxt, savetxt

def main():
    dataset = genfromtxt(open('../Data/train.csv', 'r'), delimiter=',', dtype='f8')[1:]
    test = genfromtxt(open('../Data/test.csv', 'r'), delimiter=',', dtype='f8')[1:]
    target = [x[0] for x in dataset]
    train = [x[1:] for x in dataset]
    # probability=True enables predict_proba for SVC.
    svc = svm.SVC(probability=True)
    svc.fit(train, target)
    predicted_probs = [[index + 1, x[1]] for index, x in enumerate(svc.predict_proba(test))]
    savetxt('../Submissions/svm_benchmark.csv', predicted_probs, delimiter=',',
            fmt='%d,%f', header='MoleculeId,PredictedProbability', comments='')

if __name__ == '__main__':
    main()
The SVM scores 0.52553, slightly better than logistic regression (for log loss, lower is better).
The top score on the leaderboard is 0.37356; closing that gap would take considerably more effort.
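One common way to push below these linear baselines (a standard next step, not something from the original write-up) is a tree ensemble such as a random forest, which copes well with many correlated descriptors. A minimal sketch on synthetic data, with the same cross-validated log-loss estimate as before:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with a nonlinear response; the real descriptor
# matrix has 1776 columns, so expect longer training times there.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 40))
y = ((X[:, 0] * X[:, 1] + X[:, 2]) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(rf, X, y, cv=5, scoring='neg_log_loss')
print(-scores.mean())  # cross-validated log loss for the ensemble
```

Tuning n_estimators and max_features (or trying gradient boosting) is the usual route from here; whether it reaches 0.37356 on this data set is something only experimentation can tell.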