Reference: http://cloga.info/python/2014/02/07/classify_use_sklearn/
Load a data set
Here I use pandas to load the dataset. The data is the Kaggle Titanic dataset; download train.csv from Kaggle.
```python
import pandas as pd

df = pd.read_csv('train.csv')
df = df.fillna(0)  # replace missing values with 0
df.head()
```
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | 0 | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | 0 | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | 0 | S |

5 rows × 12 columns
```python
len(df)
# 891
```
You can see that the training set contains a total of 891 records with 12 columns (the Survived column is the target class). Next, split the dataset into two parts: a feature set and a target set.
```python
exc_cols = [u'PassengerId', u'Survived', u'Name']
cols = [c for c in df.columns if c not in exc_cols]
x = df[cols]
y = df['Survived'].values
```
For efficiency, the feature data sklearn accepts should be of dtype=np.float32 to get the best algorithm performance. Categorical features therefore need to be converted into vectors. sklearn provides the DictVectorizer class for this conversion. DictVectorizer accepts records in the form of a list of dicts, so the DataFrame first has to be converted with pandas' to_dict method.
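To make the encoding concrete, here is a toy sketch; the two records below are made up for illustration and are not taken from the dataset:

```python
from sklearn.feature_extraction import DictVectorizer

# One categorical feature (Sex) and one numeric feature (Age).
v_demo = DictVectorizer()
demo = v_demo.fit_transform([{'Sex': 'male', 'Age': 22.0},
                             {'Sex': 'female', 'Age': 38.0}]).toarray()
print(v_demo.get_feature_names())  # ['Age', 'Sex=female', 'Sex=male']
print(demo)                        # [[ 22.  0.  1.]  [ 38.  1.  0.]]
```

Each distinct string value becomes its own 0/1 column, while numeric features pass through unchanged.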
```python
from sklearn.feature_extraction import DictVectorizer

v = DictVectorizer()
x = v.fit_transform(x.to_dict(outtype='records')).toarray()
```
Let's compare the original information of the same instance with the result of vectorization.
```python
print 'Vectorized:', x[10]
print 'Unvectorized:', v.inverse_transform(x[10])
```

```
Vectorized: [ 4.  0.  0. ...,  0.  0.  0.]
Unvectorized: [{'Fare': 16.699999999999999, 'Name=Sandstrom, Miss. Marguerite Rut': 1.0, 'Embarked=S': 1.0, 'Age': 4.0, 'Sex=female': 1.0, 'Parch': 1.0, 'Pclass': 3.0, 'Ticket=PP 9549': 1.0, 'Cabin=G6': 1.0, 'SibSp': 1.0, 'PassengerId': 11.0}]
```
If the class labels are also strings, they need to be converted with the LabelEncoder class.
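For example, a minimal sketch with made-up string labels:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels, for illustration only.
le = LabelEncoder()
y_str = ['survived', 'died', 'died', 'survived']
y_num = le.fit_transform(y_str)     # array([1, 0, 0, 1])
print(le.classes_)                  # ['died' 'survived']
print(le.inverse_transform(y_num))  # back to the original strings
```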
Divide the dataset into training and test sets.
```python
from sklearn.cross_validation import train_test_split

data_train, data_test, target_train, target_test = train_test_split(x, y)
len(data_train)  # 668
len(data_test)   # 223
```
By default, 25% of the dataset is used as the test set. With this, the training and test datasets are ready.
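For reference, the same split written out explicitly; test_size=0.25 mirrors the default, and random_state is an arbitrary value added here for reproducibility, not something in the original post:

```python
from sklearn.cross_validation import train_test_split

# Explicit form of the default 75/25 split.
data_train, data_test, target_train, target_test = train_test_split(
    x, y, test_size=0.25, random_state=42)
```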
Using sklearn for classification

The basic flow of training a model with sklearn is:
```python
# Pseudocode: Model, dataset and labels are placeholders.
model = Model()                          # instantiate an estimator
model = model.fit(dataset.data, labels)  # train on features and labels
pred = model.predict(dataset.data)       # predict classes
```
Here's a comparison of naive Bayes, decision trees, random forests, and SVM.
```python
from sklearn import cross_validation
from sklearn.naive_bayes import GaussianNB
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
import datetime

estimators = {}
estimators['bayes'] = GaussianNB()
estimators['tree'] = tree.DecisionTreeClassifier()
estimators['forest_100'] = RandomForestClassifier(n_estimators=100)
estimators['forest_10'] = RandomForestClassifier(n_estimators=10)
estimators['svm_c_rbf'] = svm.SVC()
estimators['svm_c_linear'] = svm.SVC(kernel='linear')
estimators['svm_linear'] = svm.LinearSVC()
estimators['svm_nusvc'] = svm.NuSVC()
```
The first step, as above, is to define the algorithm used by each model. The next step trains and evaluates each of them:
```python
for k in estimators.keys():
    start_time = datetime.datetime.now()
    print '----%s----' % k
    estimators[k] = estimators[k].fit(data_train, target_train)
    pred = estimators[k].predict(data_test)
    print("%s Score: %0.2f" % (k, estimators[k].score(data_test, target_test)))
    scores = cross_validation.cross_val_score(estimators[k], data_test, target_test, cv=5)
    print("%s Cross Avg. Score: %0.2f (+/- %0.2f)" % (k, scores.mean(), scores.std() * 2))
    end_time = datetime.datetime.now()
    time_spend = end_time - start_time
    print("%s Time: %0.2f" % (k, time_spend.total_seconds()))
```
```
----svm_c_rbf----
svm_c_rbf Score: 0.63
svm_c_rbf Cross Avg. Score: 0.54 (+/- 0.18)
svm_c_rbf Time: 1.67
----tree----
tree Score: 0.81
tree Cross Avg. Score: 0.75 (+/- 0.09)
tree Time: 0.90
----forest_10----
forest_10 Score: 0.83
forest_10 Cross Avg. Score: 0.80 (+/- 0.10)
forest_10 Time: 0.56
----forest_100----
forest_100 Score: 0.84
forest_100 Cross Avg. Score: 0.80 (+/- 0.14)
forest_100 Time: 5.38
----svm_linear----
svm_linear Score: 0.74
svm_linear Cross Avg. Score: 0.65 (+/- 0.18)
svm_linear Time: 0.15
----svm_nusvc----
svm_nusvc Score: 0.63
svm_nusvc Cross Avg. Score: 0.55 (+/- 0.21)
svm_nusvc Time: 1.62
----bayes----
bayes Score: 0.44
bayes Cross Avg. Score: 0.47 (+/- 0.07)
bayes Time: 0.16
----svm_c_linear----
svm_c_linear Score: 0.83
svm_c_linear Cross Avg. Score: 0.79 (+/- 0.14)
svm_c_linear Time: 465.57
```
Each estimator's score method and cross_validation's cross_val_score function are used to calculate prediction accuracy.
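To spell out what these numbers mean, here is a small sketch reusing the fitted 'tree' model and the variables above:

```python
import numpy as np

# score() reports the mean accuracy of the predictions on the given data;
# cross_val_score() repeats that evaluation over cv folds.
clf = estimators['tree']  # fitted in the loop above
pred = clf.predict(data_test)
print("manual accuracy: %0.2f" % np.mean(pred == target_test))
print("score():         %0.2f" % clf.score(data_test, target_test))  # same value
```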
You can see that higher accuracy tends to come with a higher time cost. The most cost-effective algorithm here is the random forest. Now let's run the models on the test.csv dataset provided by Kaggle.
```python
test = pd.read_csv('test.csv')
test = test.fillna(0)  # handle missing values the same way as the training set
test_d = test.to_dict(outtype='records')
test_vec = v.transform(test_d).toarray()
```
It is important to note that the test data must go through the same DictVectorizer conversion: use the vectorizer already fitted on the training data (its transform method, not fit_transform) so that the test feature columns line up with the training features.
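A quick sanity check of that alignment, as a sketch using the names above:

```python
# The vectorizer fitted on train.csv fixes the column layout; refitting on
# test.csv would produce a different layout and break the trained models.
n_train_features = len(v.get_feature_names())
assert test_vec.shape[1] == n_train_features  # columns line up
```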
```python
for k in estimators.keys():
    estimators[k] = estimators[k].fit(x, y)  # refit on the full training set
    pred = estimators[k].predict(test_vec)
    test['Survived'] = pred
    test.to_csv(k + '.csv', cols=['Survived', 'PassengerId'], index=False)
```
Well, submit your results to Kaggle.