[Example of sklearn] Classifier comparison

Tags: svm

Reference: http://cloga.info/python/2014/02/07/classify_use_sklearn/

Load a data set

Here I use pandas to load the dataset. The data is the Kaggle Titanic dataset; download train.csv from the competition page.

import pandas as pd

df = pd.read_csv('train.csv').fillna(0)  # replace missing values with 0
df.head()

   PassengerId  Survived  Pclass  Name                                                Sex     Age  SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                             male    22   1      0      A/5 21171         7.2500   0      S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38   1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                              female  26   0      0      STON/O2. 3101282  7.9250   0      S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)        female  35   1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                            male    35   0      0      373450            8.0500   0      S

5 rows × 12 columns

len(df)  # 891

You can see a total of 891 records in the training set, with 12 columns (the Survived column is the target category). Next, split the data into two DataFrames: a feature set and a target set.

exc_cols = [u'PassengerId', u'Survived', u'Name']
cols = [c for c in df.columns if c not in exc_cols]  # keep every other column as a feature
X = df[cols]
y = df['Survived'].values

For best algorithm efficiency, sklearn expects feature data of dtype=np.float32. Categorical features therefore have to be converted into numeric vectors. sklearn provides the DictVectorizer class for this; it accepts records as a list of dicts, so the DataFrame is first converted with pandas' to_dict method.

from sklearn.feature_extraction import DictVectorizer

v = DictVectorizer()
X = v.fit_transform(X.to_dict(outtype='records')).toarray()  # in newer pandas: to_dict(orient='records')

Let's compare the original information of the same instance with the result of vectorization.

print 'Vectorized:', X[10]
print 'Unvectorized:', v.inverse_transform(X[10])

Vectorized: [ 4.  0.  0. ...,  0.  0.  0.]
Unvectorized: [{'Fare': 16.699999999999999, 'Name=Sandstrom, Miss. Marguerite Rut': 1.0, 'Embarked=S': 1.0, 'Age': 4.0, 'Sex=female': 1.0, 'Parch': 1.0, 'Pclass': 3.0, 'Ticket=PP 9549': 1.0, 'Cabin=G6': 1.0, 'SibSp': 1.0, 'PassengerId': 11.0}]

If the classification labels are also strings, they need to be converted too, using the LabelEncoder class.
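That is not needed here, since Survived is already 0/1, but as a minimal sketch on made-up string labels:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
labels = le.fit_transform(['died', 'survived', 'died'])  # hypothetical string labels
print labels                        # [0 1 0]
print le.inverse_transform(labels)  # back to the original strings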

Divide the dataset into training and test sets.

from sklearn.cross_validation import train_test_split

data_train, data_test, target_train, target_test = train_test_split(X, y)
len(data_train)  # 668
len(data_test)   # 223

By default, 25% of the dataset is held out as the test set. With that, the training and test data are ready.
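If you want a different split, train_test_split also takes a test_size parameter (a sketch; the ratio and random_state value are just illustrative):

# hold out 30% instead of the default 25%; random_state makes the split reproducible
data_train, data_test, target_train, target_test = train_test_split(
    X, y, test_size=0.3, random_state=42)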

The basic flow of training a model and classifying with sklearn:

model = Model()                          # instantiate an estimator
model = model.fit(dataset.data, labels)  # train on features and labels
pred = model.predict(dataset.data)       # predict classes

Here's a comparison of naive Bayes, decision trees, random forests, and SVM.

from sklearn import cross_validation
from sklearn.naive_bayes import GaussianNB
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
import datetime

estimators = {}
estimators['bayes'] = GaussianNB()
estimators['tree'] = tree.DecisionTreeClassifier()
estimators['forest_100'] = RandomForestClassifier(n_estimators=100)
estimators['forest_10'] = RandomForestClassifier(n_estimators=10)
estimators['svm_c_rbf'] = svm.SVC()
estimators['svm_c_linear'] = svm.SVC(kernel='linear')
estimators['svm_linear'] = svm.LinearSVC()
estimators['svm_nusvc'] = svm.NuSVC()

The code above defines the algorithm for each model; next, each one is trained and scored.

for k in estimators.keys():
    start_time = datetime.datetime.now()
    print '----%s----' % k
    estimators[k] = estimators[k].fit(data_train, target_train)
    pred = estimators[k].predict(data_test)
    print("%s Score: %0.2f" % (k, estimators[k].score(data_test, target_test)))
    scores = cross_validation.cross_val_score(estimators[k], data_test, target_test, cv=5)
    print("%s Cross Avg. Score: %0.2f (+/- %0.2f)" % (k, scores.mean(), scores.std() * 2))
    end_time = datetime.datetime.now()
    time_spend = end_time - start_time
    print("%s Time: %0.2f" % (k, time_spend.total_seconds()))

----svm_c_rbf----
svm_c_rbf Score: 0.63
svm_c_rbf Cross Avg. Score: 0.54 (+/- 0.18)
svm_c_rbf Time: 1.67
----tree----
tree Score: 0.81
tree Cross Avg. Score: 0.75 (+/- 0.09)
tree Time: 0.90
----forest_10----
forest_10 Score: 0.83
forest_10 Cross Avg. Score: 0.80 (+/- 0.10)
forest_10 Time: 0.56
----forest_100----
forest_100 Score: 0.84
forest_100 Cross Avg. Score: 0.80 (+/- 0.14)
forest_100 Time: 5.38
----svm_linear----
svm_linear Score: 0.74
svm_linear Cross Avg. Score: 0.65 (+/- 0.18)
svm_linear Time: 0.15
----svm_nusvc----
svm_nusvc Score: 0.63
svm_nusvc Cross Avg. Score: 0.55 (+/- 0.21)
svm_nusvc Time: 1.62
----bayes----
bayes Score: 0.44
bayes Cross Avg. Score: 0.47 (+/- 0.07)
bayes Time: 0.16
----svm_c_linear----
svm_c_linear Score: 0.83
svm_c_linear Cross Avg. Score: 0.79 (+/- 0.14)
svm_c_linear Time: 465.57

Each estimator's score method and cross_validation.cross_val_score are used to measure prediction accuracy.

You can see that higher accuracy tends to cost more time: the linear-kernel SVC scores 0.83 but takes over 400 seconds, while the random forests reach 0.83-0.84 in a few seconds, making them the most cost-effective choice here. Let's run the models on the test.csv dataset provided by Kaggle.

test = pd.read_csv('test.csv').fillna(0)
test_d = test.to_dict(outtype='records')  # in newer pandas: to_dict(orient='records')
test_vec = v.transform(test_d).toarray()

It is important to note that the test data has to go through the same DictVectorizer transformation, reusing the vectorizer already fitted on the training data.
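The point is to call transform rather than fit_transform on the test set, so both sets share the feature columns learned from the training data. A minimal sketch with made-up records:

toy_train = [{'Sex': 'male'}, {'Sex': 'female'}]
toy_test = [{'Sex': 'female'}]

toy_v = DictVectorizer()
toy_v.fit_transform(toy_train).toarray()  # learns the columns Sex=female, Sex=male
toy_v.transform(toy_test).toarray()       # reuses those columns -> [[ 1.  0.]]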

for k in estimators.keys():
    estimators[k] = estimators[k].fit(X, y)  # retrain on the full training data
    pred = estimators[k].predict(test_vec)
    test['Survived'] = pred
    test.to_csv(k + '.csv', cols=['Survived', 'PassengerId'], index=False)  # in newer pandas: columns=

Well, submit your results to Kaggle.

