Reference: http://cloga.info/python/2014/02/07/classify_use_sklearn/
Load a data set
Here I use pandas to load the dataset. The data is the Kaggle Titanic dataset; download train.csv from Kaggle.
```python
import pandas as pd

df = pd.read_csv('train.csv')
df = df.fillna(0)  # replace missing values with 0
df.head()
```
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | 0 | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | 0 | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | 0 | S |

5 rows × 12 columns
```python
len(df)
# 891
```
You can see that the training set contains a total of 891 records with 12 columns (the Survived column is the target class). Next, split the dataset into two parts: a feature set and a target set.
```python
exc_cols = [u'PassengerId', u'Survived', u'Name']
cols = [c for c in df.columns if c not in exc_cols]
x = df[cols]
y = df['Survived'].values
```
For efficiency, the feature data sklearn accepts should be of dtype=np.float32 to get the best algorithm performance. Categorical features therefore need to be converted into vectors. sklearn provides the DictVectorizer class for this conversion. DictVectorizer accepts records in the form of a list of dicts, so the DataFrame first has to be converted with pandas' to_dict method.
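To make the encoding concrete, here is a toy sketch; the two records below are made up for illustration and are not taken from the dataset:

```python
from sklearn.feature_extraction import DictVectorizer

# One categorical feature (Sex) and one numeric feature (Age).
v_demo = DictVectorizer()
demo = v_demo.fit_transform([{'Sex': 'male', 'Age': 22.0},
                             {'Sex': 'female', 'Age': 38.0}]).toarray()
print(v_demo.get_feature_names())  # ['Age', 'Sex=female', 'Sex=male']
print(demo)                        # [[ 22.  0.  1.]  [ 38.  1.  0.]]
```

Each distinct string value becomes its own 0/1 column, while numeric features pass through unchanged.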
```python
from sklearn.feature_extraction import DictVectorizer

v = DictVectorizer()
x = v.fit_transform(x.to_dict(outtype='records')).toarray()
```
Let's compare the original information of the same instance with the result of vectorization.
```python
print 'Vectorized:', x[10]
print 'Unvectorized:', v.inverse_transform(x[10])
```

```
Vectorized: [ 4.  0.  0. ...,  0.  0.  0.]
Unvectorized: [{'Fare': 16.699999999999999, 'Name=Sandstrom, Miss. Marguerite Rut': 1.0, 'Embarked=S': 1.0, 'Age': 4.0, 'Sex=female': 1.0, 'Parch': 1.0, 'Pclass': 3.0, 'Ticket=PP 9549': 1.0, 'Cabin=G6': 1.0, 'SibSp': 1.0, 'PassengerId': 11.0}]
```
If the class labels are also strings, they need to be converted with the LabelEncoder class.
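For example, a minimal sketch with made-up string labels:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels, for illustration only.
le = LabelEncoder()
y_str = ['survived', 'died', 'died', 'survived']
y_num = le.fit_transform(y_str)     # array([1, 0, 0, 1])
print(le.classes_)                  # ['died' 'survived']
print(le.inverse_transform(y_num))  # back to the original strings
```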
Divide the dataset into training and test sets.
```python
from sklearn.cross_validation import train_test_split

data_train, data_test, target_train, target_test = train_test_split(x, y)
len(data_train)  # 668
len(data_test)   # 223
```
By default, 25% of the dataset is used as the test set. With this, the training and test datasets are ready.
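For reference, the same split written out explicitly; test_size=0.25 mirrors the default, and random_state is an arbitrary value added here for reproducibility, not something in the original post:

```python
from sklearn.cross_validation import train_test_split

# Explicit form of the default 75/25 split.
data_train, data_test, target_train, target_test = train_test_split(
    x, y, test_size=0.25, random_state=42)
```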
Using sklearn for classification

The basic flow of training a model with sklearn is:
```python
# Pseudocode: Model, dataset and labels are placeholders.
model = Model()                          # instantiate an estimator
model = model.fit(dataset.data, labels)  # train on features and labels
pred = model.predict(dataset.data)       # predict classes
```
Here's a comparison of naive Bayes, decision trees, random forests, and SVM.
```python
from sklearn import cross_validation
from sklearn.naive_bayes import GaussianNB
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
import datetime

estimators = {}
estimators['bayes'] = GaussianNB()
estimators['tree'] = tree.DecisionTreeClassifier()
estimators['forest_100'] = RandomForestClassifier(n_estimators=100)
estimators['forest_10'] = RandomForestClassifier(n_estimators=10)
estimators['svm_c_rbf'] = svm.SVC()
estimators['svm_c_linear'] = svm.SVC(kernel='linear')
estimators['svm_linear'] = svm.LinearSVC()
estimators['svm_nusvc'] = svm.NuSVC()
```
The first step, as above, is to define the algorithm used by each model. The next step trains and evaluates each of them:
```python
for k in estimators.keys():
    start_time = datetime.datetime.now()
    print '----%s----' % k
    estimators[k] = estimators[k].fit(data_train, target_train)
    pred = estimators[k].predict(data_test)
    print("%s Score: %0.2f" % (k, estimators[k].score(data_test, target_test)))
    scores = cross_validation.cross_val_score(estimators[k], data_test, target_test, cv=5)
    print("%s Cross Avg. Score: %0.2f (+/- %0.2f)" % (k, scores.mean(), scores.std() * 2))
    end_time = datetime.datetime.now()
    time_spend = end_time - start_time
    print("%s Time: %0.2f" % (k, time_spend.total_seconds()))
```
```
----svm_c_rbf----
svm_c_rbf Score: 0.63
svm_c_rbf Cross Avg. Score: 0.54 (+/- 0.18)
svm_c_rbf Time: 1.67
----tree----
tree Score: 0.81
tree Cross Avg. Score: 0.75 (+/- 0.09)
tree Time: 0.90
----forest_10----
forest_10 Score: 0.83
forest_10 Cross Avg. Score: 0.80 (+/- 0.10)
forest_10 Time: 0.56
----forest_100----
forest_100 Score: 0.84
forest_100 Cross Avg. Score: 0.80 (+/- 0.14)
forest_100 Time: 5.38
----svm_linear----
svm_linear Score: 0.74
svm_linear Cross Avg. Score: 0.65 (+/- 0.18)
svm_linear Time: 0.15
----svm_nusvc----
svm_nusvc Score: 0.63
svm_nusvc Cross Avg. Score: 0.55 (+/- 0.21)
svm_nusvc Time: 1.62
----bayes----
bayes Score: 0.44
bayes Cross Avg. Score: 0.47 (+/- 0.07)
bayes Time: 0.16
----svm_c_linear----
svm_c_linear Score: 0.83
svm_c_linear Cross Avg. Score: 0.79 (+/- 0.14)
svm_c_linear Time: 465.57
```
Each estimator's score method and cross_validation's cross_val_score function are used to calculate prediction accuracy.
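To spell out what these numbers mean, here is a small sketch reusing the fitted 'tree' model and the variables above:

```python
import numpy as np

# score() reports the mean accuracy of the predictions on the given data;
# cross_val_score() repeats that evaluation over cv folds.
clf = estimators['tree']  # fitted in the loop above
pred = clf.predict(data_test)
print("manual accuracy: %0.2f" % np.mean(pred == target_test))
print("score():         %0.2f" % clf.score(data_test, target_test))  # same value
```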
You can see that higher accuracy tends to come with a higher time cost. The most cost-effective algorithm here is the random forest. Now let's run the models on the test.csv dataset provided by Kaggle.
```python
test = pd.read_csv('test.csv')
test = test.fillna(0)  # handle missing values the same way as the training set
test_d = test.to_dict(outtype='records')
test_vec = v.transform(test_d).toarray()
```
It is important to note that the test data must go through the same DictVectorizer conversion: use the vectorizer already fitted on the training data (its transform method, not fit_transform) so that the test feature columns line up with the training features.
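A quick sanity check of that alignment, as a sketch using the names above:

```python
# The vectorizer fitted on train.csv fixes the column layout; refitting on
# test.csv would produce a different layout and break the trained models.
n_train_features = len(v.get_feature_names())
assert test_vec.shape[1] == n_train_features  # columns line up
```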
```python
for k in estimators.keys():
    estimators[k] = estimators[k].fit(x, y)  # refit on the full training set
    pred = estimators[k].predict(test_vec)
    test['Survived'] = pred
    test.to_csv(k + '.csv', cols=['Survived', 'PassengerId'], index=False)
```
Well, submit your results to Kaggle.