Python implementation of text categorization based on the XGBoost algorithm

Source: Internet
Author: User
Tags: xgboost

Description: The training set consists of comment texts, each labeled with one of three categories: pos, neu, neg. In train.csv, the first column is the text content and the second column is the label. There are many detailed guides on the Internet for installing Python's xgboost package.
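For concreteness, here is a hypothetical sketch of the train.csv layout implied by the code below; the rows are invented for illustration, and the leading index column is an assumption based on the code reading fields 1 and 2 after skipping a header row:

id,content,label
1,房间很干净，服务态度也很好,pos
2,位置一般，价格还算合理,neu
3,隔音太差，不会再来了,neg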

Parameters

XGBoost's author divides the parameters into three categories: 1. General parameters: control the overall functioning. 2. Booster parameters: control the individual booster (tree or linear) at each step. 3. Learning task parameters: control the training objective and how performance is measured.

1. General parameters:

booster [default=gbtree]: selects the booster. gbtree uses tree-based models; gblinear uses linear models.
silent [default=0]: when set to 1, silent mode is turned on and no messages are printed.
nthread [default: the maximum number of threads available]: controls multithreading and should be set to the number of cores in your system. To use all CPU cores, simply omit this parameter; the algorithm will detect the core count automatically.
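As a minimal sketch (the values are illustrative assumptions, not recommendations), these general parameters go into the same params dict that is later passed to xgb.train:

param = {
    'booster': 'gbtree',  # tree-based model; 'gblinear' would select the linear model
    'silent': 1,          # 1 turns on silent mode, so no messages are printed
    # 'nthread' is omitted on purpose: xgboost then uses all available cores
}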

2. Booster parameters:

Only the tree booster is described here, because it generally performs far better than the linear booster; the linear booster is rarely used in practice.

eta [default=0.3]: similar to the learning rate in GBM. Shrinking the weight of each step improves the robustness of the model. Commonly used values: 0.2, 0.3.
max_depth [default=6]: the maximum depth of a tree. The larger max_depth is, the more specific and localized the patterns a tree can learn. A commonly used value is 6.
gamma [default=0]: the minimum loss reduction required for a node to split. The larger this value, the more conservative the algorithm. Its scale is closely tied to the loss function.
subsample [default=1]: the fraction of training rows randomly sampled for each tree. Reducing this value makes the algorithm more conservative and helps avoid overfitting, but setting it too low may cause underfitting. Common values: 0.7-1.
colsample_bytree [default=1]: the fraction of columns (features) randomly sampled for each tree. Common values: 0.7-1.
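A sketch of the tree-booster parameters above, with values inside the commonly used ranges (the concrete numbers are assumptions for illustration):

param.update({
    'eta': 0.3,               # step-size shrinkage; smaller values give a more robust model
    'max_depth': 6,           # maximum depth of each tree
    'gamma': 0,               # minimum loss reduction required to split a node
    'subsample': 0.8,         # fraction of rows sampled per tree (0.7-1 is common)
    'colsample_bytree': 0.8,  # fraction of features sampled per tree (0.7-1 is common)
})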

3. Learning task parameters:

objective [default=reg:linear]: defines the loss function to be minimized. binary:logistic performs logistic regression for binary classification and returns predicted probabilities. multi:softmax uses the softmax objective for multiclass classification and returns the predicted class; it requires one extra parameter, num_class (the number of classes). multi:softprob is the same as multi:softmax but returns the probability of each class for each data point.
eval_metric [default depends on the objective]: the metric used on validation data. The default is rmse for regression and error for classification. Other values: rmse (root mean squared error), mae (mean absolute error), logloss (negative log-likelihood), error (binary classification error rate with a 0.5 threshold), merror (multiclass error rate), mlogloss (multiclass logloss), auc (area under the ROC curve).
seed [default=0]: the random seed; setting it makes randomized results reproducible.
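Combining the learning-task settings for the three-class problem in this article, a sketch (the eval_metric and seed values are illustrative):

param.update({
    'objective': 'multi:softmax',  # multiclass classification; returns the predicted class directly
    'num_class': 3,                # required together with multi:softmax or multi:softprob
    'eval_metric': 'merror',       # multiclass error rate, as used in the experiment below
    'seed': 0,                     # fix the random seed so the results are reproducible
})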

Experiment

Code

# -*- coding: utf-8 -*-
import xgboost as xgb
import csv
import jieba
jieba.load_userdict('wordDict.txt')
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer


# Read the training set
def readtrain():
    with open('train.csv', 'rb') as csvfile:
        reader = csv.reader(csvfile)
        column1 = [row for row in reader]
    content_train = [i[1] for i in column1[1:]]  # text content column, skipping the header row
    opinion_train = [i[2] for i in column1[1:]]  # category column, skipping the header row
    print 'The training set has %s sentences' % len(content_train)
    train = [content_train, opinion_train]
    return train


# Convert a list of utf-8 strings to unicode
def changelistcode(b):
    a = []
    for i in b:
        a.append(i.decode('utf8'))
    return a


# Segment each text in the list and join the words with spaces
def segmentword(cont):
    c = []
    for i in cont:
        a = list(jieba.cut(i))
        b = " ".join(a)
        c.append(b)
    return c


# Represent categories with numbers: pos: 2, neu: 1, neg: 0
def translabel(labels):
    for i in range(len(labels)):
        if labels[i] == 'pos':
            labels[i] = 2
        elif labels[i] == 'neu':
            labels[i] = 1
        elif labels[i] == 'neg':
            labels[i] = 0
        else:
            print "label invalid:", labels[i]
    return labels


train = readtrain()
content = segmentword(train[0])
opinion = translabel(train[1])  # categories must be represented as numbers
opinion = np.array(opinion)     # numpy format is required

train_content = content[:7000]
train_opinion = opinion[:7000]
test_content = content[7000:]
test_opinion = opinion[7000:]

vectorizer = CountVectorizer()
tfidftransformer = TfidfTransformer()
tfidf = tfidftransformer.fit_transform(vectorizer.fit_transform(train_content))
weight = tfidf.toarray()
print tfidf.shape
test_tfidf = tfidftransformer.transform(vectorizer.transform(test_content))
test_weight = test_tfidf.toarray()
print test_weight.shape

dtrain = xgb.DMatrix(weight, label=train_opinion)
dtest = xgb.DMatrix(test_weight, label=test_opinion)  # the label is optional; it is included here to evaluate on the test set
param = {'max_depth': 6, 'eta': 0.5, 'eval_metric': 'merror', 'silent': 1,
         'objective': 'multi:softmax', 'num_class': 3}  # parameters
evallist = [(dtrain, 'train'), (dtest, 'test')]  # optional; used here to monitor the error during training
num_round = 50  # number of boosting rounds
bst = xgb.train(param, dtrain, num_round, evallist)
preds = bst.predict(dtest)
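The script stops at preds. As a minimal follow-up sketch (not part of the original code), the multiclass error on the held-out split can be checked directly against test_opinion:

# assumes preds and test_opinion from the script above
error = np.mean(preds != test_opinion)
print 'final test merror: %f' % error  # should match the last test-merror line in the output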

Output

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\www\AppData\Local\Temp\jieba.cache
Loading model cost 0.366 seconds.
Prefix dict has been built successfully.
The training set has 10981 sentences
(7000, 14758)
(3981L, 14758L)
[0]   train-merror:0.337857   test-merror:0.409194
[1]   train-merror:0.322000   test-merror:0.401658
[2]   train-merror:0.312429   test-merror:0.401909
[3]   train-merror:0.300857   test-merror:0.387340
[4]   train-merror:0.293143   test-merror:0.389601
[5]   train-merror:0.286286   test-merror:0.390857
[6]   train-merror:0.279000   test-merror:0.388847
[7]   train-merror:0.270571   test-merror:0.387340
[8]   train-merror:0.263857   test-merror:0.379804
[9]   train-merror:0.257286   test-merror:0.376036
[10]  train-merror:0.248000   test-merror:0.374278
[11]  train-merror:0.241857   test-merror:0.371012
[12]  train-merror:0.237000   test-merror:0.369254
[13]  train-merror:0.231571   test-merror:0.366491
[14]  train-merror:0.225857   test-merror:0.365737
[15]  train-merror:0.220286   test-merror:0.365988
[16]  train-merror:0.216286   test-merror:0.364732
[17]  train-merror:0.212286   test-merror:0.360462
[18]  train-merror:0.210143   test-merror:0.357699
[19]  train-merror:0.205143   test-merror:0.356694
[20]  train-merror:0.202286   test-merror:0.357699
[21]  train-merror:0.198571   test-merror:0.3582
[22]  train-merror:0.195429   test-merror:0.356443
[23]  train-merror:0.192143   test-merror:0.358955
[24]  train-merror:0.189286   test-merror:0.358955
[25]  train-merror:0.186571   test-merror:0.354936
[26]  train-merror:0.183429   test-merror:0.353680
[27]  train-merror:0.181714   test-merror:0.353429
[28]  train-merror:0.178286   test-merror:0.353680
[29]  train-merror:0.174143   test-merror:0.352675
[30]  train-merror:0.172286   test-merror:0.352675
[31]  train-merror:0.171286   test-merror:0.353680
[32]  train-merror:0.168857   test-merror:0.354434
[33]  train-merror:0.167429   test-merror:0.352675
[34]  train-merror:0.164286   test-merror:0.350917
[35]  train-merror:0.160714   test-merror:0.348907
[36]  train-merror:0.159000   test-merror:0.346898
[37]  train-merror:0.157571   test-merror:0.346395
[38]  train-merror:0.1562     test-merror:0.347400
[39]  train-merror:0.154571   test-merror:0.346647
[40]  train-merror:0.153714   test-merror:0.345642
[41]  train-merror:0.152857   test-merror:0.346647
[42]  train-merror:0.150000   test-merror:0.345391
[43]  train-merror:0.148143   test-merror:0.345893
[44]  train-merror:0.145857   test-merror:0.344135
[45]  train-merror:0.144000   test-merror:0.341874
[46]  train-merror:0.143000   test-merror:0.342879
[47]  train-merror:0.142714   test-merror:0.341874
[48]  train-merror:0.141714   test-merror:0.341372
[49]  train-merror:0.138286   test-merror:0.339362