Tuning scikit-learn models with Pipeline + GridSearchCV, with a worked example: document classification (Python code)

Source: Internet
Uploader: User

Dataset: fetch_20newsgroups

 

# -*- coding: utf-8 -*-
import numpy as np
from sklearn import metrics
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
# GridSearchCV lives in sklearn.model_selection (the older sklearn.grid_search module was removed in scikit-learn 0.20).
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Fetch the text data to classify.
categories = ['comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware',
              'comp.sys.mac.hardware', 'comp.windows.x']
newsgroup_data = fetch_20newsgroups(subset='train', categories=categories)
X, Y = np.array(newsgroup_data.data), np.array(newsgroup_data.target)
# The first 2400 documents are used for training, the rest are held out for testing.
Xtrain, Ytrain, Xtest, Ytest = X[0:2400], Y[0:2400], X[2400:], Y[2400:]

# Pipeline chains three steps that must run in sequence; each step consumes the previous step's output.
# vect: tokenize, lowercase and count words -> tfidf: reweight the counts -> clf: classification model
pipeline_obj = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
print("pipeline:", [name for name, _ in pipeline_obj.steps])

# Dictionary of all candidate parameter values to search.
# Each key joins the step name and that step's parameter name with a double underscore.
parameters = {
    'vect__max_df': (0.5, 0.75),
    'vect__max_features': (None, 5000, 10000),
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.00001, 0.000001),
    'clf__max_iter': (10, 50),  # called n_iter in older scikit-learn releases
}
print("parameters:", parameters)

# GridSearchCV searches for the best parameters of the vectorizer, the TF-IDF transformer and the SGD classifier together.
# cv=3 requests 3-fold cross-validation for every candidate combination.
grid_search = GridSearchCV(pipeline_obj, parameters, cv=3, n_jobs=1, verbose=1)
print("grid_search:", grid_search)  # shows all parameter names and candidate values

# Run every candidate combination and find the best one.
grid_search.fit(Xtrain, Ytrain)

# Read the best parameters from the fitted search and print them.
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

# Overwrite the parameters of pipeline_obj with the best values found in the run reported here.
pipeline_obj.set_params(clf__alpha=1e-05, clf__max_iter=50, tfidf__use_idf=True,
                        vect__max_df=0.5, vect__max_features=None)
print(pipeline_obj.named_steps)

# Train the model with the best parameters, then evaluate on the training and test sets.
pipeline_obj.fit(Xtrain, Ytrain)
pred = pipeline_obj.predict(Xtrain)
print(metrics.classification_report(Ytrain, pred))
pred = pipeline_obj.predict(Xtest)
print(metrics.classification_report(Ytest, pred))
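The double-underscore naming used for the grid keys can be illustrated without scikit-learn. The sketch below only mimics the idea of routing a key such as `vect__max_df` to the step `vect` and its parameter `max_df`. The helper names `split_param_key` and `group_by_step` are hypothetical and are not part of the library, and this is not scikit-learn's actual implementation.

```python
def split_param_key(key):
    """Split '<step>__<param>' at the first double underscore."""
    step, sep, param = key.partition('__')
    if not sep:
        raise ValueError("expected '<step>__<param>', got %r" % key)
    return step, param


def group_by_step(param_grid):
    """Group candidate values by the pipeline step they belong to."""
    grouped = {}
    for key, candidates in param_grid.items():
        step, param = split_param_key(key)
        grouped.setdefault(step, {})[param] = candidates
    return grouped


# A reduced grid in the same style as the article's `parameters` dictionary.
grid = {
    'vect__max_df': (0.5, 0.75),
    'tfidf__use_idf': (True, False),
    'clf__alpha': (0.00001, 0.000001),
}

grouped = group_by_step(grid)
for step, params in grouped.items():
    print(step, params)
```

Splitting at the first `__` only means a parameter name may itself contain double underscores, which is what makes nested estimators addressable with the same convention.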

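Result: the grid contains 96 candidate parameter combinations in total. Each combination is evaluated with 3-fold cross-validation, so the model is fit 96*3=288 times.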
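The count of 96 can be checked by multiplying the number of candidate values per parameter. The following is a small sketch using only the standard library. The dictionary repeats the article's grid, and the variable names `n_candidates`, `cv_folds` and `total_fits` are introduced here for illustration.

```python
import math

# Same candidate grid as in the article; only the number of values per key matters here.
parameters = {
    'vect__max_df': (0.5, 0.75),
    'vect__max_features': (None, 5000, 10000),
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.00001, 0.000001),
    'clf__n_iter': (10, 50),  # newer scikit-learn releases call this max_iter
}

# The grid is the Cartesian product of all candidate lists.
n_candidates = math.prod(len(values) for values in parameters.values())
cv_folds = 3  # folds per candidate, as stated in the article
total_fits = n_candidates * cv_folds

print(n_candidates, total_fits)
```

This count covers only the cross-validation fits. With the default `refit=True`, `GridSearchCV` fits the best combination once more on the full training set.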

 

Before tuning vs. after tuning:

