From Novice to Advanced: Implementing the Chi-square Feature Selection Algorithm in C++

Last updated: 2018-12-07. Source: Internet


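Author: finallyliuyu (please credit the original author and source when reposting)

Text classification cannot do without a feature term selection module. Feature selection is a key step in dimensionality reduction.

First, here is pseudocode for a generic feature term selection module:

(The figure is taken from C. D. Manning, Introduction to Information Retrieval, p. 251 of the original edition or p. 188 of Wang Bin's translation.)

I will add only two remarks here; for everything else, please consult the book.

1. The pseudocode above works on a single class: it selects the top k feature terms for that class according to some utility measure (such as IG or chi-square). The ComputeFeatureUtility(D,t,c) call in the pseudocode is what computes that measure.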
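The pseudocode figure is not reproduced here. As a rough C++ sketch of the same idea (score every term for one class, keep the k best), assuming the utility measure is supplied as a callback that already knows the document set D and the class c. SelectFeatures and UtilityFn are illustrative names, not part of the author's Preprocess class:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Stand-in for ComputeFeatureUtility(D, t, c); D and c are assumed to be
// captured by the callable, so only the term is passed in.
using UtilityFn = std::function<double(const std::string& term)>;

// Returns the k terms with the largest utility, best first.
std::vector<std::string> SelectFeatures(const std::vector<std::string>& vocabulary,
                                        const UtilityFn& utility, std::size_t k) {
    assert(static_cast<bool>(utility));  // a utility function must be provided
    std::vector<std::pair<double, std::string> > scored;
    scored.reserve(vocabulary.size());
    for (const std::string& t : vocabulary) {
        scored.emplace_back(utility(t), t);
    }
    k = std::min(k, scored.size());
    // Only the first k positions need to be in order.
    std::partial_sort(scored.begin(), scored.begin() + k, scored.end(),
                      [](const std::pair<double, std::string>& a,
                         const std::pair<double, std::string>& b) {
                          return a.first > b.first;
                      });
    std::vector<std::string> top;
    for (std::size_t i = 0; i < k; ++i) {
        top.push_back(scored[i].second);
    }
    return top;
}
```

Any measure can be plugged in through the callback; the rest of this article uses chi-square.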

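2. For a whole classification problem, how do we select the full set of feature terms?

There are many ways; here is just one. Suppose there are N classes and K feature terms are needed in total; then K/N feature terms are selected for each class.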
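A minimal sketch of the K/N rule, assuming each class already has its terms ranked best-first. MergeTopTerms and rankedPerClass are hypothetical names chosen for illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// rankedPerClass maps a class label to that class's terms, sorted best-first.
std::set<std::string> MergeTopTerms(
        const std::map<std::string, std::vector<std::string> >& rankedPerClass,
        std::size_t totalK) {
    assert(!rankedPerClass.empty());
    const std::size_t perClass = totalK / rankedPerClass.size();  // K/N
    std::set<std::string> selected;
    for (const auto& entry : rankedPerClass) {
        const std::vector<std::string>& ranked = entry.second;
        const std::size_t take = std::min(perClass, ranked.size());
        selected.insert(ranked.begin(), ranked.begin() + take);
    }
    return selected;
}
```

Because a term can rank highly for more than one class, the merged set may contain fewer than K distinct terms.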

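Below is the formula for chi-square (same source: p. 256 of the original edition, p. 192 of Wang Bin's translation):

The formula above and the one below are equivalent: the one above can be derived from the one below. In a computer implementation we usually use the one above.

Both formulas construct a test statistic that follows a chi-square distribution. (In mathematical statistics, chi-square is often used to test whether two events are independent; when the observed counts exactly match the counts expected under independence, the statistic is 0. For background, see the hypothesis-testing chapters of a mathematical statistics text.)

If, like me, you wonder why the chi-square test statistic looks the way it does, see my blog post "Getting to the Bottom of It: The Mathematics Behind the Chi-square Feature Selection Method".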
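The two formula images are not reproduced here. The following is a reconstruction, on the assumption that the first is the closed form (it matches, term by term, the CalChiSquareValue function given later in this article) and the second is the textbook's observed-versus-expected definition; the notation E for expected counts is taken from the standard definition rather than from the lost images:

```latex
% Closed form used in the implementation
X^2(D,t,c) = \frac{(N_{11}+N_{10}+N_{01}+N_{00})\,(N_{11}N_{00}-N_{10}N_{01})^2}
                  {(N_{11}+N_{01})\,(N_{11}+N_{10})\,(N_{10}+N_{00})\,(N_{01}+N_{00})}

% Definition: observed counts N versus counts E expected under independence
X^2(D,t,c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}}
             \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}},
\qquad \text{e.g. } E_{11} = \frac{(N_{11}+N_{10})\,(N_{11}+N_{01})}{N_{11}+N_{10}+N_{01}+N_{00}}
```

Here the first subscript says whether the document contains the term and the second whether it belongs to the class.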

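Now for the concrete implementation of chi-square feature term selection.

The standard definition of the contingency table.

For a given term t and class c:

N11: the number of documents in the class that contain the term;

N10: the number of documents that contain the term but are not in the class;

N01: the number of documents in the class that do not contain the term;

N00: the number of documents in the training corpus that neither contain the term nor belong to the class.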
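To make the four cells concrete, here is a small sketch that counts them over a toy corpus. The Doc and Cells types and the CountCells function are hypothetical helpers for illustration and do not appear in the author's code:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy document: a class label plus the set of distinct terms it contains.
struct Doc {
    std::string label;
    std::set<std::string> terms;
};

struct Cells {
    int n11 = 0, n10 = 0, n01 = 0, n00 = 0;
};

// Counts the four contingency-table cells for one (term, class) pair.
Cells CountCells(const std::vector<Doc>& docs, const std::string& term,
                 const std::string& cls) {
    Cells c;
    for (const Doc& d : docs) {
        const bool hasTerm = d.terms.count(term) > 0;
        const bool inClass = (d.label == cls);
        if (hasTerm && inClass) ++c.n11;   // contains term, in class
        else if (hasTerm)       ++c.n10;   // contains term, not in class
        else if (inClass)       ++c.n01;   // lacks term, in class
        else                    ++c.n00;   // lacks term, not in class
    }
    // Every document falls into exactly one cell.
    assert(c.n11 + c.n10 + c.n01 + c.n00 == static_cast<int>(docs.size()));
    return c;
}
```

Scanning the whole corpus for every (term, class) pair like this is simple but slow, which is one reason the implementation below works from precomputed per-term data instead.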

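Before giving the implementation, here is a passage that is helpful when writing the program:

(Same source, p. 257.)

This passage suggests a data structure that stores, for a term, how often it is present and absent in each class: if there are n classes, each row of the structure stores N11 and N01 for one class. In my code I also call this structure a contingency table. It may differ slightly from the standard definition, but given N11 and N01, the standard contingency table is easy to obtain from the other data structures in the program.

The implementation follows. (If the code for some function is not given here, see the corresponding functions in the "K-means Text Clustering Series (completed)".)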
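A short sketch of that last step, assuming the term's document frequency and the total number of documents are available from the other data structures. FullTable and ExpandEntry are illustrative names; the arithmetic is the same as in ChiSquareFeatureSelectionForPerclass later in this article:

```cpp
#include <cassert>

struct FullTable {
    double n11, n10, n01, n00;
};

// n11, n01  : the compact (N11, N01) entry for one (term, class) pair
// docFreq   : number of documents containing the term (N11 + N10)
// totalDocs : number of documents in the training corpus
FullTable ExpandEntry(int n11, int n01, int docFreq, int totalDocs) {
    assert(n11 >= 0 && n01 >= 0);
    assert(n11 <= docFreq && docFreq + n01 <= totalDocs);
    FullTable t;
    t.n11 = n11;
    t.n01 = n01;
    t.n10 = docFreq - n11;              // term present, other classes
    t.n00 = totalDocs - docFreq - n01;  // term absent, other classes
    return t;
}
```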

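Main data structures used:

1. Dictionary: stores how many times a term occurs in each article of the training corpus. Type: map<string,vector<pair<int,int>> >

2. Contingency table (purpose described above). Type: map<pair<string,string>,pair<int,int> >

The map key has two parts: the first string is the term and the second string is the class. In the value, the first int is N11 and the second int is N01.

The function that builds the contingency table: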
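For orientation, here is a sketch of how a dictionary of this shape could be filled. It assumes each pair is (article id, occurrence count); the article id part is consistent with how GetContingencyTable uses it->first, while the count part is inferred from the description above. BuildDictionary and tokensByArticle are hypothetical names; the author's own construction code is in the K-means series:

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

using Dictionary = std::map<std::string, std::vector<std::pair<int, int> > >;

// tokensByArticle maps an article id to its (already segmented) tokens.
Dictionary BuildDictionary(const std::map<int, std::vector<std::string> >& tokensByArticle) {
    Dictionary dict;
    for (const auto& article : tokensByArticle) {
        // Count occurrences of each term within this article.
        std::map<std::string, int> counts;
        for (const std::string& tok : article.second) {
            ++counts[tok];
        }
        // Append one (article id, count) posting per distinct term.
        for (const auto& tc : counts) {
            dict[tc.first].emplace_back(article.first, tc.second);
        }
    }
    return dict;
}
```

With this layout, the length of a term's vector is its document frequency, which is what the later code reads via mymap[term].size().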

/************************************************************************/
/* Build the contingency table for every term.
 * The map key is the pair (term text, class label). In the value
 * pair<int,int>, the first int is the number of articles in class c that
 * contain term t; the second int is the number of articles in that class
 * that do not contain term t. */
/************************************************************************/
map<pair<string,string>,pair<int,int> > Preprocess::GetContingencyTable(map<string,vector<pair<int,int> > > &mymap, vector<string> classLabels)
{
    clock_t start,finish;
    double totaltime;
    start=clock();
    map<string,vector<int> > articleIdsEachClass=GetArticleIdinEachClass(classLabels);
    map<pair<string,string>,pair<int,int> > EntireContigencytable;
    // for every term in the bag-of-words model
    for(map<string,vector<pair<int,int> > >::iterator it=mymap.begin();it!=mymap.end();++it)
    {
        // skip empty or blank terms (the original used ||, which is always true)
        if(it->first!=""&&it->first!=" ")
        {
            // for every class
            for(map<string,vector<int> >::iterator it1=articleIdsEachClass.begin();it1!=articleIdsEachClass.end();it1++)
            {
                int cntTheClass=(it1->second).size(); // number of articles in this class
                int termInTheClass=0; // number of articles in this class that contain the term
                for(vector<pair<int,int> >::iterator it2=(it->second).begin();it2!=(it->second).end();it2++)
                {
                    // it2->first is an article id; count whether it belongs to this class
                    termInTheClass+=count((it1->second).begin(),it1->second.end(),it2->first);
                }
                int termAbsentInTheClass=cntTheClass-termInTheClass;
                pair<string,string> compoundKey=make_pair(it->first,it1->first);
                pair<int,int> valueInfo=make_pair(termInTheClass,termAbsentInTheClass);
                EntireContigencytable[compoundKey]=valueInfo;
            }
        }
    }
    finish=clock();
    totaltime=(double)(finish-start)/CLOCKS_PER_SEC;
    cout<<"Time to build the contingency table: "<<totaltime<<endl;
    return EntireContigencytable;
}

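Building the contingency table takes far longer than serializing a finished table to disk and reading it back into memory when needed (on my machine, building the contingency table took 233.41 sec, while loading it from disk into memory took 0.954 sec), so here are functions to serialize and deserialize the contingency table.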

/************************************************************************/
/* Save the contingency table to the local disk                         */
/************************************************************************/
void Preprocess::SaveContingencyTable(map<pair<string,string>,pair<int,int> > &contingencyTable)
{
    ofstream outfile("F:\\Cluster\\contingency.dat",ios::binary);
    for(map<pair<string,string>,pair<int,int> >::iterator it=contingencyTable.begin();it!=contingencyTable.end();it++)
    {
        // one record per line: term, class label, N11, N01
        outfile<<(it->first).first<<" "<<(it->first).second<<" "<<(it->second).first<<" "<<(it->second).second<<endl;
    }
    outfile.close();
}
/************************************************************************/
/* Load the contingency table from disk into memory                     */
/************************************************************************/
void Preprocess::LoadContingencyTable(map<pair<string,string>,pair<int,int> > &contingencyTable)
{
    clock_t start,finish;
    double totaltime;
    start=clock();
    ifstream infile("F:\\Cluster\\contingency.dat",ios::binary);
    string termtext="";
    string classLabel="";
    int presentNum=0; // number of articles under classLabel that contain the term (multiplicity not counted)
    int absentNum=0;  // number of articles under classLabel that do not contain the term
    // Stop as soon as a read fails; testing eof() before reading would
    // process one extra, failed read at the end of the file.
    while(infile>>termtext>>classLabel>>presentNum>>absentNum)
    {
        pair<string,string> compoundKey=make_pair(termtext,classLabel);
        pair<int,int> valinfo=make_pair(presentNum,absentNum);
        contingencyTable[compoundKey]=valinfo;
    }
    infile.close();
    finish=clock();
    totaltime=(double)(finish-start)/CLOCKS_PER_SEC;
    cout<<"Time to load the contingency table into memory: "<<totaltime<<endl;
}
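The on-disk format above is plain text, one "term class N11 N01" record per line. As a quick way to check the round trip without touching the hard-coded file path, here is a sketch of the same format written against generic streams. WriteTable and ReadTable are hypothetical free functions, not members of Preprocess, and like the original format they assume that terms and class labels contain no whitespace:

```cpp
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

using ContingencyTable =
    std::map<std::pair<std::string, std::string>, std::pair<int, int> >;

// Writes one "term class N11 N01" record per line.
void WriteTable(const ContingencyTable& table, std::ostream& out) {
    for (const auto& entry : table) {
        out << entry.first.first << ' ' << entry.first.second << ' '
            << entry.second.first << ' ' << entry.second.second << '\n';
    }
}

// Reads records until extraction fails (end of input or malformed record).
ContingencyTable ReadTable(std::istream& in) {
    ContingencyTable table;
    std::string term, label;
    int present = 0, absent = 0;
    while (in >> term >> label >> present >> absent) {
        table[std::make_pair(term, label)] = std::make_pair(present, absent);
    }
    return table;
}
```

Writing a table into a std::stringstream and reading it back should give an equal map, which makes the format easy to unit-test.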

The function that computes the chi-square value:

/************************************************************************/
/* Compute the chi-square value                                         */
/************************************************************************/
double Preprocess::CalChiSquareValue(double N11,double N10,double N01,double N00)
{
    double chiSquare=0;
    chiSquare=(N11+N10+N01+N00)*pow((N11*N00-N10*N01),2)/((N11+N01)*(N11+N10)*(N10+N00)*(N01+N00));
    return chiSquare;
}

For each class, compute the chi-square value of every term and sort the terms by chi-square value in descending order:

/************************************************************************/
/* Compute the chi-square value of every term in the bag of words       */
/* with respect to one class                                            */
/************************************************************************/
vector<pair<string,double> > Preprocess::ChiSquareFeatureSelectionForPerclass(map<string,vector<pair<int,int> > > &mymap,map<pair<string,string>,pair<int,int> > &contingencyTable,string classLabel)
{
    int N=endIndex-beginIndex+1; // total number of articles
    vector<string> tempvector;   // all terms in the bag of words
    vector<pair<string,double> > chisquareInfo;
    for(map<string,vector<pair<int,int> > >::iterator it=mymap.begin();it!=mymap.end();++it)
    {
        tempvector.push_back(it->first);
    }
    // compute the chi-square values
    for(vector<string>::iterator ittmp=tempvector.begin();ittmp!=tempvector.end();ittmp++)
    {
        int N1=mymap[*ittmp].size(); // number of articles that contain the term
        pair<string,string> compoundKey=make_pair(*ittmp,classLabel);
        double N11=double(contingencyTable[compoundKey].first);
        double N01=double(contingencyTable[compoundKey].second);
        double N10=double(N1-N11);
        double N00=double(N-N1-N01);
        double chiValue=CalChiSquareValue(N11,N10,N01,N00);
        chisquareInfo.push_back(make_pair(*ittmp,chiValue));
    }
    // sort the terms by chi-square value, largest first
    stable_sort(chisquareInfo.begin(),chisquareInfo.end(),isLarger);
    /*
    ofstream outfile("F:\\Cluster\\other.dat");
    int finalKeyWordsCount=0;
    for(vector<pair<string,double> >::size_type j=0;j<chisquareInfo.size();j++)
    {
        outfile<<chisquareInfo[j].first<<";"<<chisquareInfo[j].second<<endl;
        finalKeyWordsCount++;
    }
    outfile.close();
    */
    return chisquareInfo;
}
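The comparator isLarger passed to stable_sort is not shown in this article (it presumably lives with the other helper functions in the K-means series). A plausible definition, offered only as an assumption, compares the chi-square values so that larger values come first:

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical definition: orders (term, score) pairs by descending score.
bool isLarger(const std::pair<std::string, double>& a,
              const std::pair<std::string, double>& b) {
    return a.second > b.second;
}

// Illustrative helper: sorts in place, best score first.
void RankByScore(std::vector<std::pair<std::string, double> >& scored) {
    std::stable_sort(scored.begin(), scored.end(), isLarger);
}
```

Because stable_sort preserves the relative order of equal elements, terms with the same chi-square value keep the order in which they were pushed, which here is the alphabetical order of the map keys.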

Chi-square feature term selection for the whole classification problem. In this example there are three classes.

/************************************************************************/
/* Chi-square feature term selection algorithm                          */
/************************************************************************/
void Preprocess::ChiSquareFeatureSelection(map<string,vector<pair<int,int> > > &mymap,map<pair<string,string>,pair<int,int> > &contingencyTable,int N)
{
    clock_t start,finish;
    double totaltime;
    start=clock();
    // hard-coded per-class counts used to split the N feature terms proportionally
    int N1=18693;
    int N2=23822;
    int N3=15717;
    int threshold1=N1*N/(N1+N2+N3);
    int threshold2=N2*N/(N1+N2+N3);
    int threshold3=N3*N/(N1+N2+N3);
    string classlabel1="xxxx";
    string classlabel2="yyyy";
    string classlabel3="zzzz";
    vector<pair<string,double> > chisquareInfo1;
    vector<pair<string,double> > chisquareInfo2;
    vector<pair<string,double> > chisquareInfo3;
    // each result is already sorted by chi-square value, largest first
    chisquareInfo1=ChiSquareFeatureSelectionForPerclass(mymap,contingencyTable,classlabel1);
    chisquareInfo2=ChiSquareFeatureSelectionForPerclass(mymap,contingencyTable,classlabel2);
    chisquareInfo3=ChiSquareFeatureSelectionForPerclass(mymap,contingencyTable,classlabel3);
    cout<<"finish ChiSquare Calculation"<<endl;
    set<string> finalKeywords;
    // take the top terms of each class; the size check guards against a
    // threshold larger than the vocabulary
    for(vector<pair<string,double> >::size_type j=0;j<(vector<pair<string,double> >::size_type)threshold1&&j<chisquareInfo1.size();j++)
    {
        finalKeywords.insert(chisquareInfo1[j].first);
    }
    for(vector<pair<string,double> >::size_type j=0;j<(vector<pair<string,double> >::size_type)threshold2&&j<chisquareInfo2.size();j++)
    {
        finalKeywords.insert(chisquareInfo2[j].first);
    }
    // the original loop used threshold2 here; the third class needs threshold3
    for(vector<pair<string,double> >::size_type j=0;j<(vector<pair<string,double> >::size_type)threshold3&&j<chisquareInfo3.size();j++)
    {
        finalKeywords.insert(chisquareInfo3[j].first);
    }
    ofstream outfile(featurewordsAddress);
    int finalKeyWordsCount=finalKeywords.size();
    for(set<string>::iterator it=finalKeywords.begin();it!=finalKeywords.end();it++)
    {
        outfile<<*it<<endl;
    }
    outfile.close();
    cout<<"Number of feature terms selected: "<<finalKeyWordsCount<<endl;
    finish=clock();
    totaltime=(double)(finish-start)/CLOCKS_PER_SEC;
    cout<<"Time spent selecting feature terms: "<<totaltime<<endl;
}
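The function above hard-codes three classes and splits the N requested terms in proportion to the per-class counts rather than evenly. The same quota computation for an arbitrary number of classes might look like the sketch below; ProportionalQuotas is a hypothetical helper, and 64-bit integers are used only to keep the intermediate product from overflowing:

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Returns, for each class, count_i * totalK / sum(counts), rounded down,
// mirroring threshold1..threshold3 in ChiSquareFeatureSelection.
std::vector<long long> ProportionalQuotas(const std::vector<long long>& classCounts,
                                          long long totalK) {
    const long long sum =
        std::accumulate(classCounts.begin(), classCounts.end(), 0LL);
    assert(sum > 0 && totalK >= 0);
    std::vector<long long> quotas;
    quotas.reserve(classCounts.size());
    for (long long c : classCounts) {
        quotas.push_back(c * totalK / sum);
    }
    return quotas;
}
```

Because each quota is rounded down, the quotas can add up to slightly less than the requested total, and duplicates across classes reduce the final count further.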

