Classify the compositions crawled last time (essays about father, mother, teacher and so on) with sklearn.neighbors.KNeighborsClassifier.
import jieba
import pandas as pd
import numpy as np
import os
import itertools
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import PCA

# Read the file contents
path = 'E:\\Composition'
corpos = pd.DataFrame(columns=['filepath', 'text', 'kind'])
for root, dirs, files in os.walk(path):
    for name in files:
        filepath = root + '\\' + name
        f = open(filepath, 'r', encoding='utf-8')
        text = f.read()
        text = ''.join(text.split('\n'))
        kind = root.split('\\')[-1]
        corpos.loc[len(corpos)] = [filepath, text.strip(), kind]

# Set the stop words and build the word-frequency matrix
stopwords = pd.read_csv('stopwords.txt', encoding='utf-8', sep='\n')

def tokenizer(s):
    words = []
    cut = jieba.cut(s)
    for word in cut:
        words.append(word)
    return words

count = CountVectorizer(tokenizer=tokenizer, stop_words=list(stopwords['stopword']))
countvector = count.fit_transform(corpos.iloc[:, 1]).toarray()

# Convert the categories to numbers
kind = np.unique(corpos['kind'].values)
nkind = np.zeros(700)
for i in range(len(kind)):
    index = corpos[corpos['kind'] == kind[i]].index
    nkind[index] = i + 1

# Reduce the word-frequency matrix to two dimensions and plot it
pca = PCA(n_components=2)
newvector = pca.fit_transform(countvector)
plt.figure()
for i, c, m in zip(range(len(kind)), ['r', 'b', 'g', 'y'], ['o', '^', '>', '<']):
    index = corpos[corpos['kind'] == kind[i]].index
    x = newvector[index, 0]
    y = newvector[index, 1]
    plt.scatter(x, y, c=c, marker=m, label=kind[i])
plt.legend()
plt.xlim(-5, 10)
plt.ylim(-20, 50)
plt.xlabel('x label')
plt.ylabel('y label')

# Randomly select a test set
index = np.random.randint(0, 700, 200)
x_test = countvector[index]
y_test = corpos.iloc[index, 2]

# Classify with KNN
knn = KNeighborsClassifier()
knn.fit(countvector, corpos.iloc[:, 2])
y_pred = knn.predict(x_test)
knn.score(x_test, y_test)

# Confusion matrix for the KNN classification results
knn_confusion = confusion_matrix(y_test, y_pred)
'''
array([[1, 0, 3],
       [8, 0, 1],
       [1, 0,  , 1],
       [9, 1, 2, 24]])
'''
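One caveat about this score: the 200 test documents are drawn at random from the same 700 documents the classifier was fitted on, so the reported accuracy is optimistic. A minimal sketch of a held-out evaluation with sklearn's train_test_split, reusing countvector and the label column from above (the variable names here are just illustrative):

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Keep 30% of the compositions aside so the model is scored on documents
# it never saw during fitting.
x_train, x_hold, y_train, y_hold = train_test_split(
    countvector, corpos.iloc[:, 2], test_size=0.3, random_state=0)

knn_holdout = KNeighborsClassifier()
knn_holdout.fit(x_train, y_train)
print(knn_holdout.score(x_hold, y_hold))  # accuracy on the held-out documents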
plt.imshow(knn_confusion, interpolation='nearest', cmap=plt.cm.Oranges)
plt.xlabel('y_pred')
plt.ylabel('y_true')
tick_marks = np.arange(len(kind))
plt.xticks(tick_marks, kind, rotation=90)
plt.yticks(tick_marks, kind)
plt.colorbar()
plt.title('confusion_matrix')
for i, j in itertools.product(range(len(knn_confusion)), range(len(knn_confusion))):
    plt.text(i, j, knn_confusion[j, i], horizontalalignment="center")
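For reference, scikit-learn 0.22 and later ships a helper that draws essentially the same heatmap without the manual imshow and plt.text loop; a short sketch reusing knn_confusion and kind from above:

from sklearn.metrics import ConfusionMatrixDisplay

# Annotated heatmap with the category names on both axes.
disp = ConfusionMatrixDisplay(confusion_matrix=knn_confusion, display_labels=kind)
disp.plot(cmap=plt.cm.Oranges, xticks_rotation=90)
plt.show()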
The data scatter plot is as follows:
[figure: scatter plot of the PCA-reduced word-frequency vectors]
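Because the scatter plot only shows the first two principal components, it is worth checking how much of the total variance those two components actually capture; a small check using the pca object fitted above:

# Fraction of the variance explained by each of the two plotted components,
# and their combined total.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())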
The confusion matrix for the KNN classification results is as follows:

[figure: confusion matrix heatmap]
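The same predictions can also be summarised as per-class precision and recall, which is often easier to read than the raw counts; a small sketch with sklearn's classification_report, reusing y_test and y_pred from the code above:

from sklearn.metrics import classification_report

# Precision, recall and F1 for each composition category.
print(classification_report(y_test, y_pred))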