Meituan Shop Review Natural Language Processing and Classification (NLP)
- The first post in this series covered data analysis.
- The second post covered visualization.
- This third post covers text classification.
- The main packages used are jieba, sklearn and pandas. This post relies on the bag-of-words model: each document is converted into a numerical feature vector, most of whose entries are zero; the non-zero values are raw term frequencies (TF), so the resulting matrix is a sparse matrix. A minimal sketch of the idea follows this list.
- Further algorithm models will be built in follow-up posts.
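As a rough illustration of the bag-of-words representation described above, here is a minimal sketch on two made-up, pre-segmented sentences (not taken from the Meituan data); it uses the same CountVectorizer that appears later in this post.

from sklearn.feature_extraction.text import CountVectorizer

# two toy "documents" that are already whitespace-tokenized,
# just like the jieba-cut comments used later in this post
docs = ["烤鸭 好吃 好吃 推荐", "羊肉串 环境 一般"]

toy_vect = CountVectorizer()            # default settings keep tokens of two or more characters
dtm = toy_vect.fit_transform(docs)      # sparse document-term matrix
print(toy_vect.get_feature_names())     # the learned vocabulary
print(dtm.toarray())                    # raw term frequencies (TF), mostly zeros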
Import the common data analysis libraries
import pandas as pd
import numpy as np
df = pd.read_excel("all_data_meituan.xlsx")[["comment", "star"]]
df.head()
- View the size of the DataFrame
df.shape
(17400, 2)
df['sentiment'] = df['star'].apply(lambda x: 1 if x > 30 else 0)  # star > 30 counts as a positive review
df = df.drop_duplicates()  # drop duplicate comments, leaving 1406 texts; the data is then copied to three times its original size below
df = df.dropna()
X = pd.concat([df[['comment']], df[['comment']], df[['comment']]])
y = pd.concat([df.sentiment, df.sentiment, df.sentiment])
X.columns = ['comment']
X = X.reset_index(drop=True)  # rebuild a clean 0..n-1 index after concatenation
X.shape
(3138, 1)
import jieba  # import the Chinese word-segmentation library

def chinese_word_cut(mytext):
    return " ".join(jieba.cut(mytext))  # segment the text and join the tokens with spaces

X['cut_comment'] = X["comment"].apply(chinese_word_cut)
X['cut_comment'].head()
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\HUANG_~1\AppData\Local\Temp\jieba.cache
Loading model cost 0.880 seconds.
Prefix dict has been built succesfully.
0    还行 吧 , 建议 不要 排队 那个 烤鸭 和 羊肉串 , 因为 烤肉 时间 本来 就 不够...
1    去过 好 几次 了 东西 还是 老 样子 没 增添 什么 新花样 环境 倒 是 ...
2    一个 字 : 好 ! ! ! # 羊肉串 # # 五花肉 # # 牛舌 # ...
3    第一次 来 吃 , 之前 看过 好多 推荐 说 这个 好吃 , 真的 抱 了 好 大 希望 ...
4    羊肉串 真的 不太 好吃 , 那种 说 膻 不 膻 说 臭 不 臭 的 味 。 烤鸭 还 行...
Name: cut_comment, dtype: object
- Import the data-splitting helper from sklearn and set the test set size; shuffling defaults to True.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.25)
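Since the positive and negative classes are imbalanced, an optional variant (not used in the rest of this post) is to pass stratify so that both splits keep roughly the same class ratio:

# optional alternative split that preserves the class proportions (illustrative only)
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, random_state=42, test_size=0.25, stratify=y)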
def get_custom_stopwords(stop_words_file):
    with open(stop_words_file, encoding="utf-8") as f:
        custom_stopwords_list = [i.strip() for i in f.readlines()]
    return custom_stopwords_list
stop_words_file = "stopwords.txt"
stopwords = get_custom_stopwords(stop_words_file)  # load the stop-word list
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()  # instantiate the vectorizer
vect  # inspect the default parameters
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
# dir(vect)  # list the attributes of vect
- Fit and transform the segmented training text; the resulting sparse matrix has size 2353 x 1965.
vect.fit_transform(X_train["cut_comment"])
<2353x1965 sparse matrix of type '<class 'numpy.int64'>' with 20491 stored elements in Compressed Sparse Row format>
vect.fit_transform(X_train["cut_comment"]).toarray().shape
(2353, 1965)
# convert the document-term matrix into a DataFrame and inspect the first 25 columns
pd.DataFrame(vect.fit_transform(X_train["cut_comment"]).toarray(),
             columns=vect.get_feature_names()).iloc[:, 0:25].head()
# print(vect.get_feature_names())
# 1965 features, which is not very large (stop words not removed yet)
- There are many purely numeric and otherwise meaningless features, so the vectorizer is re-instantiated with a regular expression that drops tokens starting with a digit, and with the stop-word list passed in.
vect = CountVectorizer(token_pattern=u'(?u)\\b[^\\d\\W]\\w+\\b',
                       stop_words=frozenset(stopwords))  # remove stop words; the pattern drops tokens that start with a digit
pd.DataFrame(vect.fit_transform(X_train['cut_comment']).toarray(),
             columns=vect.get_feature_names()).head()
# 1691 columns: dropping the numeric features removes nearly three hundred columns, from 1965 down to 1691
# max_df = 0.8  # drop tokens that appear in more than this fraction of documents (too common); adjust as needed
# min_df = 3    # drop tokens that appear in fewer than this many documents (too rare); adjust as needed
- Result after removing the numeric features.
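The max_df / min_df parameters mentioned in the comments above are not applied in this post; a hedged sketch of how they could be combined with the same token pattern and stop-word list (the thresholds are illustrative):

vect_pruned = CountVectorizer(token_pattern=u'(?u)\\b[^\\d\\W]\\w+\\b',
                              stop_words=frozenset(stopwords),
                              max_df=0.8,  # drop tokens that appear in more than 80% of documents
                              min_df=3)    # drop tokens that appear in fewer than 3 documents
vect_pruned.fit_transform(X_train['cut_comment']).shape  # fewer than the 1691 columns reported above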
Model Building
- Import multinomial naive Bayes (MultinomialNB) from sklearn.naive_bayes.
- Naive Bayes is commonly used for text classification such as spam filtering: it is fast, and its accuracy is usually not much worse than heavier models.
- The MultinomialNB class works with its default parameters; if the prediction quality is not good enough, they can be tuned (a hedged tuning sketch follows the pipeline fit below).
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
from sklearn.pipeline import make_pipeline  # make_pipeline builds a Pipeline with auto-generated step names

pipe = make_pipeline(vect, nb)
pipe.steps  # inspect the pipeline steps (equivalent to building a Pipeline explicitly)
[('countvectorizer',
  CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=frozenset({...}),
        strip_accents=None, token_pattern='(?u)\\b[^\\d\\W]\\w+\\b',
        tokenizer=None, vocabulary=None)),
 ('multinomialnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]
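As the output shows, make_pipeline simply builds an ordinary Pipeline and names each step after its lower-cased class name, which is why the steps appear as 'countvectorizer' and 'multinomialnb'. A minimal equivalent sketch:

from sklearn.pipeline import Pipeline

# equivalent to make_pipeline(vect, nb), only with the step names written out explicitly
pipe = Pipeline([
    ('countvectorizer', vect),  # step 1: bag-of-words vectorizer
    ('multinomialnb', nb),      # step 2: multinomial naive Bayes classifier
])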
pipe.fit(X_train.cut_comment, y_train)
Pipeline(memory=None, steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict', dtype=<class 'numpy.int64'>, encoding='utf-8', input='content', lowercase=True, max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None, stop_words=...e, vocabulary=None)), ('multinomialnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
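As mentioned in the model-building notes, if the default MultinomialNB settings are not accurate enough, the smoothing parameter alpha can be tuned on the training set. A hedged sketch using GridSearchCV (the grid values are illustrative):

from sklearn.model_selection import GridSearchCV

# search over the smoothing strength of the naive Bayes step inside the pipeline
param_grid = {'multinomialnb__alpha': [0.1, 0.5, 1.0, 2.0]}  # illustrative values
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_train.cut_comment, y_train)
print(search.best_params_, search.best_score_)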
Test set Prediction Results
y_pred = pipe.predict(X_test.cut_comment)  # predict on the test set (this covers both the transform and the prediction step)
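Because sklearn's Pipeline fits its steps in place, the same prediction can also be written as an explicit transform-then-predict, which is what the comment above means; a sketch for illustration only:

# equivalent to pipe.predict: vectorize the test comments, then classify them
y_pred_manual = nb.predict(vect.transform(X_test.cut_comment))
(y_pred_manual == y_pred).all()  # should be True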
# accuracy of the model on the test set
from sklearn import metrics

metrics.accuracy_score(y_test, y_pred)
0.82929936305732488
# confusion matrix of the model on the test set
metrics.confusion_matrix(y_test, y_pred)
# prediction results on the test set: 474 true positives, 112 false positives, 22 false negatives, 177 true negatives
array([[177, 112], [ 22, 474]], dtype=int64)
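In sklearn's convention the rows of the confusion matrix are the true labels and the columns the predicted labels, so the four counts in the comment can be read off directly (a minimal sketch):

# ravel() flattens the 2x2 matrix in row-major order: TN, FP, FN, TP
tn, fp, fn, tp = metrics.confusion_matrix(y_test, y_pred).ravel()
print(tn, fp, fn, tp)  # 177, 112, 22, 474 for this split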
import matplotlib.pyplot as plt

def get_confusion_matrix(conf, clas):
    """Plot a confusion matrix with the class labels on both axes."""
    fig, ax = plt.subplots(figsize=(2.5, 2.5))
    ax.matshow(conf, cmap=plt.cm.Blues, alpha=0.3)
    tick_marks = np.arange(len(clas))
    plt.xticks(tick_marks, clas, rotation=45)
    plt.yticks(tick_marks, clas)
    for i in range(conf.shape[0]):
        for j in range(conf.shape[1]):
            # x is the column (predicted label), y is the row (true label)
            ax.text(x=j, y=i, s=conf[i, j], va='center', ha='center')
    plt.xlabel("predicted label")
    plt.ylabel("true label")
conf = metrics.confusion_matrix(y_test, y_pred)
class_names = np.array(['0', '1'])
get_confusion_matrix(np.array(conf), clas=class_names)
plt.show()
Prediction on the Entire Dataset
y_pred_all = pipe.predict(X['cut_comment'])
metrics.accuracy_score(y, y_pred_all)
# accuracy on the full dataset; it is higher than on the test set, which suggests some overfitting
0.85659655831739967
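A less optimistic estimate than scoring data the model has already seen is k-fold cross-validation of the whole pipeline (a hedged sketch; the fold count is illustrative, and note that because the comments were triplicated earlier, duplicates can still leak across folds):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy of the vectorize + naive Bayes pipeline
scores = cross_val_score(pipe, X['cut_comment'], y, cv=5, scoring='accuracy')
print(scores.mean())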
metrics.confusion_matrix(y, y_pred_all)  # confusion matrix for the whole dataset
array([[ 801, 369], [ 81, 1887]], dtype=int64)
y.value_counts()  # number of positive and negative samples in the original data
1    1968
0    1170
Name: sentiment, dtype: int64
metrics.f1_score(y_true=y, y_pred=y_pred_all)  # F1 score of the model on the whole dataset
0.89346590909090906
metrics.recall_score(y, y_pred_all)  # recall: the fraction of all positive samples that are detected, TP / (FN + TP)
0.95884146341463417
metrics.precision_score(y, y_pred_all)  # precision: the fraction of predicted positives that are truly positive, TP / (FP + TP)
0.83643617021276595
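As a sanity check, the recall, precision and F1 values above can be recomputed by hand from the whole-dataset confusion matrix printed earlier (a minimal sketch):

# counts from metrics.confusion_matrix(y, y_pred_all): rows are true labels, columns are predictions
tn, fp, fn, tp = 801, 369, 81, 1887

recall = tp / (fn + tp)                             # ≈ 0.9588
precision = tp / (fp + tp)                          # ≈ 0.8364
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.8935
print(recall, precision, f1)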
print(metrics.classification_report(y, y_pred_all))  # classification report
             precision    recall  f1-score   support

          0       0.91      0.68      0.78      1170
          1       0.84      0.96      0.89      1968

avg / total       0.86      0.86      0.85      3138