Text classification of Meituan merchant reviews with Python, pandas, and sklearn

Source: Internet
Author: User
Meituan shop reviews: language processing and classification (NLP)
    • The first part of this series covered data analysis
    • The second part covered visualization
    • This article is the third in the series and covers text classification
    • The main packages used are jieba, sklearn, and pandas. This post mainly uses the bag-of-words model, which represents each text as a numerical feature vector. One feature vector is built per document and most of its entries are 0. The nonzero values are the raw term frequencies, TF (term frequency), and the resulting matrix is a sparse matrix
    • Further algorithmic models will be built in later posts
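As a rough illustration of the bag-of-words idea described above, the following sketch builds raw term-frequency vectors by hand. The three toy documents and the helper name bag_of_words are made up for this example; the article itself relies on sklearn's CountVectorizer for this step.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, matrix) with one raw term-frequency row per document.

    Hypothetical helper for illustration only; docs are space-separated tokens.
    """
    tokenized = [doc.split() for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        # One column per vocabulary term; absent terms get a count of 0.
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Toy documents (assumed, already segmented into words).
docs = ["羊肉串 好吃 好吃", "烤鸭 好吃", "环境 一般"]
vocabulary, matrix = bag_of_words(docs)

print(vocabulary)
for row in matrix:
    print(row)
```

Even with three short documents most cells are 0, which is why the real feature matrix is stored in a sparse format.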
Import the common data-analysis libraries
import pandas as pd
import numpy as np
    • Read the file
df = pd.read_excel("all_data_meituan.xlsx")[["comment", "star"]]
df.head()

    • Check the size of the DataFrame
df.shape
(17400, 2)
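df['sentiment'] = df['star'].apply(lambda x: 1 if x > 30 else 0)  # label 1 if star > 30, otherwise 0
df = df.drop_duplicates()  # remove duplicate reviews; 1406 texts remain, and we then replicate the data to three times its original size
df = df.dropna()
X = pd.concat([df[['comment']], df[['comment']], df[['comment']]])
y = pd.concat([df.sentiment, df.sentiment, df.sentiment])
X.columns = ['comment']
X = X.reset_index(drop=True)  # renumber the rows after concatenation
y = y.reset_index(drop=True)  # keep the labels aligned with X
X.shape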
(3138, 1)
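import jieba  # import the word segmentation library

def chinese_word_cut(mytext):
    # join the segmented words with spaces so CountVectorizer can tokenize them
    return " ".join(jieba.cut(mytext))

X['cut_comment'] = X["comment"].apply(chinese_word_cut)
X['cut_comment'].head()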
Building prefix dict from the default dictionary ...
DEBUG:jieba:Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\HUANG_~1\AppData\Local\Temp\jieba.cache
DEBUG:jieba:Loading model from cache C:\Users\HUANG_~1\AppData\Local\Temp\jieba.cache
Loading model cost 0.880 seconds.
DEBUG:jieba:Loading model cost 0.880 seconds.
Prefix dict has been built succesfully.
DEBUG:jieba:Prefix dict has been built succesfully.
0    还行 吧 , 建议 不要 排队 那个 烤鸭 和 羊肉串 , 因为 烤肉 时间 本来 就 不够...
1    去过 好 几次 了   东西 还是 老 样子   没 增添 什么 新花样   环境 倒 是 ...
2    一个 字 : 好 ! ! !   # 羊肉串 #   # 五花肉 #   # 牛舌 #   ...
3    第一次 来 吃 , 之前 看过 好多 推荐 说 这个 好吃 , 真的 抱 了 好 大 希望 ...
4    羊肉串 真的 不太 好吃 , 那种 说 膻 不 膻 说 臭 不 臭 的 味 。 烤鸭 还 行...
Name: cut_comment, dtype: object
    • Import sklearn's data-splitting function and set the test set size; shuffle defaults to True
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.25)
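To make the effect of test_size=0.25 concrete, here is a small sketch that splits a stand-in for the 3138 rows of X. The dummy rows and labels are assumptions for illustration; only the row count and the split arguments come from the article.

```python
from sklearn.model_selection import train_test_split

# Stand-in data: 3138 row ids and dummy binary labels (not the real reviews).
rows = list(range(3138))
labels = [i % 2 for i in rows]

train_rows, test_rows, train_labels, test_labels = train_test_split(
    rows, labels, random_state=42, test_size=0.25
)

# The training size is the row count of the feature matrix fitted later on.
print(len(train_rows), len(test_rows))
```

The 2353 training rows are what later appears as the first dimension of the sparse matrix, and the remaining 785 rows form the test set.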
    • Load the stop words
def get_custom_stopwords(stop_words_file):
    # read one stop word per line from a UTF-8 text file
    with open(stop_words_file, encoding="utf-8") as f:
        custom_stopwords_list = [i.strip() for i in f.readlines()]
    return custom_stopwords_list

stop_words_file = "stopwords.txt"
stopwords = get_custom_stopwords(stop_words_file)  # load the stop words
    • Import the bag-of-words model
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()  # instantiate
vect  # view the parameters
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
# dir(vect)  # view the attributes of vect
    • Apply fit_transform to the segmented text; the resulting sparse matrix has shape 2353 x 1965
vect.fit_transform(X_train["cut_comment"])
<2353x1965 sparse matrix of type '<class 'numpy.int64'>'    with 20491 stored elements in Compressed Sparse Row format>
vect.fit_transform(X_train["cut_comment"]).toarray().shape
(2353, 1965)
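pd.DataFrame(vect.fit_transform(X_train["cut_comment"]).toarray(), columns=vect.get_feature_names()).iloc[:, 0:25].head()
# on scikit-learn 1.0 and later, use vect.get_feature_names_out() (get_feature_names() was removed in 1.2)
# print(vect.get_feature_names())
# the data has 1965 dimensions, which is not very large (no stop words used yet)
# convert it to a DataFrame
    • The output contains many numbers and other meaningless features, so we pass two parameters when instantiating the vectorizer: a regular expression in token_pattern that filters out these features, and stop_words to remove the stop words
vect = CountVectorizer(token_pattern=u'(?u)\\b[^\\d\\W]\\w+\\b', stop_words=frozenset(stopwords))  # remove stop words; the pattern skips tokens that start with a digit
pd.DataFrame(vect.fit_transform(X_train['cut_comment']).toarray(), columns=vect.get_feature_names()).head()
# 1691 columns: dropping the columns whose features are numbers removes nearly 300 columns, from 1965 down to 1691
# max_df = 0.8  # drop terms that appear in more than this fraction of documents (too common); set as needed
# min_df = 3  # drop terms that appear in fewer than this number of documents (too rare); set as needed
    • Result after removing the numeric features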
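The effect of the custom token_pattern can be checked with Python's re module alone. This is a minimal sketch on a made-up segmented sentence; the two patterns are the default and the custom one shown above, written as raw strings.

```python
import re

# Default CountVectorizer pattern: any token of two or more word characters.
default_pattern = r"(?u)\b\w\w+\b"
# Custom pattern from the article: the first character must be a word
# character that is not a digit.
custom_pattern = r"(?u)\b[^\d\W]\w+\b"

# Made-up, already segmented text (assumption for illustration).
text = "还行 吧 36 5元 羊肉串 好吃"

default_tokens = re.findall(default_pattern, text)
custom_tokens = re.findall(custom_pattern, text)

print(default_tokens)  # keeps '36' and '5元'
print(custom_tokens)   # drops tokens that start with a digit
```

Both patterns also drop single-character tokens such as 吧, because they require at least two word characters.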

Model Building
    • Import multinomial naive Bayes from sklearn's naive_bayes module
    • Naive Bayes is commonly used for text classification tasks such as spam filtering; it is fast, and its results are usually not much worse than those of other methods
    • The MultinomialNB class can be used with its default parameters; if the model's predictive performance does not meet the requirements, the parameters can be tuned
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
from sklearn.pipeline import make_pipeline  # import the make_pipeline helper
pipe = make_pipeline(vect, nb)
pipe.steps  # view the pipeline's steps (make_pipeline works like Pipeline)
[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=frozenset({...}), strip_accents=None,
        token_pattern='(?u)\\b[^\\d\\W]\\w+\\b', tokenizer=None, vocabulary=None)),
 ('multinomialnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]
pipe.fit(X_train.cut_comment, y_train)
Pipeline(memory=None,
     steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=...e, vocabulary=None)), ('multinomialnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
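For readers who want to see what a multinomial naive Bayes classifier computes, the following is a simplified sketch in plain Python with Laplace smoothing (alpha=1.0, the default shown in the output above). The function names train_nb and predict_nb and the four toy reviews are hypothetical; this is not sklearn's implementation, only the same idea on a tiny scale.

```python
import math
from collections import Counter


def train_nb(docs, labels, alpha=1.0):
    """Fit log priors and smoothed log likelihoods from space-separated docs."""
    vocabulary = sorted({tok for doc in docs for tok in doc.split()})
    class_totals = Counter(labels)
    token_counts = {c: Counter() for c in class_totals}
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc.split())
    # Prior: share of training documents in each class.
    log_prior = {c: math.log(n / len(labels)) for c, n in class_totals.items()}
    log_likelihood = {}
    for c, counts in token_counts.items():
        # Laplace smoothing: add alpha to every term count.
        denom = sum(counts.values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            t: math.log((counts[t] + alpha) / denom) for t in vocabulary
        }
    return log_prior, log_likelihood


def predict_nb(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior; unseen tokens are skipped."""
    scores = {}
    for c in log_prior:
        known = [t for t in doc.split() if t in log_likelihood[c]]
        scores[c] = log_prior[c] + sum(log_likelihood[c][t] for t in known)
    return max(scores, key=scores.get)


# Toy, already segmented reviews (assumed): 1 = positive, 0 = negative.
docs = ["好吃 推荐", "好吃 环境 不错", "难吃 失望", "排队 失望 难吃"]
labels = [1, 1, 0, 0]
log_prior, log_likelihood = train_nb(docs, labels)

print(predict_nb("好吃 不错", log_prior, log_likelihood))
print(predict_nb("难吃 失望", log_prior, log_likelihood))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word that never appeared in one class from forcing that class's probability to zero.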
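Test set prediction results
y_pred = pipe.predict(X_test.cut_comment)  # predict on the test set (the pipeline handles both the transform and the prediction)
# accuracy of the model on the test set
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred)
0.82929936305732488
# confusion matrix of the model on the test set
metrics.confusion_matrix(y_test, y_pred)
# predictions on the test set: 474 true positives, 112 false positives, 22 false negatives, and 177 true negatives
array([[177, 112],
       [ 22, 474]], dtype=int64)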
import matplotlib.pyplot as plt  # imported at module level so plt.show() below also works

def get_confusion_matrix(conf, clas):
    # draw the confusion matrix as a small heat map with one count per cell
    fig, ax = plt.subplots(figsize=(2.5, 2.5))
    ax.matshow(conf, cmap=plt.cm.Blues, alpha=0.3)
    tick_marks = np.arange(len(clas))
    plt.xticks(tick_marks, clas, rotation=45)
    plt.yticks(tick_marks, clas)
    for i in range(conf.shape[0]):
        for j in range(conf.shape[1]):
            # row i is the true label (y axis), column j is the predicted label (x axis)
            ax.text(x=j, y=i, s=conf[i, j],
                    va='center',
                    ha='center')
    plt.xlabel("predict_label")
    plt.ylabel("true label")

conf = metrics.confusion_matrix(y_test, y_pred)
class_names = np.array(['0', '1'])
get_confusion_matrix(np.array(conf), clas=class_names)
plt.show()

Predict on the entire dataset
y_pred_all = pipe.predict(X['cut_comment'])
metrics.accuracy_score(y, y_pred_all)  # accuracy on the entire sample set; it is higher than on the test set, which suggests some overfitting
0.85659655831739967
metrics.confusion_matrix(y, y_pred_all)  # confusion matrix for the entire dataset
array([[ 801,  369],
       [  81, 1887]], dtype=int64)
y.value_counts()  # number of positive and negative samples in the full sample set
1    1968
0    1170
Name: sentiment, dtype: int64
metrics.f1_score(y_true=y, y_pred=y_pred_all)  # F1 score of the model on the entire dataset
0.89346590909090906
metrics.recall_score(y, y_pred_all)  # recall: the share of all positive samples that are detected, TP / (FN + TP)
0.95884146341463417
metrics.precision_score(y, y_pred_all)  # precision: the share of predicted positives that are truly positive, TP / (FP + TP)
0.83643617021276595
print(metrics.classification_report(y, y_pred_all))  # classification report
             precision    recall  f1-score   support

          0       0.91      0.68      0.78      1170
          1       0.84      0.96      0.89      1968

avg / total       0.86      0.86      0.85      3138
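The scores reported for the full dataset can be recomputed by hand from its confusion matrix. The sketch below uses only the four counts printed above and assumes sklearn's layout, where rows are true classes and columns are predicted classes.

```python
# Confusion matrix for the entire dataset, copied from the output above.
# Rows: true class 0 / 1. Columns: predicted class 0 / 1.
conf = [[801, 369],
        [81, 1887]]

tn, fp = conf[0]
fn, tp = conf[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted positives, how many are right
recall = tp / (tp + fn)      # of the actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
```

The high recall and lower precision for class 1 reflect the 369 negative reviews that the model labels as positive.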
