Meituan Shop Review Natural Language Processing and Classification (NLP)
- The first post in this series covered data analysis.
- The second post covered visualization.
- This third post covers text classification.
- The main packages used are jieba, sklearn and pandas. This post relies on the bag-of-words model: each document is converted into a numerical feature vector, most of whose entries are zero; the non-zero values are raw term frequencies (TF), so the resulting matrix is a sparse matrix. A minimal sketch of the idea follows this list.
- Further algorithm models will be built in follow-up posts.
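As a rough illustration of the bag-of-words representation described above, here is a minimal sketch on two made-up, pre-segmented sentences (not taken from the Meituan data); it uses the same CountVectorizer that appears later in this post.

from sklearn.feature_extraction.text import CountVectorizer

# two toy "documents" that are already whitespace-tokenized,
# just like the jieba-cut comments used later in this post
docs = ["烤鸭 好吃 好吃 推荐", "羊肉串 环境 一般"]

toy_vect = CountVectorizer()            # default settings keep tokens of two or more characters
dtm = toy_vect.fit_transform(docs)      # sparse document-term matrix
print(toy_vect.get_feature_names())     # the learned vocabulary
print(dtm.toarray())                    # raw term frequencies (TF), mostly zeros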
Import the common data analysis libraries
import pandas as pd
import numpy as np
df = pd.read_excel("all_data_meituan.xlsx")[["comment", "star"]]
df.head()
- View the size of the DataFrame
df.shape
(17400, 2)
df['sentiment'] = df['star'].apply(lambda x: 1 if x > 30 else 0)  # star > 30 counts as a positive review
df = df.drop_duplicates()  # drop duplicate comments, leaving 1406 texts; the data is then copied to three times its original size below
df = df.dropna()
X = pd.concat([df[['comment']], df[['comment']], df[['comment']]])
y = pd.concat([df.sentiment, df.sentiment, df.sentiment])
X.columns = ['comment']
X = X.reset_index(drop=True)  # rebuild a clean 0..n-1 index after concatenation
X.shape
(3138, 1)
import jieba  # import the Chinese word-segmentation library

def chinese_word_cut(mytext):
    return " ".join(jieba.cut(mytext))  # segment the text and join the tokens with spaces

X['cut_comment'] = X["comment"].apply(chinese_word_cut)
X['cut_comment'].head()
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\HUANG_~1\AppData\Local\Temp\jieba.cache
Loading model cost 0.880 seconds.
Prefix dict has been built succesfully.
0    还行 吧 , 建议 不要 排队 那个 烤鸭 和 羊肉串 , 因为 烤肉 时间 本来 就 不够...
1    去过 好 几次 了 东西 还是 老 样子 没 增添 什么 新花样 环境 倒 是 ...
2    一个 字 : 好 ! ! ! # 羊肉串 # # 五花肉 # # 牛舌 # ...
3    第一次 来 吃 , 之前 看过 好多 推荐 说 这个 好吃 , 真的 抱 了 好 大 希望 ...
4    羊肉串 真的 不太 好吃 , 那种 说 膻 不 膻 说 臭 不 臭 的 味 。 烤鸭 还 行...
Name: cut_comment, dtype: object
- Import the data-splitting helper from sklearn and set the test set size; shuffling defaults to True.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.25)
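Since the positive and negative classes are imbalanced, an optional variant (not used in the rest of this post) is to pass stratify so that both splits keep roughly the same class ratio:

# optional alternative split that preserves the class proportions (illustrative only)
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, random_state=42, test_size=0.25, stratify=y)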
def get_custom_stopwords(stop_words_file):
    with open(stop_words_file, encoding="utf-8") as f:
        custom_stopwords_list = [i.strip() for i in f.readlines()]
    return custom_stopwords_list
stop_words_file = "stopwords.txt"
stopwords = get_custom_stopwords(stop_words_file)  # load the stop-word list
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()  # instantiate the vectorizer
vect  # inspect the default parameters
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
# dir(vect)  # list the attributes of vect
- Fit and transform the segmented training text; the resulting sparse matrix has size 2353 x 1965.
vect.fit_transform(X_train["cut_comment"])
<2353x1965 sparse matrix of type '<class 'numpy.int64'>' with 20491 stored elements in Compressed Sparse Row format>
vect.fit_transform(X_train["cut_comment"]).toarray().shape
(2353, 1965)
# convert the document-term matrix into a DataFrame and inspect the first 25 columns
pd.DataFrame(vect.fit_transform(X_train["cut_comment"]).toarray(),
             columns=vect.get_feature_names()).iloc[:, 0:25].head()
# print(vect.get_feature_names())
# 1965 features, which is not very large (stop words not removed yet)
- There are many purely numeric and otherwise meaningless features, so the vectorizer is re-instantiated with a regular expression that drops tokens starting with a digit, and with the stop-word list passed in.
vect = CountVectorizer(token_pattern=u'(?u)\\b[^\\d\\W]\\w+\\b',
                       stop_words=frozenset(stopwords))  # remove stop words; the pattern drops tokens that start with a digit
pd.DataFrame(vect.fit_transform(X_train['cut_comment']).toarray(),
             columns=vect.get_feature_names()).head()
# 1691 columns: dropping the numeric features removes nearly three hundred columns, from 1965 down to 1691
# max_df = 0.8  # drop tokens that appear in more than this fraction of documents (too common); adjust as needed
# min_df = 3    # drop tokens that appear in fewer than this many documents (too rare); adjust as needed
- Result after removing the numeric features.
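The max_df / min_df parameters mentioned in the comments above are not applied in this post; a hedged sketch of how they could be combined with the same token pattern and stop-word list (the thresholds are illustrative):

vect_pruned = CountVectorizer(token_pattern=u'(?u)\\b[^\\d\\W]\\w+\\b',
                              stop_words=frozenset(stopwords),
                              max_df=0.8,  # drop tokens that appear in more than 80% of documents
                              min_df=3)    # drop tokens that appear in fewer than 3 documents
vect_pruned.fit_transform(X_train['cut_comment']).shape  # fewer than the 1691 columns reported above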
Model Building
- Import multinomial naive Bayes (MultinomialNB) from sklearn.naive_bayes.
- Naive Bayes is commonly used for text classification such as spam filtering: it is fast, and its accuracy is usually not much worse than heavier models.
- The MultinomialNB class works with its default parameters; if the prediction quality is not good enough, they can be tuned (a hedged tuning sketch follows the pipeline fit below).
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
from sklearn.pipeline import make_pipeline  # make_pipeline builds a Pipeline with auto-generated step names

pipe = make_pipeline(vect, nb)
pipe.steps  # inspect the pipeline steps (equivalent to building a Pipeline explicitly)
[('countvectorizer',
  CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=frozenset({...}),
        strip_accents=None, token_pattern='(?u)\\b[^\\d\\W]\\w+\\b',
        tokenizer=None, vocabulary=None)),
 ('multinomialnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]
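As the output shows, make_pipeline simply builds an ordinary Pipeline and names each step after its lower-cased class name, which is why the steps appear as 'countvectorizer' and 'multinomialnb'. A minimal equivalent sketch:

from sklearn.pipeline import Pipeline

# equivalent to make_pipeline(vect, nb), only with the step names written out explicitly
pipe = Pipeline([
    ('countvectorizer', vect),  # step 1: bag-of-words vectorizer
    ('multinomialnb', nb),      # step 2: multinomial naive Bayes classifier
])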
pipe.fit(X_train.cut_comment, y_train)
Pipeline(memory=None, steps=[('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict', dtype=<class 'numpy.int64'>, encoding='utf-8', input='content', lowercase=True, max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None, stop_words=...e, vocabulary=None)), ('multinomialnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
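As mentioned in the model-building notes, if the default MultinomialNB settings are not accurate enough, the smoothing parameter alpha can be tuned on the training set. A hedged sketch using GridSearchCV (the grid values are illustrative):

from sklearn.model_selection import GridSearchCV

# search over the smoothing strength of the naive Bayes step inside the pipeline
param_grid = {'multinomialnb__alpha': [0.1, 0.5, 1.0, 2.0]}  # illustrative values
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_train.cut_comment, y_train)
print(search.best_params_, search.best_score_)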
Test set Prediction Results
y_pred = pipe.predict(X_test.cut_comment)  # predict on the test set (this covers both the transform and the prediction step)
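Because sklearn's Pipeline fits its steps in place, the same prediction can also be written as an explicit transform-then-predict, which is what the comment above means; a sketch for illustration only:

# equivalent to pipe.predict: vectorize the test comments, then classify them
y_pred_manual = nb.predict(vect.transform(X_test.cut_comment))
(y_pred_manual == y_pred).all()  # should be True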
# accuracy of the model on the test set
from sklearn import metrics

metrics.accuracy_score(y_test, y_pred)
0.82929936305732488
# confusion matrix of the model on the test set
metrics.confusion_matrix(y_test, y_pred)
# prediction results on the test set: 474 true positives, 112 false positives, 22 false negatives, 177 true negatives
array([[177, 112], [ 22, 474]], dtype=int64)
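In sklearn's convention the rows of the confusion matrix are the true labels and the columns the predicted labels, so the four counts in the comment can be read off directly (a minimal sketch):

# ravel() flattens the 2x2 matrix in row-major order: TN, FP, FN, TP
tn, fp, fn, tp = metrics.confusion_matrix(y_test, y_pred).ravel()
print(tn, fp, fn, tp)  # 177, 112, 22, 474 for this split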
import matplotlib.pyplot as plt

def get_confusion_matrix(conf, clas):
    """Plot a confusion matrix with the class labels on both axes."""
    fig, ax = plt.subplots(figsize=(2.5, 2.5))
    ax.matshow(conf, cmap=plt.cm.Blues, alpha=0.3)
    tick_marks = np.arange(len(clas))
    plt.xticks(tick_marks, clas, rotation=45)
    plt.yticks(tick_marks, clas)
    for i in range(conf.shape[0]):
        for j in range(conf.shape[1]):
            # x is the column (predicted label), y is the row (true label)
            ax.text(x=j, y=i, s=conf[i, j], va='center', ha='center')
    plt.xlabel("predicted label")
    plt.ylabel("true label")
conf = metrics.confusion_matrix(y_test, y_pred)
class_names = np.array(['0', '1'])
get_confusion_matrix(np.array(conf), clas=class_names)
plt.show()
Prediction on the Entire Dataset
y_pred_all = pipe.predict(X['cut_comment'])
metrics.accuracy_score(y, y_pred_all)
# accuracy on the full dataset; it is higher than on the test set, which suggests some overfitting
0.85659655831739967
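A less optimistic estimate than scoring data the model has already seen is k-fold cross-validation of the whole pipeline (a hedged sketch; the fold count is illustrative, and note that because the comments were triplicated earlier, duplicates can still leak across folds):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy of the vectorize + naive Bayes pipeline
scores = cross_val_score(pipe, X['cut_comment'], y, cv=5, scoring='accuracy')
print(scores.mean())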
metrics.confusion_matrix(y, y_pred_all)  # confusion matrix for the whole dataset
array([[ 801, 369], [ 81, 1887]], dtype=int64)
y.value_counts()  # number of positive and negative samples in the original data
1    1968
0    1170
Name: sentiment, dtype: int64
metrics.f1_score(y_true=y, y_pred=y_pred_all)  # F1 score of the model on the whole dataset
0.89346590909090906
metrics.recall_score(y, y_pred_all)  # recall: the fraction of all positive samples that are detected, TP / (FN + TP)
0.95884146341463417
metrics.precision_score(y, y_pred_all)  # precision: the fraction of predicted positives that are truly positive, TP / (FP + TP)
0.83643617021276595
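As a sanity check, the recall, precision and F1 values above can be recomputed by hand from the whole-dataset confusion matrix printed earlier (a minimal sketch):

# counts from metrics.confusion_matrix(y, y_pred_all): rows are true labels, columns are predictions
tn, fp, fn, tp = 801, 369, 81, 1887

recall = tp / (fn + tp)                             # ≈ 0.9588
precision = tp / (fp + tp)                          # ≈ 0.8364
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.8935
print(recall, precision, f1)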
print(metrics.classification_report(y, y_pred_all))  # classification report
             precision    recall  f1-score   support

          0       0.91      0.68      0.78      1170
          1       0.84      0.96      0.89      1968

avg / total       0.86      0.86      0.85      3138