The Chinese text preprocessing pipeline (a step-by-step walkthrough)

Source: Internet
Author: 炼己者

Tags: Chinese text preprocessing
---
Welcome to visit my blog; if the formatting there doesn't suit you, the same posts are also published on my Jianshu (简书) page.
Everything on this blog is for study, research, and sharing. If you need to repost anything, please contact me, credit the author and source, and keep it non-commercial. Thank you!

Summary
  • My understanding of machine learning: turn all kinds of raw material into something a machine can understand, then let machine learning algorithms do the work. And what can a machine understand? -- vectors. So whether the input is an image or a piece of text, it has to be turned into a vector before an algorithm can process it.
  • Most material on the internet deals with English text. This article takes Chinese text as the example and preprocesses raw text into text vectors.
Directory
  • Remove specified useless symbols
  • Keep only Chinese characters
  • Segment the text with jieba
  • Remove stop words
  • Convert the text to TF-IDF vectors and feed them to an algorithm

Operation Flow

1. Remove specified useless symbols

The text we get sometimes contains lots of spaces, or other symbols you don't want; this method removes any symbol you choose. Here I take spaces as the example.

content = ['  欢迎来到  炼己者的博客','炼己者     带你入门NLP  ']

# Remove the spaces from the text
def process(our_data):
    m1 = map(lambda s: s.replace(' ', ''), our_data)
    return list(m1)

print(process(content))

The parameter our_data is a list, and every space in its strings gets removed. Look at the output: all the spaces are gone.

['欢迎来到炼己者的博客', '炼己者带你入门NLP']
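The same idea extends beyond spaces. Here is a minimal sketch that strips several unwanted symbols in one pass; the symbol set is just an assumed example, not from the original code.

content = ['  欢迎来到  炼己者的博客!!!', '炼己者     带你入门NLP??']

# Strip each unwanted symbol in turn (this symbol set is an assumed example)
def remove_symbols(our_data, symbols=(' ', '!', '?')):
    result = []
    for s in our_data:
        for sym in symbols:
            s = s.replace(sym, '')
        result.append(s)
    return result

print(remove_symbols(content))  # ['欢迎来到炼己者的博客', '炼己者带你入门NLP']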
2. Keep only Chinese characters

This is the operation I like best: it removes every non-Chinese symbol, including digits, punctuation, letters, and so on.

content = ['如果这篇文章对你有所帮助,那就点个赞呗!!!','如果想联系炼己者的话,那就打电话:110!!!','想学习NLP,那就来关注呀!^-^']

# Keep only the Chinese characters in the text
def is_chinese(uchar):
    if uchar >= u'\u4e00' and uchar <= u'\u9fa5':
        return True
    else:
        return False

def format_str(content):
    content_str = ''
    for i in content:
        if is_chinese(i):
            content_str = content_str + i
    return content_str

# The function is passed one sentence at a time
chinese_list = []
for line in content:
    chinese_list.append(format_str(line))

print(chinese_list)

Looking at the output, you will find that only the Chinese is left. This operation is really slick.

['如果这篇文章对你有所帮助那就点个赞呗', '如果想联系炼己者的话那就打电话', '想学习那就来关注呀']
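If you prefer, the same effect takes a single regular expression; here is a sketch using the standard-library re module, with the same \u4e00-\u9fa5 range as above.

import re

# Drop everything outside the CJK range \u4e00-\u9fa5 in one substitution
def keep_chinese(text):
    return re.sub(r'[^\u4e00-\u9fa5]', '', text)

print(keep_chinese('如果这篇文章对你有所帮助,那就点个赞呗!!!'))
# 如果这篇文章对你有所帮助那就点个赞呗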
3. Segment the text with jieba

First you have to install the jieba library; a plain pip install jieba will do.
Let's use the sentences we cleaned up above as the example.

chinese_list = ['如果这篇文章对你有所帮助那就点个赞呗', '如果想联系炼己者的话那就打电话', '想学习那就来关注呀']

# Segment the text with jieba
import jieba

def fenci(datas):
    cut_words = map(lambda s: list(jieba.cut(s)), datas)
    return list(cut_words)

print(fenci(chinese_list))

And then you get the segmentation results.

[['如果', '这', '篇文章', '对', '你', '有所', '帮助', '那', '就', '点个', '赞', '呗'], ['如果', '想', '联系', '炼己', '者', '的话', '那', '就', '打电话'], ['想', '学习', '那', '就', '来', '关注', '呀']]
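jieba also offers other segmentation modes besides the default precise mode used above; a quick sketch (the exact output can vary with your jieba version and dictionary):

import jieba

s = '想学习NLP那就来关注呀'
print(jieba.lcut(s))                # precise mode; lcut returns a list directly
print(jieba.lcut(s, cut_all=True))  # full mode: lists every possible word
print(jieba.lcut_for_search(s))     # search-engine mode: finer-grained segments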
4. Remove stop words

First you have to download a stop-word list from the internet, or you can follow my WeChat public account
Zhangyhpico and reply "stop-word list" (停用词表) to get one. Then turn that file into a Python list; a sketch for doing so is shown below.
To keep things easy to follow, I will assume a small stop-word list here, and we take the segmented data from above as the example.
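If you did download a stop-word file, turning it into a list takes a couple of lines; a minimal sketch, assuming a UTF-8 file with one word per line (the name stopwords.txt is a placeholder, not a file from this post):

# Load a downloaded stop-word file into a list, one word per line
# ('stopwords.txt' is a placeholder name)
with open('stopwords.txt', encoding='utf-8') as f:
    stopwords = [line.strip() for line in f if line.strip()]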

# The segmented data
fenci_list = [['如果', '这', '篇文章', '对', '你', '有所', '帮助', '那', '就', '点个', '赞', '呗'], ['如果', '想', '联系', '炼己', '者', '的话', '那', '就', '打电话'], ['想', '学习', '那', '就', '来', '关注', '呀']]

# The stop-word list
stopwords = ['的','呀','这','那','就','的话','如果']

# Remove the stop words from the text
def drop_stopwords(contents, stopwords):
    contents_clean = []
    for line in contents:
        line_clean = []
        for word in line:
            if word in stopwords:
                continue
            line_clean.append(word)
        contents_clean.append(line_clean)
    return contents_clean

print(drop_stopwords(fenci_list, stopwords))

Take a look: comparing input and output, the stop words are gone.

[['篇文章', '对', '你', '有所', '帮助', '点个', '赞', '呗'], ['想', '联系', '炼己', '者', '打电话'], ['想', '学习', '来', '关注']]

The same operation can also remove symbols you don't want: add the unwanted symbol to the stop-word list and it will be dropped as well.

5. Convert the text to TF-IDF vectors and feed them to an algorithm

For this last step you can refer to my earlier article on computing TF-IDF values with different methods.

But for the sake of completeness, I will show the process here again, taking the stop-word-free data from above.

word_list = [['篇文章', '对', '你', '有所', '帮助', '点个', '赞', '呗'], ['想', '联系', '炼己', '者', '打电话'], ['想', '学习', '来', '关注']]

from gensim import corpora, models

# Build a dictionary and a bag-of-words corpus from the tokenized documents
dictionary = corpora.Dictionary(word_list)
new_corpus = [dictionary.doc2bow(text) for text in word_list]

# Train the TF-IDF model on the bag-of-words corpus
tfidf = models.TfidfModel(new_corpus)

# Convert each document's bag-of-words vector to its TF-IDF vector
tfidf_vec = []
for tokens in word_list:
    bow = dictionary.doc2bow(tokens)
    tfidf_vec.append(tfidf[bow])

print(tfidf_vec)

Here we get the TF-IDF vectors. This is the gensim way of computing them; you can also call the sklearn library directly, as introduced in the article mentioned above. Let's see what the TF-IDF vectors look like.

[[(0, 0.35355339059327373), (1, 0.35355339059327373), (2, 0.35355339059327373), (3, 0.35355339059327373), (4, 0.35355339059327373), (5, 0.35355339059327373), (6, 0.35355339059327373), (7, 0.35355339059327373)],
 [(8, 0.18147115159841573), (9, 0.49169813431045906), (10, 0.49169813431045906), (11, 0.49169813431045906), (12, 0.49169813431045906)],
 [(8, 0.2084041054460164), (13, 0.5646732768699807), (14, 0.5646732768699807), (15, 0.5646732768699807)]]

Obviously, the sentences have different lengths, so the TF-IDF vectors have different dimensions. How do we handle that? -- You can use LSI vectors to give every vector the same dimensionality.

# The num_topics parameter specifies the output dimensionality
lsi_model = models.LsiModel(corpus=tfidf_vec, id2word=dictionary, num_topics=2)

lsi_vec = []
for tokens in word_list:
    bow = dictionary.doc2bow(tokens)
    lsi_vec.append(lsi_model[bow])

print(lsi_vec)

Look at the results.

[[(1, 2.8284271247461907)], [(0, 1.6357709481422218)], [(0, 1.4464385059387106)]]

From here the machine learning algorithms in the sklearn library take over; you just call the algorithm packages. But sklearn requires the data to be in array format, so we have to convert the gensim-style sparse vectors into arrays. Follow the steps below.

from scipy.sparse import csr_matrix

data = []
rows = []
cols = []
line_count = 0
for line in lsi_vec:
    for elem in line:
        rows.append(line_count)
        cols.append(elem[0])
        data.append(elem[1])
    line_count += 1

lsi_sparse_matrix = csr_matrix((data, (rows, cols)))  # sparse matrix
lsi_matrix = lsi_sparse_matrix.toarray()  # dense matrix
print(lsi_matrix)

The result looks like this.

array([[0.        , 2.82842712],
       [1.63577095, 0.        ],
       [1.44643851, 0.        ]])
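Now that lsi_matrix is a dense array, any sklearn estimator will accept it. As a small illustration (the original post doesn't name a specific algorithm; KMeans is my own stand-in here), clustering the three documents:

from sklearn.cluster import KMeans

# lsi_matrix is the dense array produced above (3 documents x 2 topics)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(lsi_matrix)
print(labels)  # one cluster label per document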

Our goal is achieved. Some people will surely ask: why not just call sklearn's TF-IDF method directly? That would be more convenient and more direct. Why all this converting back and forth?

There is a reason for it. Suppose you have a huge amount of data, say millions of documents; the TF-IDF vectors computed with sklearn would then have an enormous dimensionality, and the machine learning algorithm package you call at the end would fail. If you instead compute the TF-IDF vectors with gensim and follow the method above, you can reduce the dimensionality of the vectors, and even choose the dimensionality yourself: in the LSI step, the num_topics parameter sets the dimension.
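For reference, here is what the direct sklearn route looks like: a minimal sketch with TfidfVectorizer, joining the tokens back into strings since sklearn works on raw text. This code is my own illustration, not from the original post; token_pattern is loosened so single-character tokens are not dropped by sklearn's default filter.

from sklearn.feature_extraction.text import TfidfVectorizer

# word_list is the stop-word-free token data from above
word_list = [['篇文章', '对', '你', '有所', '帮助', '点个', '赞', '呗'], ['想', '联系', '炼己', '者', '打电话'], ['想', '学习', '来', '关注']]
docs = [' '.join(tokens) for tokens in word_list]

# Loosen token_pattern so single-character tokens survive
vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')
tfidf_matrix = vectorizer.fit_transform(docs)
print(tfidf_matrix.toarray())  # already a dense array, one row per document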

Summary

That is the whole Chinese text preprocessing pipeline; this process can cope with most text processing tasks. Once the text has been converted to vectors, the rest is easy: call an sklearn algorithm package, or write a machine learning algorithm yourself; either way, there is a clear method to follow.

I hope this helps. If you think this article was of some help to you, give it a like to show support! If there is any problem, you can also comment below the article and we will work it out together.

