The Chinese text preprocessing pipeline (a step-by-step walkthrough)

Source: Internet
Author: 炼己者

Tags: Chinese text preprocessing
---
Welcome to visit my blog; if the formatting there doesn't suit you, the same posts are also published on my Jianshu (简书) page.
Everything on this blog is for study, research, and sharing. If you need to repost anything, please contact me, credit the author and source, and keep it non-commercial. Thank you!

Summary
  • My understanding of machine learning: turn all kinds of raw material into something a machine can understand, then let machine learning algorithms do the work. And what can a machine understand? -- vectors. So whether the input is an image or a piece of text, it has to be turned into a vector before an algorithm can process it.
  • Most material on the internet deals with English text. This article takes Chinese text as the example and preprocesses raw text into text vectors.
Directory
  • Remove specified useless symbols
  • Keep only Chinese characters
  • Segment the text with jieba
  • Remove stop words
  • Convert the text to TF-IDF vectors and feed them to an algorithm

Operation Flow

1. Remove specified useless symbols

The text we get sometimes contains lots of spaces, or other symbols you don't want; this method removes any symbol you choose. Here I take spaces as the example.

content = ['  欢迎来到  炼己者的博客','炼己者     带你入门NLP  ']

# Remove the spaces from the text
def process(our_data):
    m1 = map(lambda s: s.replace(' ', ''), our_data)
    return list(m1)

print(process(content))

The parameter our_data is a list, and every space in its strings gets removed. Look at the output: all the spaces are gone.

['欢迎来到炼己者的博客', '炼己者带你入门NLP']
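The same idea extends beyond spaces. Here is a minimal sketch that strips several unwanted symbols in one pass; the symbol set is just an assumed example, not from the original code.

content = ['  欢迎来到  炼己者的博客!!!', '炼己者     带你入门NLP??']

# Strip each unwanted symbol in turn (this symbol set is an assumed example)
def remove_symbols(our_data, symbols=(' ', '!', '?')):
    result = []
    for s in our_data:
        for sym in symbols:
            s = s.replace(sym, '')
        result.append(s)
    return result

print(remove_symbols(content))  # ['欢迎来到炼己者的博客', '炼己者带你入门NLP']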
2. Keep only Chinese characters

This is the operation I like best: it removes every non-Chinese symbol, including digits, punctuation, letters, and so on.

content = ['如果这篇文章对你有所帮助,那就点个赞呗!!!','如果想联系炼己者的话,那就打电话:110!!!','想学习NLP,那就来关注呀!^-^']

# Keep only the Chinese characters in the text
def is_chinese(uchar):
    if uchar >= u'\u4e00' and uchar <= u'\u9fa5':
        return True
    else:
        return False

def format_str(content):
    content_str = ''
    for i in content:
        if is_chinese(i):
            content_str = content_str + i
    return content_str

# The function is passed one sentence at a time
chinese_list = []
for line in content:
    chinese_list.append(format_str(line))

print(chinese_list)

Looking at the output, you will find that only the Chinese is left. This operation is really slick.

['如果这篇文章对你有所帮助那就点个赞呗', '如果想联系炼己者的话那就打电话', '想学习那就来关注呀']
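If you prefer, the same effect takes a single regular expression; here is a sketch using the standard-library re module, with the same \u4e00-\u9fa5 range as above.

import re

# Drop everything outside the CJK range \u4e00-\u9fa5 in one substitution
def keep_chinese(text):
    return re.sub(r'[^\u4e00-\u9fa5]', '', text)

print(keep_chinese('如果这篇文章对你有所帮助,那就点个赞呗!!!'))
# 如果这篇文章对你有所帮助那就点个赞呗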
3. Segment the text with jieba

First you have to install the jieba library; a plain pip install jieba will do.
Let's use the sentences we cleaned up above as the example.

chinese_list = ['如果这篇文章对你有所帮助那就点个赞呗', '如果想联系炼己者的话那就打电话', '想学习那就来关注呀']

# Segment the text with jieba
import jieba

def fenci(datas):
    cut_words = map(lambda s: list(jieba.cut(s)), datas)
    return list(cut_words)

print(fenci(chinese_list))

And then you get the segmentation results.

[['如果', '这', '篇文章', '对', '你', '有所', '帮助', '那', '就', '点个', '赞', '呗'], ['如果', '想', '联系', '炼己', '者', '的话', '那', '就', '打电话'], ['想', '学习', '那', '就', '来', '关注', '呀']]
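jieba also offers other segmentation modes besides the default precise mode used above; a quick sketch (the exact output can vary with your jieba version and dictionary):

import jieba

s = '想学习NLP那就来关注呀'
print(jieba.lcut(s))                # precise mode; lcut returns a list directly
print(jieba.lcut(s, cut_all=True))  # full mode: lists every possible word
print(jieba.lcut_for_search(s))     # search-engine mode: finer-grained segments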
4. Remove stop words

First you have to download a stop-word list from the internet, or you can follow my WeChat public account
Zhangyhpico and reply "stop-word list" (停用词表) to get one. Then turn that file into a Python list; a sketch for doing so is shown below.
To keep things easy to follow, I will assume a small stop-word list here, and we take the segmented data from above as the example.
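If you did download a stop-word file, turning it into a list takes a couple of lines; a minimal sketch, assuming a UTF-8 file with one word per line (the name stopwords.txt is a placeholder, not a file from this post):

# Load a downloaded stop-word file into a list, one word per line
# ('stopwords.txt' is a placeholder name)
with open('stopwords.txt', encoding='utf-8') as f:
    stopwords = [line.strip() for line in f if line.strip()]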

# The segmented data
fenci_list = [['如果', '这', '篇文章', '对', '你', '有所', '帮助', '那', '就', '点个', '赞', '呗'], ['如果', '想', '联系', '炼己', '者', '的话', '那', '就', '打电话'], ['想', '学习', '那', '就', '来', '关注', '呀']]

# The stop-word list
stopwords = ['的','呀','这','那','就','的话','如果']

# Remove the stop words from the text
def drop_stopwords(contents, stopwords):
    contents_clean = []
    for line in contents:
        line_clean = []
        for word in line:
            if word in stopwords:
                continue
            line_clean.append(word)
        contents_clean.append(line_clean)
    return contents_clean

print(drop_stopwords(fenci_list, stopwords))

Take a look: comparing input and output, the stop words are gone.

[['篇文章', '对', '你', '有所', '帮助', '点个', '赞', '呗'], ['想', '联系', '炼己', '者', '打电话'], ['想', '学习', '来', '关注']]

The same operation can also remove symbols you don't want: add the unwanted symbol to the stop-word list and it will be dropped as well.

5. Convert the text to TF-IDF vectors and feed them to an algorithm

For this last step you can refer to my earlier article on computing TF-IDF values with different methods.

But for the sake of completeness, I will show the process here again, taking the stop-word-free data from above.

word_list = [['篇文章', '对', '你', '有所', '帮助', '点个', '赞', '呗'], ['想', '联系', '炼己', '者', '打电话'], ['想', '学习', '来', '关注']]

from gensim import corpora, models

# Build a dictionary and a bag-of-words corpus from the tokenized documents
dictionary = corpora.Dictionary(word_list)
new_corpus = [dictionary.doc2bow(text) for text in word_list]

# Train the TF-IDF model on the bag-of-words corpus
tfidf = models.TfidfModel(new_corpus)

# Convert each document's bag-of-words vector to its TF-IDF vector
tfidf_vec = []
for tokens in word_list:
    bow = dictionary.doc2bow(tokens)
    tfidf_vec.append(tfidf[bow])

print(tfidf_vec)

Here we get the TF-IDF vectors. This is the gensim way of computing them; you can also call the sklearn library directly, as introduced in the article mentioned above. Let's see what the TF-IDF vectors look like.

[[(0, 0.35355339059327373), (1, 0.35355339059327373), (2, 0.35355339059327373), (3, 0.35355339059327373), (4, 0.35355339059327373), (5, 0.35355339059327373), (6, 0.35355339059327373), (7, 0.35355339059327373)],
 [(8, 0.18147115159841573), (9, 0.49169813431045906), (10, 0.49169813431045906), (11, 0.49169813431045906), (12, 0.49169813431045906)],
 [(8, 0.2084041054460164), (13, 0.5646732768699807), (14, 0.5646732768699807), (15, 0.5646732768699807)]]

Obviously, the sentences have different lengths, so the TF-IDF vectors have different dimensions. How do we handle that? -- You can use LSI vectors to give every vector the same dimensionality.

# The num_topics parameter specifies the output dimensionality
lsi_model = models.LsiModel(corpus=tfidf_vec, id2word=dictionary, num_topics=2)

lsi_vec = []
for tokens in word_list:
    bow = dictionary.doc2bow(tokens)
    lsi_vec.append(lsi_model[bow])

print(lsi_vec)

Look at the results.

[[(1, 2.8284271247461907)], [(0, 1.6357709481422218)], [(0, 1.4464385059387106)]]

From here the machine learning algorithms in the sklearn library take over; you just call the algorithm packages. But sklearn requires the data to be in array format, so we have to convert the gensim-style sparse vectors into arrays. Follow the steps below.

from scipy.sparse import csr_matrix

data = []
rows = []
cols = []
line_count = 0
for line in lsi_vec:
    for elem in line:
        rows.append(line_count)
        cols.append(elem[0])
        data.append(elem[1])
    line_count += 1

lsi_sparse_matrix = csr_matrix((data, (rows, cols)))  # sparse matrix
lsi_matrix = lsi_sparse_matrix.toarray()  # dense matrix
print(lsi_matrix)

The result looks like this.

array([[0.        , 2.82842712],
       [1.63577095, 0.        ],
       [1.44643851, 0.        ]])
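Now that lsi_matrix is a dense array, any sklearn estimator will accept it. As a small illustration (the original post doesn't name a specific algorithm; KMeans is my own stand-in here), clustering the three documents:

from sklearn.cluster import KMeans

# lsi_matrix is the dense array produced above (3 documents x 2 topics)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(lsi_matrix)
print(labels)  # one cluster label per document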

Our goal is achieved. Some people will surely ask: why not just call sklearn's TF-IDF method directly? That would be more convenient and more direct. Why all this converting back and forth?

There is a reason for it. Suppose you have a huge amount of data, say millions of documents; the TF-IDF vectors computed with sklearn would then have an enormous dimensionality, and the machine learning algorithm package you call at the end would fail. If you instead compute the TF-IDF vectors with gensim and follow the method above, you can reduce the dimensionality of the vectors, and even choose the dimensionality yourself: in the LSI step, the num_topics parameter sets the dimension.
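For reference, here is what the direct sklearn route looks like: a minimal sketch with TfidfVectorizer, joining the tokens back into strings since sklearn works on raw text. This code is my own illustration, not from the original post; token_pattern is loosened so single-character tokens are not dropped by sklearn's default filter.

from sklearn.feature_extraction.text import TfidfVectorizer

# word_list is the stop-word-free token data from above
word_list = [['篇文章', '对', '你', '有所', '帮助', '点个', '赞', '呗'], ['想', '联系', '炼己', '者', '打电话'], ['想', '学习', '来', '关注']]
docs = [' '.join(tokens) for tokens in word_list]

# Loosen token_pattern so single-character tokens survive
vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')
tfidf_matrix = vectorizer.fit_transform(docs)
print(tfidf_matrix.toarray())  # already a dense array, one row per document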

Summary

That is the whole Chinese text preprocessing pipeline; this process can cope with most text processing tasks. Once the text has been converted to vectors, the rest is easy: call an sklearn algorithm package, or write a machine learning algorithm yourself; either way, there is a clear method to follow.

I hope this helps. If you think this article was of some help to you, give it a like to show support! If there is any problem, you can also comment below the article and we will work it out together.

