Using a Semi-Supervised Algorithm for Text Classification

Abstract: This post describes using a semi-supervised algorithm for text classification (binary classification), mainly following the Sklearn example that applies a semi-supervised algorithm to digit recognition. Be warned up front that this is a failed attempt: training could not get past roughly the 15,000th sample before it errored out. If your data volume is not very large, you can still follow this procedure. There is a lot worth learning here, especially about text preprocessing. I will update this post once I find a way through.

I. Operating Procedures
  • A total of about 1 million samples, of which roughly 7,000 are labeled; all the rest are labeled -1 (i.e. unlabeled)
  • Preprocess the text (jieba word segmentation, stop-word removal, etc.)
  • Convert the processed text into TFIDF vectors with the Gensim library
  • Use the SciPy library to convert the format of the TFIDF vectors so that they can be trained on by the Sklearn library's algorithm packages
  • Feed the preprocessed data into the model and train
  • Take the 500 samples the model is least certain about, label them manually, and add them back to the training set
  • Repeat the process until the classifier is good enough to meet your requirements (a minimal sketch of this loop follows the list)
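
To make that loop concrete, here is a minimal sketch of one round of it. This is not the code from this post, just an outline under its assumptions: X is the feature matrix, y uses -1 for unlabeled samples, and the uncertainty measure mirrors the Sklearn entropy trick shown in step 5 below.

import numpy as np
from scipy import stats
from sklearn.semi_supervised import LabelSpreading

def active_learning_round(X, y, n_query=500):
    """Fit on partially labeled data and return the indices to label next."""
    model = LabelSpreading()
    model.fit(X, y)                      # y: class labels, -1 = unlabeled
    # Entropy of each sample's predicted label distribution = its uncertainty
    entropies = stats.distributions.entropy(model.label_distributions_.T)
    unlabeled = np.where(np.asarray(y) == -1)[0]
    order = np.argsort(entropies)[::-1]  # most uncertain first
    query = order[np.in1d(order, unlabeled)][:n_query]
    return model, query

# Label the queried samples by hand, write the new labels into y, and repeat.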
II. Main Text

I will not repeat the jieba word segmentation and stop-word removal steps here; they are relatively straightforward (a minimal sketch follows). What I do want to share is how to convert the TFIDF vectors trained with Gensim into the format required by the Sklearn library's algorithm packages.
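
For completeness, here is a rough sketch of that preprocessing step. It assumes a stop-word file with one word per line at a hypothetical path stopwords.txt; it is not the exact code used in this post.

import jieba

# Hypothetical stop-word file: one word per line
with open('stopwords.txt', encoding='utf-8') as f:
    stopwords = set(line.strip() for line in f)

def preprocess(texts):
    """Segment each document with jieba and drop stop words."""
    word_list = []
    for text in texts:
        tokens = [w for w in jieba.cut(text) if w.strip() and w not in stopwords]
        word_list.append(tokens)
    return word_list  # a list of token lists, as expected by corpora.Dictionary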

At this point you will surely ask: why go to all this trouble instead of just calling Sklearn to compute TFIDF directly? The reason is simple: that is what I did at first, and it raised a memory error. My guess is that the vector dimension was too large, because the TFIDF vector Sklearn computed for each sentence reached more than 30,000 dimensions. So I decided to use Gensim to train the TFIDF vectors instead, which has the benefit that you can control the size of the dimension (via an LSI model, as shown in step 3 below).
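
For reference, the direct Sklearn route I tried first looks roughly like the sketch below (not my original code; texts_joined is a placeholder for the documents with their tokens joined by spaces). On a large corpus the vocabulary, and therefore the vector dimension, easily grows past 30,000, which is where the memory problem comes from.

from sklearn.feature_extraction.text import TfidfVectorizer

# One column per vocabulary term, so the dimension grows with the corpus
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts_joined)  # texts_joined: space-joined tokens
print(X.shape)                              # (n_documents, vocabulary_size)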

So how do you calculate TFIDF vectors?
You can also read this article: Calculate TFIDF values in different ways.

1. Use Gensim to train the TFIDF model and save it, so you do not have to train it again the next time you use it. Code:
from gensim import corpora, models

# Build a dictionary and a bag-of-words corpus from the tokenized documents
dictionary = corpora.Dictionary(word_list)
new_corpus = [dictionary.doc2bow(text) for text in word_list]

# Train the TFIDF model and save it for later reuse
tfidf = models.TfidfModel(new_corpus)
tfidf.save('my_model.tfidf')
2. Load the model, transform your data, and get the TFIDF vectors
tfidf = models.TfidfModel.load('my_model.tfidf')

# `words` is the list of segmented documents, each one a space-joined string of tokens
tfidf_vec = []
for i in range(len(words)):
    string = words[i]
    string_bow = dictionary.doc2bow(string.split())  # same dictionary as in step 1
    string_tfidf = tfidf[string_bow]
    tfidf_vec.append(string_tfidf)
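
Note that this loop still uses the dictionary built in step 1, so if you load the TFIDF model in a new session you need to persist the dictionary as well. A minimal sketch (the file name my_dict.dict is just an example, not from the original post):

from gensim import corpora

# Save the dictionary alongside the TFIDF model ...
dictionary.save('my_dict.dict')

# ... and reload it before transforming new text
dictionary = corpora.Dictionary.load('my_dict.dict')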
This is the TFIDF vector you get at this point:
    • Each document is a list of (term ID, TFIDF weight) tuples
    • The Sklearn algorithm packages cannot train on this format, so what do we do?
[[(0, 0.44219328927835233),
  (1, 0.5488488134902755),
  (2, 0.28062764931589196),
  (3, 0.5488488134902755),
  (4, 0.3510600763648036)],
 [(5, 0.2952063480959091),
  (6, 0.3085138762011414),
  (7, 0.269806482343891),
  (8, 0.21686460370108193),
  (9, 0.4621642239026475),
  (10, 0.5515758504022944),
  (11, 0.4242816486479956)],
 ......]
3. Use an LSI model to reduce the vectors to a specified number of dimensions
# num_topics=2 fixes each document vector at 2 dimensions
lsi_model = models.LsiModel(corpus=tfidf_vec, id2word=dictionary, num_topics=2)

lsi_vec = []
for i in range(len(words)):
    string = words[i]
    string_bow = dictionary.doc2bow(string.split())
    string_lsi = lsi_model[string_bow]  # project the document into LSI space
    lsi_vec.append(string_lsi)
This is the resulting LSI vector (each document is now 2-dimensional):
[[(0, 9.98164139346566e-06), (1, 0.00017488533996265734)],
 [(0, 0.004624808817003378), (1, 0.0052712355563472625)],
 [(0, 0.005992863818284904), (1, 0.0028891269605347066)],
 [(0, 0.008813713819377964), (1, 0.004300294830187425)],
 [(0, 0.0010709978891676652), (1, 0.004264312831567625)],
 [(0, 0.005647948200006063), (1, 0.005816420698368305)],
 [(0, 1.1749284917071102e-05), (1, 0.0003525210498926822)],
 [(0, 0.05046596444596279), (1, 0.03750969796637345)],
 [(0, 0.0007876011346475033), (1, 0.008538972615602887)],
 ......]
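
Like the TFIDF model, the trained LSI model can be saved and reloaded so the dimensionality reduction does not have to be retrained every time. A minimal sketch, with a hypothetical file name:

# Persist the trained LSI model (the file name is just an example)
lsi_model.save('my_model.lsi')

# Reload it later instead of retraining
from gensim import models
lsi_model = models.LsiModel.load('my_model.lsi')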
4. Use the SciPy module to convert the data into a format Sklearn can train on
from scipy.sparse import csr_matrix

data = []
rows = []
cols = []
line_count = 0
for line in lsi_vec:
    for elem in line:
        rows.append(line_count)  # row index = document index
        cols.append(elem[0])     # column index = LSI dimension id
        data.append(elem[1])     # value = LSI weight
    line_count += 1

lsi_sparse_matrix = csr_matrix((data, (rows, cols)))  # sparse matrix
lsi_matrix = lsi_sparse_matrix.toarray()              # dense matrix
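
A quick sanity check on the result; with num_topics=2 the dense array is only n_documents x 2 columns, which is why calling toarray() stays affordable even on a large corpus:

print(lsi_sparse_matrix.shape)  # expected: (n_documents, 2)
print(lsi_matrix.shape)         # same shape, dense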
lsi_matrix is shown below:
    • This is the format Sklearn requires.
Out[53]:
array([[9.98164139e-06, 1.74885340e-04],
       [4.62480882e-03, 5.27123556e-03],
       [5.99286382e-03, 2.88912696e-03],
       ...,
       [1.85861559e-02, 3.24888917e-01],
       [8.07737902e-04, 5.45659458e-03],
       [2.61926460e-03, 2.30210522e-02]])
5. Call Sklearn's semi-supervised algorithm to train the data
    • The code below obtains the indices of the 2,000 samples the classifier is most uncertain about; the corresponding data can then be looked up by index and re-annotated manually.
    • Add the newly labeled data back and repeat the cycle until you are satisfied.
    • To evaluate the effect, you should also hold out about 1,000 of the labeled samples, set their labels to -1 (pretending you do not know them), train, and then compare the predictions against the true labels you kept aside; that tells you how good or bad the result is (a minimal sketch of this check follows the code below).
import numpy as np
from scipy import stats
from sklearn.semi_supervised import label_propagation

# y is the label column of the DataFrame `result`: labeled samples carry their
# class, everything else is -1 (unlabeled)
y = list(result.label.values)

n_total_samples = len(y)  # 1571794
n_labeled_points = 7804   # the first 7,804 samples are labeled; only they guide training
unlabeled_indices = np.arange(n_total_samples)[n_labeled_points:]  # unlabeled data

lp_model = label_propagation.LabelSpreading()  # train the model
lp_model.fit(lsi_matrix, y)
predicted_labels = lp_model.transduction_[unlabeled_indices]  # predicted labels

# Compute the entropy of the transduced label distributions
# lp_model.label_distributions_ : array, shape = [n_samples, n_classes]
# Categorical distribution for each item
pred_entropies = stats.distributions.entropy(lp_model.label_distributions_.T)

# Pick the indices of the 2,000 samples the classifier is most uncertain about
uncertainty_index = np.argsort(pred_entropies)[::-1]
uncertainty_index = uncertainty_index[
    np.in1d(uncertainty_index, unlabeled_indices)][:2000]
print(uncertainty_index)
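
To run the evaluation check described in the bullets above, you can hold out some labeled samples, hide their labels as -1 before fitting, and compare the transduced predictions against the truth afterwards. A minimal sketch under that assumption (the held-out range of 1,000 samples is hypothetical):

from sklearn.metrics import accuracy_score, classification_report

# Hypothetically hold out the last 1,000 labeled samples: remember their true
# labels, then mark them as unlabeled (-1) before fitting
holdout = np.arange(n_labeled_points - 1000, n_labeled_points)
y_true_holdout = [y[i] for i in holdout]
y_masked = list(y)
for i in holdout:
    y_masked[i] = -1

lp_model = label_propagation.LabelSpreading()
lp_model.fit(lsi_matrix, y_masked)

# Compare the transduced labels on the held-out samples with the truth
y_pred_holdout = lp_model.transduction_[holdout]
print(accuracy_score(y_true_holdout, y_pred_holdout))
print(classification_report(y_true_holdout, y_pred_holdout))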
III. Results and Discussion

In the end I did not continue down this road: my dataset is very large, and training on such a small portion of it is meaningless. The above is my idea, and I hope it is helpful to you. In a follow-up I will update this post with a new way to make this work.
