Using a Semi-Supervised Algorithm for Text Classification

Abstract: This post describes using a semi-supervised algorithm for text classification (binary classification), mainly following the Sklearn example that applies a semi-supervised algorithm to digit recognition. Be warned up front that this is a failed attempt: training could not get past roughly the 15,000th sample before it errored out. If your data volume is not very large, you can still follow this procedure. There is a lot worth learning here, especially about text preprocessing. I will update this post once I find a way through.

I. Operating Procedures
  • A total of about 1 million samples, of which roughly 7,000 are labeled; all the rest are labeled -1 (i.e. unlabeled)
  • Preprocess the text (jieba word segmentation, stop-word removal, etc.)
  • Convert the processed text into TFIDF vectors with the Gensim library
  • Use the SciPy library to convert the format of the TFIDF vectors so that they can be trained on by the Sklearn library's algorithm packages
  • Feed the preprocessed data into the model and train
  • Take the 500 samples the model is least certain about, label them manually, and add them back to the training set
  • Repeat the process until the classifier is good enough to meet your requirements (a minimal sketch of this loop follows the list)
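
To make that loop concrete, here is a minimal sketch of one round of it. This is not the code from this post, just an outline under its assumptions: X is the feature matrix, y uses -1 for unlabeled samples, and the uncertainty measure mirrors the Sklearn entropy trick shown in step 5 below.

import numpy as np
from scipy import stats
from sklearn.semi_supervised import LabelSpreading

def active_learning_round(X, y, n_query=500):
    """Fit on partially labeled data and return the indices to label next."""
    model = LabelSpreading()
    model.fit(X, y)                      # y: class labels, -1 = unlabeled
    # Entropy of each sample's predicted label distribution = its uncertainty
    entropies = stats.distributions.entropy(model.label_distributions_.T)
    unlabeled = np.where(np.asarray(y) == -1)[0]
    order = np.argsort(entropies)[::-1]  # most uncertain first
    query = order[np.in1d(order, unlabeled)][:n_query]
    return model, query

# Label the queried samples by hand, write the new labels into y, and repeat.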
II. Main Text

I will not repeat the jieba word segmentation and stop-word removal steps here; they are relatively straightforward (a minimal sketch follows). What I do want to share is how to convert the TFIDF vectors trained with Gensim into the format required by the Sklearn library's algorithm packages.
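
For completeness, here is a rough sketch of that preprocessing step. It assumes a stop-word file with one word per line at a hypothetical path stopwords.txt; it is not the exact code used in this post.

import jieba

# Hypothetical stop-word file: one word per line
with open('stopwords.txt', encoding='utf-8') as f:
    stopwords = set(line.strip() for line in f)

def preprocess(texts):
    """Segment each document with jieba and drop stop words."""
    word_list = []
    for text in texts:
        tokens = [w for w in jieba.cut(text) if w.strip() and w not in stopwords]
        word_list.append(tokens)
    return word_list  # a list of token lists, as expected by corpora.Dictionary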

At this point you will surely ask: why go to all this trouble instead of just calling Sklearn to compute TFIDF directly? The reason is simple: that is what I did at first, and it raised a memory error. My guess is that the vector dimension was too large, because the TFIDF vector Sklearn computed for each sentence reached more than 30,000 dimensions. So I decided to use Gensim to train the TFIDF vectors instead, which has the benefit that you can control the size of the dimension (via an LSI model, as shown in step 3 below).
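
For reference, the direct Sklearn route I tried first looks roughly like the sketch below (not my original code; texts_joined is a placeholder for the documents with their tokens joined by spaces). On a large corpus the vocabulary, and therefore the vector dimension, easily grows past 30,000, which is where the memory problem comes from.

from sklearn.feature_extraction.text import TfidfVectorizer

# One column per vocabulary term, so the dimension grows with the corpus
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts_joined)  # texts_joined: space-joined tokens
print(X.shape)                              # (n_documents, vocabulary_size)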

So how do you calculate TFIDF vectors?
You can also read this article: Calculate TFIDF values in different ways.

1. Use Gensim to train the TFIDF model and save it, so you do not have to train it again the next time you use it. Code:
from gensim import corpora, models

# Build a dictionary and a bag-of-words corpus from the tokenized documents
dictionary = corpora.Dictionary(word_list)
new_corpus = [dictionary.doc2bow(text) for text in word_list]

# Train the TFIDF model and save it for later reuse
tfidf = models.TfidfModel(new_corpus)
tfidf.save('my_model.tfidf')
2. Load the model, transform your data, and get the TFIDF vectors
tfidf = models.TfidfModel.load('my_model.tfidf')

# `words` is the list of segmented documents, each one a space-joined string of tokens
tfidf_vec = []
for i in range(len(words)):
    string = words[i]
    string_bow = dictionary.doc2bow(string.split())  # same dictionary as in step 1
    string_tfidf = tfidf[string_bow]
    tfidf_vec.append(string_tfidf)
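
Note that this loop still uses the dictionary built in step 1, so if you load the TFIDF model in a new session you need to persist the dictionary as well. A minimal sketch (the file name my_dict.dict is just an example, not from the original post):

from gensim import corpora

# Save the dictionary alongside the TFIDF model ...
dictionary.save('my_dict.dict')

# ... and reload it before transforming new text
dictionary = corpora.Dictionary.load('my_dict.dict')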
This is the TFIDF vector you get at this point:
    • Each document is a list of (term ID, TFIDF weight) tuples
    • The Sklearn algorithm packages cannot train on this format, so what do we do?
[[(0, 0.44219328927835233),
  (1, 0.5488488134902755),
  (2, 0.28062764931589196),
  (3, 0.5488488134902755),
  (4, 0.3510600763648036)],
 [(5, 0.2952063480959091),
  (6, 0.3085138762011414),
  (7, 0.269806482343891),
  (8, 0.21686460370108193),
  (9, 0.4621642239026475),
  (10, 0.5515758504022944),
  (11, 0.4242816486479956)],
 ......]
3. Use an LSI model to reduce the vectors to a specified number of dimensions
# num_topics=2 fixes each document vector at 2 dimensions
lsi_model = models.LsiModel(corpus=tfidf_vec, id2word=dictionary, num_topics=2)

lsi_vec = []
for i in range(len(words)):
    string = words[i]
    string_bow = dictionary.doc2bow(string.split())
    string_lsi = lsi_model[string_bow]  # project the document into LSI space
    lsi_vec.append(string_lsi)
This is the resulting LSI vector (each document is now 2-dimensional):
[[(0, 9.98164139346566e-06), (1, 0.00017488533996265734)],
 [(0, 0.004624808817003378), (1, 0.0052712355563472625)],
 [(0, 0.005992863818284904), (1, 0.0028891269605347066)],
 [(0, 0.008813713819377964), (1, 0.004300294830187425)],
 [(0, 0.0010709978891676652), (1, 0.004264312831567625)],
 [(0, 0.005647948200006063), (1, 0.005816420698368305)],
 [(0, 1.1749284917071102e-05), (1, 0.0003525210498926822)],
 [(0, 0.05046596444596279), (1, 0.03750969796637345)],
 [(0, 0.0007876011346475033), (1, 0.008538972615602887)],
 ......]
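
Like the TFIDF model, the trained LSI model can be saved and reloaded so the dimensionality reduction does not have to be retrained every time. A minimal sketch, with a hypothetical file name:

# Persist the trained LSI model (the file name is just an example)
lsi_model.save('my_model.lsi')

# Reload it later instead of retraining
from gensim import models
lsi_model = models.LsiModel.load('my_model.lsi')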
4. Use the SciPy module to convert the data into a format Sklearn can train on
from scipy.sparse import csr_matrix

data = []
rows = []
cols = []
line_count = 0
for line in lsi_vec:
    for elem in line:
        rows.append(line_count)  # row index = document index
        cols.append(elem[0])     # column index = LSI dimension id
        data.append(elem[1])     # value = LSI weight
    line_count += 1

lsi_sparse_matrix = csr_matrix((data, (rows, cols)))  # sparse matrix
lsi_matrix = lsi_sparse_matrix.toarray()              # dense matrix
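
A quick sanity check on the result; with num_topics=2 the dense array is only n_documents x 2 columns, which is why calling toarray() stays affordable even on a large corpus:

print(lsi_sparse_matrix.shape)  # expected: (n_documents, 2)
print(lsi_matrix.shape)         # same shape, dense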
lsi_matrix is shown below:
    • This is the format Sklearn requires.
Out[53]:
array([[9.98164139e-06, 1.74885340e-04],
       [4.62480882e-03, 5.27123556e-03],
       [5.99286382e-03, 2.88912696e-03],
       ...,
       [1.85861559e-02, 3.24888917e-01],
       [8.07737902e-04, 5.45659458e-03],
       [2.61926460e-03, 2.30210522e-02]])
5. Call Sklearn's semi-supervised algorithm to train the data
    • The code below obtains the indices of the 2,000 samples the classifier is most uncertain about; the corresponding data can then be looked up by index and re-annotated manually.
    • Add the newly labeled data back and repeat the cycle until you are satisfied.
    • To evaluate the effect, you should also hold out about 1,000 of the labeled samples, set their labels to -1 (pretending you do not know them), train, and then compare the predictions against the true labels you kept aside; that tells you how good or bad the result is (a minimal sketch of this check follows the code below).
import numpy as np
from scipy import stats
from sklearn.semi_supervised import label_propagation

# y is the label column of the DataFrame `result`: labeled samples carry their
# class, everything else is -1 (unlabeled)
y = list(result.label.values)

n_total_samples = len(y)  # 1571794
n_labeled_points = 7804   # the first 7,804 samples are labeled; only they guide training
unlabeled_indices = np.arange(n_total_samples)[n_labeled_points:]  # unlabeled data

lp_model = label_propagation.LabelSpreading()  # train the model
lp_model.fit(lsi_matrix, y)
predicted_labels = lp_model.transduction_[unlabeled_indices]  # predicted labels

# Compute the entropy of the transduced label distributions
# lp_model.label_distributions_ : array, shape = [n_samples, n_classes]
# Categorical distribution for each item
pred_entropies = stats.distributions.entropy(lp_model.label_distributions_.T)

# Pick the indices of the 2,000 samples the classifier is most uncertain about
uncertainty_index = np.argsort(pred_entropies)[::-1]
uncertainty_index = uncertainty_index[
    np.in1d(uncertainty_index, unlabeled_indices)][:2000]
print(uncertainty_index)
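
To run the evaluation check described in the bullets above, you can hold out some labeled samples, hide their labels as -1 before fitting, and compare the transduced predictions against the truth afterwards. A minimal sketch under that assumption (the held-out range of 1,000 samples is hypothetical):

from sklearn.metrics import accuracy_score, classification_report

# Hypothetically hold out the last 1,000 labeled samples: remember their true
# labels, then mark them as unlabeled (-1) before fitting
holdout = np.arange(n_labeled_points - 1000, n_labeled_points)
y_true_holdout = [y[i] for i in holdout]
y_masked = list(y)
for i in holdout:
    y_masked[i] = -1

lp_model = label_propagation.LabelSpreading()
lp_model.fit(lsi_matrix, y_masked)

# Compare the transduced labels on the held-out samples with the truth
y_pred_holdout = lp_model.transduction_[holdout]
print(accuracy_score(y_true_holdout, y_pred_holdout))
print(classification_report(y_true_holdout, y_pred_holdout))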
III. Results and Discussion

In the end I did not continue down this road: my dataset is very large, and training on such a small portion of it is meaningless. The above is my idea, and I hope it is helpful to you. In a follow-up I will update this post with a new way to make this work.
