A small text classification system in Python (the corpus, the stop-word text file, and all project code are linked at the end of the article)


Background

Text mining is the process of extracting previously unknown, understandable, and ultimately usable knowledge from large amounts of text data, and using that knowledge to better organize information; in other words, it is the process of finding knowledge in unstructured text.

Currently, text mining covers the following main areas:

  • Search and information retrieval (IR)
  • Text clustering: grouping words, fragments, paragraphs, or documents with clustering methods
  • Text classification: assigning fragments, paragraphs, or documents to classes, using data mining classification methods trained on labeled instances
  • Web mining: data and text mining on the Internet, with special attention to the scale and interconnectedness of the web
  • Information extraction (IE): identifying and extracting relevant facts and relationships from unstructured text; extracting structured data from unstructured or semi-structured text
  • Natural language processing (NLP): discovering the structure and meaning of language from the perspective of syntax and semantics
Text Classification System (Python 3.5)

The text classification process for Chinese mainly includes the following steps:
1. Preprocessing: remove noise from the text (such as HTML tags), convert the text format, and detect sentence boundaries

2. Chinese word segmentation: segment the text with a Chinese word segmenter and remove stop words

3. Construct the word vector space: count term frequencies and generate the word vector space of the text

4. Weighting strategy (TF-IDF): use TF-IDF to find feature words and extract the features that reflect the topic of each document

5. Classification: use an algorithm to train a classifier

6. Evaluate the classification results

 

1. Preprocessing

A. Select the text processing range.

B. Establish a text corpus for classification

  • Training set corpus

Text resources that have already been divided into classes.

  • Test set corpus

The texts to be classified; these can be a part of the training set or texts from an external source.

C. Text format conversion: remove HTML tags, for example with the Python lxml library (a short sketch follows this list)

D. Sentence boundary detection: mark the ends of sentences
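As a minimal, illustrative sketch of step C (the sample_html string below is invented for the example and is not part of the project), HTML tags can be stripped with lxml roughly like this:

# -*- coding: utf-8 -*-
# Minimal sketch: strip HTML tags with lxml (sample_html is an invented example)
from lxml import html

sample_html = "<html><body><p>Text <b>classification</b> demo.</p></body></html>"

tree = html.fromstring(sample_html)        # parse the HTML fragment
plain_text = tree.text_content().strip()   # keep only the visible text
print(plain_text)                          # -> Text classification demo.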

2. Chinese Word Segmentation

Word segmentation is the process of recombining a continuous sequence of characters into word sequences according to certain rules. Chinese word segmentation means dividing a sequence of Chinese characters (a sentence) into separate words. Chinese word segmentation is very complicated and, to some extent, is not purely an algorithmic problem; in the end probability theory solves it, using the conditional random field (CRF), an algorithm based on probabilistic graphical models.

Word segmentation is the most basic, lowest-level module in natural language processing, and its accuracy has a large impact on the application modules built on top of it. The structured representation of texts or sentences is the core task of language processing; currently, structured representations of text fall into four categories: the word vector space model, topic models, tree representations of dependency syntax, and graph representations such as RDF.

The following is sample code for Chinese word segmentation:

 

# -*- coding: utf-8 -*-
import os
import jieba


def savefile(savepath, content):
    fp = open(savepath, "w", encoding="gb2312", errors="ignore")
    fp.write(content)
    fp.close()


def readfile(path):
    fp = open(path, "r", encoding="gb2312", errors="ignore")
    content = fp.read()
    fp.close()
    return content


# corpus_path = "train_small/"   # unsegmented training corpus path
# seg_path = "train_seg/"        # segmented training corpus path

corpus_path = "test_small/"      # unsegmented corpus path
seg_path = "test_seg/"           # corpus path after word segmentation

catelist = os.listdir(corpus_path)   # all category subdirectories under the corpus directory

for mydir in catelist:
    class_path = corpus_path + mydir + "/"   # full path of the category subdirectory
    seg_dir = seg_path + mydir + "/"         # output directory for the segmented category
    if not os.path.exists(seg_dir):          # create the directory if it does not exist
        os.makedirs(seg_dir)
    file_list = os.listdir(class_path)
    for file_path in file_list:
        fullname = class_path + file_path
        content = readfile(fullname).strip()            # read the file content
        content = content.replace("\r\n", "").strip()   # remove line breaks and extra spaces
        content_seg = jieba.cut(content)
        savefile(seg_dir + file_path, " ".join(content_seg))

print("Word segmentation finished")

 

To make it easy to generate the word vector space model later, the segmented text must be converted into text vector information and stored as an object. Here the Bunch data structure of the scikit-learn library is used. The code is as follows:

 

import os
import pickle
from sklearn.datasets.base import Bunch

# Bunch is a dict-like container with attribute access:
# target_name: list of all category names
# label: classification label of each file
# filenames: path of each file
# contents: segmented text (word vector form) of each file


def readfile(path):
    fp = open(path, "r", encoding="gb2312", errors="ignore")
    content = fp.read()
    fp.close()
    return content


bunch = Bunch(target_name=[], label=[], filenames=[], contents=[])

# wordbag_path = "train_word_bag/train_set.dat"
# seg_path = "train_seg/"
wordbag_path = "test_word_bag/test_set.dat"
seg_path = "test_seg/"

catelist = os.listdir(seg_path)
bunch.target_name.extend(catelist)   # save the category names into the Bunch object
for mydir in catelist:
    class_path = seg_path + mydir + "/"
    file_list = os.listdir(class_path)
    for file_path in file_list:
        fullname = class_path + file_path
        bunch.label.append(mydir)                          # classification label of the current file
        bunch.filenames.append(fullname)                   # path of the current file
        bunch.contents.append(readfile(fullname).strip())  # segmented text of the current file

# persist the Bunch object
file_obj = open(wordbag_path, "wb")
pickle.dump(bunch, file_obj)
file_obj.close()

print("Text object built")

 

3. Vector Space Model

Because text stored as a vector space has a very high dimensionality, some words are automatically filtered out before text classification in order to save storage space and improve search efficiency. These filtered words are called stop words; a ready-made stop word list can be downloaded from the link at the end of this article.
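As a minimal illustration of how such a list is applied (the token list below is invented; the stop word file path matches the one used later in this article), stop words can be filtered out of segmented text like this:

# Minimal sketch: drop stop words from segmented tokens (the token list is an invented example)
def load_stopwords(path):
    with open(path, "r", encoding="gb2312", errors="ignore") as fp:
        return set(fp.read().splitlines())

stopwords = load_stopwords("train_word_bag/hlt_stop_words.txt")
tokens = ["我们", "喜欢", "的", "机器", "学习"]           # e.g. the output of jieba.cut
filtered = [w for w in tokens if w not in stopwords]     # keep only non-stop words
print(filtered)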

4. Weight strategy: TF-IDF Method

If a word or phrase appears frequently in one article and rarely in other articles, it is considered to have good discriminating power and is well suited for use as a classification feature.

Before presenting this part of the code, let's look at the concepts of term frequency and inverse document frequency.

Term frequency (TF) is the frequency with which a given word appears in a document. The raw count is normalized by the document length to prevent a bias towards long documents. For a word t_i in a document d_j, its importance can be expressed as:

tf(i, j) = n(i, j) / Σ_k n(k, j)

where the numerator n(i, j) is the number of times the word appears in the document, and the denominator is the total number of occurrences of all words in that document.

Inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a specific word is obtained by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of the quotient:

idf(i) = log( |D| / (1 + |{ j : t_i ∈ d_j }|) )

Here |D| is the total number of documents in the corpus, and |{ j : t_i ∈ d_j }| is the number of documents containing the word. If the word never appears in the corpus this count would be zero, so 1 is added to the denominator.
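To make the two formulas concrete, here is a small self-contained sketch that computes TF-IDF by hand for a toy corpus (the three example documents are invented for illustration and are not part of the project):

import math

# toy corpus of already-segmented documents (invented for illustration)
docs = [["machine", "learning", "is", "fun"],
        ["text", "classification", "is", "machine", "learning"],
        ["text", "mining", "is", "fun"]]

def tf(term, doc):
    # occurrences of the term in the document / total terms in the document
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # log(total documents / (1 + documents containing the term)); the +1 avoids division by zero
    containing = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / (1 + containing))

for term in ("mining", "is"):
    weight = tf(term, docs[2]) * idf(term, docs)
    print(term, round(weight, 4))
# "is" appears in every document, so its IDF (and hence its TF-IDF) drops to zero or below,
# while the rarer "mining" keeps a positive weight -- common words are effectively filtered out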

When term frequency and inverse document frequency are multiplied, a high term frequency within a specific document combined with a low document frequency across the whole collection produces a high TF-IDF weight. TF-IDF therefore tends to filter out common words and keep the words that are important to a document. The code is as follows:

 

import os
import pickle                                                    # persistence
from sklearn.datasets.base import Bunch
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfTransformer    # TF-IDF weight transformer
from sklearn.feature_extraction.text import TfidfVectorizer     # TF-IDF vectorizer


def readbunchobj(path):
    file_obj = open(path, "rb")
    bunch = pickle.load(file_obj)
    file_obj.close()
    return bunch


def writebunchobj(path, bunchobj):
    file_obj = open(path, "wb")
    pickle.dump(bunchobj, file_obj)
    file_obj.close()


def readfile(path):
    fp = open(path, "r", encoding="gb2312", errors="ignore")
    content = fp.read()
    fp.close()
    return content


path = "train_word_bag/train_set.dat"
bunch = readbunchobj(path)

# stop words
stopword_path = "train_word_bag/hlt_stop_words.txt"
stpwrdlst = readfile(stopword_path).splitlines()

# build the TF-IDF vector space object
tfidfspace = Bunch(target_name=bunch.target_name, label=bunch.label,
                   filenames=bunch.filenames, tdm=[], vocabulary={})

# initialize the vector space model with TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words=stpwrdlst, sublinear_tf=True, max_df=0.5)
transformer = TfidfTransformer()   # computes the TF-IDF weight of each word

# convert the texts into a TF-IDF weighted term-document matrix and keep the vocabulary
tfidfspace.tdm = vectorizer.fit_transform(bunch.contents)
tfidfspace.vocabulary = vectorizer.vocabulary_

# persist the bag-of-words object
space_path = "train_word_bag/tfidfspace.dat"
writebunchobj(space_path, tfidfspace)

 

 

5. Use Naive Bayes classification Module

Common text classification methods include kNN (k-nearest neighbors), naive Bayes, and SVM (a short comparison sketch follows this list). In general:

The kNN algorithm is the simplest; its classification accuracy is acceptable, but it is the slowest.

The naive Bayes algorithm works best for short text classification and has high accuracy.

The advantage of the SVM algorithm is that it can handle linearly inseparable cases; its accuracy is moderate.
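As a rough sketch of how the three classifiers could be swapped on the same TF-IDF data (it assumes train_set and test_set have been loaded with readbunchobj from the persisted vector spaces, as in the classification code further below; KNeighborsClassifier and LinearSVC are standard scikit-learn classes, not part of the original project):

# Sketch: try kNN, naive Bayes, and a linear SVM on the same TF-IDF matrices
# (assumes train_set and test_set were loaded from the persisted vector spaces as shown later)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

classifiers = {
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "naive Bayes": MultinomialNB(alpha=0.001),
    "linear SVM": LinearSVC(),
}

for name, clf in classifiers.items():
    clf.fit(train_set.tdm, train_set.label)              # train on the training-set TF-IDF matrix
    accuracy = clf.score(test_set.tdm, test_set.label)   # mean accuracy on the test set
    print(name, round(accuracy, 3))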

The code above operates on the training-set data; next comes the test set. The processing steps are the same as for the training set: first word segmentation, then generation of the word-vector (Bunch) file, up to the generation of the TF-IDF vector space model. The difference is that when building the test-set vector space, the training-set bag of words must be loaded so that the word vectors generated from the test set are mapped onto the training-set vocabulary. The vector space model is generated with the following code:

 

import os
import pickle                                                    # persistence
from sklearn.datasets.base import Bunch
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfTransformer    # TF-IDF weight transformer
from sklearn.feature_extraction.text import TfidfVectorizer     # TF-IDF vectorizer

from TF_IDF import space_path   # the previous training script is assumed to be saved as TF_IDF.py


def readbunchobj(path):
    file_obj = open(path, "rb")
    bunch = pickle.load(file_obj)
    file_obj.close()
    return bunch


def writebunchobj(path, bunchobj):
    file_obj = open(path, "wb")
    pickle.dump(bunchobj, file_obj)
    file_obj.close()


def readfile(path):
    fp = open(path, "r", encoding="gb2312", errors="ignore")
    content = fp.read()
    fp.close()
    return content


# load the segmented test-set Bunch object
path = "test_word_bag/test_set.dat"
bunch = readbunchobj(path)

# stop words
stopword_path = "train_word_bag/hlt_stop_words.txt"
stpwrdlst = readfile(stopword_path).splitlines()

# build the test-set TF-IDF vector space
testspace = Bunch(target_name=bunch.target_name, label=bunch.label,
                  filenames=bunch.filenames, tdm=[], vocabulary={})

# load the training-set bag of words
trainbunch = readbunchobj("train_word_bag/tfidfspace.dat")

# initialize the vector space with TfidfVectorizer, using the training-set vocabulary
vectorizer = TfidfVectorizer(stop_words=stpwrdlst, sublinear_tf=True,
                             max_df=0.5, vocabulary=trainbunch.vocabulary)
transformer = TfidfTransformer()
testspace.tdm = vectorizer.fit_transform(bunch.contents)
testspace.vocabulary = trainbunch.vocabulary

# persist the test-set bag-of-words object
space_path = "test_word_bag/testspace.dat"
writebunchobj(space_path, testspace)

Apply the multinomial naive Bayes algorithm to classify the test texts and report the error rate. The code is as follows:

 

import pickle
from sklearn.naive_bayes import MultinomialNB   # multinomial naive Bayes


def readbunchobj(path):
    file_obj = open(path, "rb")
    bunch = pickle.load(file_obj)
    file_obj.close()
    return bunch


# load the training-set vector space
trainpath = "train_word_bag/tfidfspace.dat"
train_set = readbunchobj(trainpath)
# load the test-set vector space
testpath = "test_word_bag/testspace.dat"
test_set = readbunchobj(testpath)

# apply the multinomial naive Bayes algorithm
# alpha = 0.001: alpha is the smoothing parameter; a smaller alpha gave higher accuracy here
clf = MultinomialNB(alpha=0.001).fit(train_set.tdm, train_set.label)

# predict the classification results
predicted = clf.predict(test_set.tdm)
total = len(predicted)
rate = 0
for flabel, file_name, expct_cate in zip(test_set.label, test_set.filenames, predicted):
    if flabel != expct_cate:
        rate += 1
        print(file_name, ": actual category:", flabel, "--> predicted category:", expct_cate)

# error rate
print("error_rate:", float(rate) * 100 / float(total), "%")

 

 

6. Classification Result Evaluation

Algorithm evaluation in the machine learning field has three basic indicators:

  • Recall: the ratio of the relevant documents retrieved by the system to all relevant documents in the document library. It measures how completely the retrieval system finds relevant material.

Recall = relevant documents retrieved by the system / all relevant documents in the library

  • Precision: the ratio of the relevant documents retrieved to the total number of documents retrieved. It measures the precision of the retrieval system.

Precision = relevant documents retrieved by the system / total documents retrieved by the system

Precision and recall trade off against each other. Ideally both are high, but in practice high precision usually comes with low recall, and high recall with low precision.

  • F-score: combines precision (P) and recall (R) in a single measure. The formula is:

F_β = (1 + β²) · P · R / (β² · P + R)

When β = 1 this is the most common form, the F1 measure, and the relationship simplifies to:

F1 = 2 · P · R / (P + R)
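For example (the numbers are invented purely for illustration): if the system retrieves 8 documents of which 6 are relevant, and the library contains 12 relevant documents in total, then precision = 6/8 = 0.75, recall = 6/12 = 0.5, and F1 = 2 · 0.75 · 0.5 / (0.75 + 0.5) = 0.6.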

The specific evaluation code is as follows:



import numpy as np
from sklearn import metrics


# evaluation
def metrics_result(actual, predict):
    # average='weighted' is needed for multi-class labels in recent scikit-learn versions
    print("precision: {0:.3f}".format(metrics.precision_score(actual, predict, average='weighted')))
    print("recall:    {0:.3f}".format(metrics.recall_score(actual, predict, average='weighted')))
    print("f1-score:  {0:.3f}".format(metrics.f1_score(actual, predict, average='weighted')))


metrics_result(test_set.label, predicted)

Chinese text corpus
Chinese stopword text set
All project code
Original article link

 
