Use Python to implement a small text classification system

Source: Internet
Author: User
Tags: svm, idf
Background

Text mining is the process of extracting previously unknown, understandable, and ultimately usable knowledge from large amounts of text data, and using that knowledge to organize information better for future reference. In other words, it is the process of finding knowledge in unstructured text.

Text mining currently covers several main areas:

  • Search and information retrieval (IR)

  • Text clustering: use clustering methods to group words, fragments, paragraphs, or documents.

  • Text classification: assign fragments, paragraphs, or documents to classes, using data-mining classification methods and a model trained on labeled instances.

  • Web mining: data and text mining on the Internet, with special attention to the scale and interconnectedness of the web.

  • Information extraction (IE): identify and extract relevant facts and relationships from unstructured text; extract structured data from unstructured or semi-structured text.

  • Natural language processing (NLP): discover the structure and meaning of language from the perspective of syntax and semantics.

Text Classification System (Python 3.5)

Text classification for Chinese mainly involves the following steps:

1. Preprocessing: remove noise from the text (such as HTML tags), convert the text format, and detect sentence boundaries.

2. Chinese word segmentation: segment the text with a Chinese word segmenter and remove stop words.

3. Build the word vector space: count term frequencies and generate the text's word vector space.

4. Weighting strategy (TF-IDF): use TF-IDF to find feature words and extract the features that reflect the document's topic.

5. Classification: use an algorithm to train a classifier.

6. Evaluate the classification results.

1. Preprocessing

A. Select the scope of text to process.

B. Build a text corpus for classification:

  • Training-set corpus: text resources whose classes have already been assigned.

  • Test-set corpus: the texts to be classified; they can be a held-out part of the training set or come from an external source.

C. Text format conversion: remove HTML tags, for example with the Python lxml library (a short sketch follows this list).

D. Sentence boundary detection: mark the end of each sentence.
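As a minimal sketch of step C, HTML tags can be stripped with lxml's text_content(); the sample string and the helper name strip_html are illustrative assumptions, not part of the original project:

from lxml import html

def strip_html(raw_html):
    # Parse the HTML fragment and keep only the visible text
    tree = html.fromstring(raw_html)
    return tree.text_content().strip()

print(strip_html("<html><body><p>Text <b>mining</b> demo.</p></body></html>"))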

2. Chinese word segmentation

Word segmentation is the process of recombining a sequence of characters into a sequence of words according to certain rules. Chinese word segmentation means splitting a sequence of Chinese characters (a sentence) into separate words. It is a difficult task and, to some extent, not purely an algorithmic problem; in practice it is handled probabilistically, typically with a conditional random field (CRF), which is based on a probabilistic graphical model.

Word segmentation is the most basic, lowest-level module in natural language processing, and its accuracy has a large impact on all downstream modules. Structured representation of text or sentences is a core task in language processing. Current structured representations of text fall into four categories: the word vector space, topic models, dependency-syntax trees, and RDF graph representations.
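As a quick illustration of how the jieba segmenter splits a sentence (a sketch; the example sentence and the indicated output are only illustrative):

import jieba

sentence = "小明硕士毕业于中国科学院计算所"
words = jieba.cut(sentence)      # returns a generator of segmented words
print(" ".join(words))           # e.g. 小明 硕士 毕业 于 中国科学院 计算所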

The following is sample code for segmenting the whole corpus:

# -*- coding: utf-8 -*-
import os
import jieba

def savefile(savepath, content):
    fp = open(savepath, "w", encoding='gb2312', errors='ignore')
    fp.write(content)
    fp.close()

def readfile(path):
    fp = open(path, "r", encoding='gb2312', errors='ignore')
    content = fp.read()
    fp.close()
    return content

corpus_path = "train_small/"    # path of the unsegmented training corpus
seg_path = "train_seg/"         # path of the segmented output corpus
# corpus_path = "test_small/"   # use these two paths for the test corpus instead
# seg_path = "test_seg/"

catelist = os.listdir(corpus_path)  # all category subdirectories under the corpus
for mydir in catelist:
    class_path = corpus_path + mydir + "/"  # full path of the category subdirectory
    seg_dir = seg_path + mydir + "/"        # output directory for the segmented category
    if not os.path.exists(seg_dir):         # create the output directory if needed
        os.makedirs(seg_dir)
    file_list = os.listdir(class_path)
    for file_path in file_list:
        fullname = class_path + file_path
        content = readfile(fullname).strip()            # read the file content
        content = content.replace("\r\n", "").strip()   # remove line breaks and extra spaces
        content_seg = jieba.cut(content)
        savefile(seg_dir + file_path, " ".join(content_seg))

print("Word segmentation finished")

For convenience when generating the word vector space model later, the segmented text must be converted into a text vector object. The Bunch data structure from the scikit-learn library is used; the code is as follows:

import os
import pickle
from sklearn.datasets.base import Bunch

# Bunch stores the data set as key/value pairs:
#   target_name - list of all class names
#   label       - class label of each file
#   filenames   - path of each file
#   contents    - segmented content (word vector form) of each file

def readfile(path):
    fp = open(path, "r", encoding='gb2312', errors='ignore')
    content = fp.read()
    fp.close()
    return content

bunch = Bunch(target_name=[], label=[], filenames=[], contents=[])

# wordbag_path = "train_word_bag/train_set.dat"   # paths for the training set
# seg_path = "train_seg/"
wordbag_path = "test_word_bag/test_set.dat"        # paths for the test set
seg_path = "test_seg/"

catelist = os.listdir(seg_path)
bunch.target_name.extend(catelist)  # save the class names in the Bunch object
for mydir in catelist:
    class_path = seg_path + mydir + "/"
    file_list = os.listdir(class_path)
    for file_path in file_list:
        fullname = class_path + file_path
        bunch.label.append(mydir)                           # class label of the current file
        bunch.filenames.append(fullname)                    # path of the current file
        bunch.contents.append(readfile(fullname).strip())   # segmented content of the file

# Persist the Bunch object
file_obj = open(wordbag_path, "wb")
pickle.dump(bunch, file_obj)
file_obj.close()
print("Building the text object finished")

3. Vector space model

Because text stored as a vector space has a very high dimensionality, some words are automatically filtered out before classification to save storage space and improve search efficiency. These filtered words are called stop words; a ready-made Chinese stop-word list can be downloaded (see the resources at the end of the article).
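As a minimal sketch of stop-word filtering, assuming the stop-word file path that appears later in the article (the helper names and the sample word list are illustrative assumptions):

def load_stopwords(path, encoding='gb2312'):
    # Read the stop-word file, one word per line, into a set
    with open(path, "r", encoding=encoding, errors='ignore') as fp:
        return set(line.strip() for line in fp if line.strip())

def remove_stopwords(words, stopwords):
    # Keep only the words that are not stop words
    return [w for w in words if w not in stopwords]

# stopwords = load_stopwords("train_word_bag/hlt_stop_words.txt")
# print(remove_stopwords(["我们", "喜欢", "文本", "分类"], stopwords))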

4. Weighting strategy: the TF-IDF method

If a word or phrase appears frequently in one article but rarely in other articles, it is considered to have good discriminating power and is well suited for classification.

Before presenting the code for this part, let's look at the concepts of term frequency and inverse document frequency.

Term frequency (TF) is the frequency with which a given word appears in a document. The raw count is normalized by the document length to prevent a bias toward long documents. For a word t_i in document d_j, its importance can be expressed as

tf_ij = n_ij / Σ_k n_kj

where the numerator n_ij is the number of times the word appears in the document and the denominator is the total number of word occurrences in the document.

Inverse document frequency (IDF) measures how generally important a word is. The IDF of a specific word is obtained by dividing the total number of documents by the number of documents that contain the word, and then taking the logarithm of the quotient:

idf_i = log( |D| / |{j : t_i ∈ d_j}| )

Here |D| is the total number of documents in the corpus and the denominator counts the documents containing the word. If the word does not occur in the corpus the denominator would be zero, so 1 is usually added to it: idf_i = log( |D| / (1 + |{j : t_i ∈ d_j}|) ).
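As a small worked example of the two formulas (the counts are illustrative assumptions, not data from the article):

import math

n_word_in_doc = 3        # the word appears 3 times in the document
n_words_in_doc = 100     # the document contains 100 word occurrences in total
tf = n_word_in_doc / n_words_in_doc                 # tf = 0.03

total_docs = 10000       # |D|: documents in the corpus
docs_with_word = 100     # documents that contain the word
idf = math.log(total_docs / (1 + docs_with_word))   # idf ≈ 4.6

print("tf-idf =", tf * idf)                          # ≈ 0.138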

Multiplying term frequency by inverse document frequency means that a high term frequency within a given document, combined with a low document frequency across the whole collection, produces a high TF-IDF weight. TF-IDF therefore tends to filter out common words and keep the important ones. The code is as follows:

import os
import pickle
from sklearn.datasets.base import Bunch
from sklearn.feature_extraction.text import TfidfTransformer  # TF-IDF transformer class
from sklearn.feature_extraction.text import TfidfVectorizer   # TF-IDF vectorizer class

def readbunchobj(path):
    file_obj = open(path, "rb")
    bunch = pickle.load(file_obj)
    file_obj.close()
    return bunch

def writebunchobj(path, bunchobj):
    file_obj = open(path, "wb")
    pickle.dump(bunchobj, file_obj)
    file_obj.close()

def readfile(path):
    fp = open(path, "r", encoding='gb2312', errors='ignore')
    content = fp.read()
    fp.close()
    return content

# Load the segmented training-set Bunch object
path = "train_word_bag/train_set.dat"
bunch = readbunchobj(path)

# Load the stop-word list
stopword_path = "train_word_bag/hlt_stop_words.txt"
stpwrdlst = readfile(stopword_path).splitlines()

tfidfspace = Bunch(target_name=bunch.target_name, label=bunch.label,
                   filenames=bunch.filenames, tdm=[], vocabulary={})

# Initialize the vector space model with the stop-word list
vectorizer = TfidfVectorizer(stop_words=stpwrdlst, sublinear_tf=True, max_df=0.5)
transformer = TfidfTransformer()  # computes the TF-IDF weight of each word

# Convert the texts into a TF-IDF term-document matrix and keep the vocabulary
tfidfspace.tdm = vectorizer.fit_transform(bunch.contents)
tfidfspace.vocabulary = vectorizer.vocabulary_

# Persist the TF-IDF vector space
space_path = "train_word_bag/tfidfspace.dat"
writebunchobj(space_path, tfidfspace)
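After the script runs, the persisted space can be inspected, for example (a sketch, assuming the paths used above):

import pickle

with open("train_word_bag/tfidfspace.dat", "rb") as f:
    space = pickle.load(f)

print(space.tdm.shape)        # (number of documents, vocabulary size)
print(len(space.vocabulary))  # number of distinct terms kept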

5. Using the naive Bayes classification module

Common text classification methods include kNN (k-nearest neighbors), naive Bayes, and SVM. In general:

The kNN algorithm is the simplest and its classification accuracy is acceptable, but it is the slowest.

The naive Bayes algorithm works best for short-text classification and has high accuracy.

The advantage of the SVM algorithm is that it supports linearly inseparable cases; its accuracy is moderate.

The code above operates on the training-set data. Next comes the test set. Its processing steps are the same as for the training set: first word segmentation, then generation of the word-vector files, up to building the word-vector model. The difference is that, when building the test set's vector model, the training set's bag of words must be loaded so that the word vectors generated from the test set are mapped onto the training set's vocabulary. The vector space model is generated with the following code:

import os
import pickle
from sklearn.datasets.base import Bunch
from sklearn.feature_extraction.text import TfidfTransformer  # TF-IDF transformer class
from sklearn.feature_extraction.text import TfidfVectorizer   # TF-IDF vectorizer class

def readbunchobj(path):
    file_obj = open(path, "rb")
    bunch = pickle.load(file_obj)
    file_obj.close()
    return bunch

def writebunchobj(path, bunchobj):
    file_obj = open(path, "wb")
    pickle.dump(bunchobj, file_obj)
    file_obj.close()

def readfile(path):
    fp = open(path, "r", encoding='gb2312', errors='ignore')
    content = fp.read()
    fp.close()
    return content

# Load the segmented test-set Bunch object
path = "test_word_bag/test_set.dat"
bunch = readbunchobj(path)

# Load the stop-word list
stopword_path = "train_word_bag/hlt_stop_words.txt"
stpwrdlst = readfile(stopword_path).splitlines()

testspace = Bunch(target_name=bunch.target_name, label=bunch.label,
                  filenames=bunch.filenames, tdm=[], vocabulary={})

# Load the training set's bag of words
trainbunch = readbunchobj("train_word_bag/tfidfspace.dat")

# Initialize the vector space with the training set's vocabulary
vectorizer = TfidfVectorizer(stop_words=stpwrdlst, sublinear_tf=True, max_df=0.5,
                             vocabulary=trainbunch.vocabulary)
transformer = TfidfTransformer()
testspace.tdm = vectorizer.fit_transform(bunch.contents)
testspace.vocabulary = trainbunch.vocabulary

# Persist the test vector space
space_path = "test_word_bag/testspace.dat"
writebunchobj(space_path, testspace)

Run the multinomial naive Bayes algorithm on the test set to classify the texts and report the error rate. The code is as follows:

import pickle
from sklearn.naive_bayes import MultinomialNB  # multinomial naive Bayes algorithm

def readbunchobj(path):
    file_obj = open(path, "rb")
    bunch = pickle.load(file_obj)
    file_obj.close()
    return bunch

# Load the training-set vector space
trainpath = "train_word_bag/tfidfspace.dat"
train_set = readbunchobj(trainpath)

# Load the test-set vector space
testpath = "test_word_bag/testspace.dat"
test_set = readbunchobj(testpath)

# Train the classifier
# alpha: the smaller alpha is, the higher the accuracy tends to be
clf = MultinomialNB(alpha=0.001).fit(train_set.tdm, train_set.label)

# Predict the classification results
predicted = clf.predict(test_set.tdm)
total = len(predicted)
rate = 0
for flabel, file_name, expct_cate in zip(test_set.label, test_set.filenames, predicted):
    if flabel != expct_cate:
        rate += 1
        print(file_name, ": actual category:", flabel, "--> predicted category:", expct_cate)

# Error rate
print("error_rate:", float(rate) * 100 / float(total), "%")
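Since the article compares kNN, naive Bayes, and SVM, here is a hedged sketch of swapping the other two classifiers into the same script. KNeighborsClassifier and LinearSVC are my choice of scikit-learn implementations, not something specified by the article, and train_set/test_set are the objects loaded above:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# kNN: simplest, but slow at prediction time on large corpora
knn = KNeighborsClassifier(n_neighbors=5).fit(train_set.tdm, train_set.label)
print(knn.predict(test_set.tdm)[:5])

# Linear SVM: handles high-dimensional sparse TF-IDF features well
svm = LinearSVC(C=1.0).fit(train_set.tdm, train_set.label)
print(svm.predict(test_set.tdm)[:5])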

6. Classification result evaluation

Algorithm evaluation in the machine learning field has three basic indicators:

  • Recall: the ratio of the relevant documents retrieved by the system to all relevant documents in the collection; it measures how completely the retrieval system finds relevant material.

Recall = relevant documents retrieved by the system / all relevant documents in the collection

  • Precision: the ratio of the relevant documents retrieved to the total number of documents retrieved; it measures how precise the retrieval system is.

Precision = relevant documents retrieved by the system / all documents retrieved by the system

Precision and recall trade off against each other. Ideally both are high, but in general when precision is high, recall tends to be low, and when recall is high, precision tends to be low.

  • F-score (F_β): the formula is

F_β = (1 + β²) · precision · recall / (β² · precision + recall)

When β = 1 this is the most common F1-measure:

F1 = 2 · precision · recall / (precision + recall)

A small numeric example of the three metrics follows.
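As a small worked example (the counts are illustrative assumptions):

retrieved_relevant = 40      # relevant documents the system retrieved
all_relevant = 50            # relevant documents in the whole collection
all_retrieved = 80           # documents the system retrieved in total

recall = retrieved_relevant / all_relevant          # 0.8
precision = retrieved_relevant / all_retrieved      # 0.5
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.615

print(precision, recall, f1)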

The specific evaluation code is as follows:

from sklearn import metrics

# Evaluate precision, recall, and F1-score
# average='weighted' is added here because the task is multi-class;
# the original snippet called the metrics without an averaging argument.
def metrics_result(actual, predict):
    print("precision: {0:.3f}".format(metrics.precision_score(actual, predict, average='weighted')))
    print("recall: {0:.3f}".format(metrics.recall_score(actual, predict, average='weighted')))
    print("f1-score: {0:.3f}".format(metrics.f1_score(actual, predict, average='weighted')))

metrics_result(test_set.label, predicted)

Resources: the Chinese text corpus, the Chinese stop-word set, and the complete project code were provided as downloads in the original article.

The preceding sections describe how to use Python to implement a small text classification system.
