http://www.ithao123.cn/content-296918.html
Python text mining: Simple natural language statistics
2015-05-12
[Summary: This post mainly uses the NLTK (Natural Language Toolkit) package. In fact, simple natural language processing and statistics were already applied when machine learning methods were used for sentiment analysis, for example turning tokenized text into two-word collocations (or two-word sequences)...]
The NLTK (Natural Language Toolkit) package is used throughout.
In fact, simple natural language processing and statistics were already used before, when machine learning methods were applied to sentiment analysis: for example, turning tokenized text into two-word collocations (two-word sequences), finding the most frequent words in a corpus, and using statistical measures to find the most informative collocations. Let's review these first.
1. Turn text into two-word or three-word collocations
import nltk
example_1 = ['I', 'am', 'a', 'big', 'apple', '.']
print nltk.bigrams(example_1)
>> [('I', 'am'), ('am', 'a'), ('a', 'big'), ('big', 'apple'), ('apple', '.')]
print nltk.trigrams(example_1)
>> [('I', 'am', 'a'), ('am', 'a', 'big'), ('a', 'big', 'apple'), ('big', 'apple', '.')]
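Note that in newer NLTK releases (3.x, with Python 3) nltk.bigrams() and nltk.trigrams() return generators rather than lists, so the results need to be wrapped in list() before printing. A minimal sketch under that assumption:

import nltk

example_1 = ['I', 'am', 'a', 'big', 'apple', '.']
# In NLTK 3 these functions return generators, so materialize them first.
print(list(nltk.bigrams(example_1)))
print(list(nltk.trigrams(example_1)))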
2. Find the most frequently used words in a corpus
from nltk.probability import FreqDist
example_2 = ['I', 'am', 'a', 'big', 'apple', '.', 'I', 'am', 'delicious', ',', 'I', 'smells', 'good', '.', 'I', 'taste', 'good', '.']
fdist = FreqDist(word for word in example_2)  # turn the text into a dictionary of words and word frequencies
print fdist.keys()  # words sorted from highest to lowest frequency
>> ['I', '.', 'am', 'good', ',', 'a', 'apple', 'big', 'delicious', 'smells', 'taste']
print fdist.values()  # the number of occurrences of each word in the corpus, in descending order
>> [4, 3, 2, 2, 1, 1, 1, 1, 1, 1, 1]
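In NLTK 3 a FreqDist behaves like collections.Counter and keys() is no longer sorted by frequency, so the top words are better obtained with most_common(). A minimal sketch, reusing example_2:

from nltk.probability import FreqDist

fdist = FreqDist(example_2)
# most_common(n) returns (word, count) pairs sorted from most to least frequent.
print(fdist.most_common(3))   # e.g. [('I', 4), ('.', 3), ('am', 2)]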
3. Find the most informative collocations
import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
bigrams = BigramCollocationFinder.from_words(example_2)
most_informative_chisq_bigrams = bigrams.nbest(BigramAssocMeasures.chi_sq, 3)  # find them with the chi-squared statistic
most_informative_pmi_bigrams = bigrams.nbest(BigramAssocMeasures.pmi, 3)  # find them with pointwise mutual information
print most_informative_chisq_bigrams
>> [('a', 'big'), ('big', 'apple'), ('delicious', ',')]
print most_informative_pmi_bigrams
>> [('a', 'big'), ('big', 'apple'), ('delicious', ',')]
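On a larger corpus it is usually helpful to drop very rare collocations before ranking them; the sketch below (reusing example_2) uses apply_freq_filter() and score_ngrams(), both provided by BigramCollocationFinder:

from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

finder = BigramCollocationFinder.from_words(example_2)
finder.apply_freq_filter(2)  # keep only bigrams that occur at least twice
# score_ngrams() returns (bigram, score) pairs sorted by the chosen measure.
print(finder.score_ngrams(BigramAssocMeasures.pmi))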
The next step is some simple statistics over natural language text, including: the number of words, the number of sentences, the number of words with different parts of speech, information entropy, and perplexity. These are much easier to compute than the methods above.
1. Count the number of words and the number of sentences
In Python, as long as the text has been split into sentences, tokenized into words, and stored as a nested list, the number of sentences and words can be computed directly. The text takes the following form:
[["Phone", "screen", "very", "good", ","], ["Lens", "also", "good", "."], ["Mobile phone", "good", "rotten", ","], ["No", "method", "endure", "up", "!", "!"]]
Because the text is stored as a nested list, it can be iterated directly with a for loop, and the len() function gives the number of sentences and the number of words.
sent_num = len(sents)
word_num = len(words)
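Here sents and words are assumed to be the nested sentence list and a flattened word list. A minimal self-contained sketch of the same idea, flattening on the fly:

sents = [["Phone", "screen", "very", "good", ","],
         ["Lens", "also", "good", "."]]
sent_num = len(sents)                         # number of sentences
word_num = sum(len(sent) for sent in sents)   # total number of words
print(sent_num, word_num)                     # 2 9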
2. Count the number of words with different parts of speech
import jieba.posseg

def postagger(sentence, para):  # para is kept from the original helper's signature; only the 'list' form is shown here
    pos_data = jieba.posseg.cut(sentence)
    pos_list = []
    for w in pos_data:
        pos_list.append((w.word, w.flag))  # make each word and its tag a tuple and append it to the list
    return pos_list

def count_adj_adv(all_review):  # only count adjectives, adverbs and verbs
    adj_adv_num = []
    a = 0
    d = 0
    v = 0
    for review in all_review:
        pos = postagger(review, 'list')  # originally called through the author's own tp helper module
        for i in pos:
            if i[1] == 'a':
                a += 1
            elif i[1] == 'd':
                d += 1
            elif i[1] == 'v':
                v += 1
        adj_adv_num.append((a, d, v))
        a = 0
        d = 0
        v = 0
    return adj_adv_num
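A small usage example (the two review strings below are made up for illustration; the exact counts depend on how jieba tags each word):

all_review = [u'手机屏幕很好,镜头也不错。', u'手机好烂,没法忍了!!']
print(count_adj_adv(all_review))   # e.g. [(2, 1, 0), (1, 1, 1)] -- illustrative only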
3. Calculate information entropy and perplexity with NLTK
Information entropy has many interpretations; when it is used together with perplexity, the pair mainly expresses how "surprising" a piece of text is. Given a corpus, if the content of one text is similar to the other texts, its entropy and perplexity are small. If the content of a text is very different from the other texts, it is very "surprising", and its entropy and perplexity are large. NLTK provides methods for computing information entropy and perplexity: first "train" an n-gram model on all of the texts, and then use this model to compute the entropy and perplexity of each individual text.
import itertools
from nltk.model.ngram import NgramModel

example_3 = [['I', 'am', 'a', 'big', 'apple', '.'], ['I', 'am', 'delicious', ','], ['I', 'smells', 'good', '.', 'I', 'taste', 'good', '.']]
train = list(itertools.chain(*example_3))  # flatten the data into a one-dimensional list to train the model
ent_per_model = NgramModel(1, train, estimator=None)  # train a unigram model used to compute entropy and perplexity

def entropy_perplexity(model, dataset):
    ep = []
    for r in dataset:
        ent = model.entropy(r)
        per = model.perplexity(r)
        ep.append((ent, per))
    return ep
print entropy_perplexity(ent_per_model, example_3)
>> [(4.152825201361557, 17.787911185335403), (4.170127240384194, 18.002523441208137), ...]
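The nltk.model package was removed in NLTK 3, so on recent versions the same unigram calculation has to be done by hand (or with the newer nltk.lm module). A minimal sketch under that assumption, using a maximum-likelihood unigram model (no smoothing, so it fails on words unseen in training, and the exact numbers may differ from the NLTK 2 output above):

import math
import itertools
from nltk.probability import FreqDist

example_3 = [['I', 'am', 'a', 'big', 'apple', '.'], ['I', 'am', 'delicious', ','],
             ['I', 'smells', 'good', '.', 'I', 'taste', 'good', '.']]
fdist = FreqDist(itertools.chain(*example_3))   # unigram counts over the whole corpus

def unigram_entropy_perplexity(texts):
    results = []
    for text in texts:
        # cross-entropy in bits per word under the unigram MLE model
        ent = -sum(math.log(fdist.freq(w), 2) for w in text) / len(text)
        results.append((ent, 2 ** ent))
    return results

print(unigram_entropy_perplexity(example_3))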
Entropy and Information Gain
As was mentioned before, there are several methods for identifying the most informative feature for a decision stump. One popular alternative, called information gain, measures how much more organized the input values become when we divide them up using a given feature. To measure how disorganized the original set of input values is, we calculate the entropy of their labels, which will be high if the input values have highly varied labels, and low if many input values all have the same label. In particular, entropy is defined as the sum of the probability of each label times the log probability of that same label:
(1)   H = -Σ_{l ∈ labels} P(l) × log2 P(l)
Figure 4.2: The entropy of labels in the name gender prediction task, as a function of the percentage of names in a given set that are male.
For example, Figure 4.2 shows how the entropy of labels in the name gender prediction task depends on the ratio of male to female names. Note that if most input values have the same label (e.g., if P(male) is near 0 or near 1), then entropy is low. In particular, labels that have low frequency do not contribute much to the entropy (since P(l) is small), and labels with high frequency also do not contribute much to the entropy (since log2 P(l) is small). On the other hand, if the input values have a wide variety of labels, then there are many labels with a "medium" frequency, where neither P(l) nor log2 P(l) is small, so the entropy is high. Example 4.3 demonstrates how to calculate the entropy of a list of labels.
import math
import nltk

def entropy(labels):
    freqdist = nltk.FreqDist(labels)
    probs = [freqdist.freq(l) for l in freqdist]
    return -sum(p * math.log(p, 2) for p in probs)

>>> print(entropy(['male', 'male', 'male', 'male']))
0.0
>>> print(entropy(['male', 'female', 'male', 'male']))
0.811...
>>> print(entropy(['female', 'male', 'female', 'male']))
1.0
>>> print(entropy(['female', 'female', 'male', 'female']))
0.811...
>>> print(entropy(['female', 'female', 'female', 'female']))
0.0

Example 4.3 (code_entropy.py): Figure 4.3: Calculating the entropy of a list of labels
Once we have calculated the entropy of the original set of input values' labels, we can determine how much more organized the labels become once we apply the decision stump. To do so, we calculate the entropy for each of the decision stump's leaves, and take the average of those leaf entropy values (weighted by the number of samples in each leaf). The information gain is then equal to the original entropy minus this new, reduced entropy. The higher the information gain, the better job the decision stump does of dividing the input values into coherent groups, so we can build decision trees by selecting the decision stumps with the highest information gain.
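As a hypothetical illustration of that description (not NLTK's own API), the information gain of a binary decision stump can be computed from the entropy() function defined above; the function name and split arguments here are assumptions made for the sketch:

def information_gain(labels, left_labels, right_labels):
    # weighted average of the leaf entropies, weighted by leaf size
    total = float(len(labels))
    remainder = (len(left_labels) / total) * entropy(left_labels) \
              + (len(right_labels) / total) * entropy(right_labels)
    # gain = original entropy minus the new, reduced entropy
    return entropy(labels) - remainder

print(information_gain(['male', 'female', 'male', 'female'],
                       ['male', 'male'], ['female', 'female']))   # 1.0 -- a perfect split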
Another consideration for decision trees is efficiency. The simple algorithm for selecting decision stumps described above must construct a candidate decision stump for every possible feature, and this process must be repeated for every node in the constructed decision tree. A number of algorithms have been developed to cut down on the training time by storing and reusing information about previously evaluated examples.
Decision trees have a number of useful qualities. To begin with, they're simple to understand and easy to interpret. This is especially true near the top of the decision tree, where it is usually possible for the learning algorithm to find very useful features. Decision trees are especially well suited to cases where many hierarchical categorical distinctions can be made. For example, decision trees can be very effective at capturing phylogeny trees.
However, decision trees also have a few disadvantages. One problem is that, since each branch in the decision tree splits the training data, the amount of training data available to train nodes lower in the tree can become quite small. As a result, these lower decision nodes may overfit the training set, learning patterns that reflect idiosyncrasies of the training set rather than linguistically significant patterns in the underlying problem. One solution to this problem is to stop dividing nodes once the amount of training data becomes too small. Another solution is to grow a full decision tree, but then to prune decision nodes that do not improve performance on a dev-test set.
A second problem with decision trees is that they force features to be checked in a specific order, even when features act relatively independently of one another. For example, when classifying documents into topics (such as sports, automotive, or murder mystery), features such as hasword(football) are highly indicative of a specific label, regardless of what the other feature values are. Since there is limited space near the top of the decision tree, most of these features will need to be repeated on many different branches in the tree. And since the number of branches increases exponentially as we go down the tree, the amount of repetition can be very large.
A related problem is that decision trees are not good at making use of features that are weak predictors of the correct label. Since these features make relatively small incremental improvements, they tend to occur very low in the decision tree. But by the time the decision tree learner has descended far enough to use these features, there is not enough training data left to reliably determine what effect they should have. If we could instead look at the effect of these features across the entire training set, then we might be able to make some conclusions about how they should affect the choice of label.
The fact that decision trees require that features be checked in a specific order limits their ability to exploit features that are relatively independent of one another. The naive Bayes classification method, which we'll discuss next, overcomes this limitation by allowing all features to act "in parallel."