Chapter 9 Analyzing Text Data and Social Media
1 Installing NLTK (omitted)
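The installation itself is omitted here; a minimal sketch, assuming pip and a working Python environment, is to install the package with

pip install nltk

and then download the corpora and models this chapter relies on from a Python prompt:

import nltk
# Download the corpora and the POS tagger model used below (run once)
nltk.download('stopwords')
nltk.download('gutenberg')
nltk.download('averaged_perceptron_tagger')
nltk.download('movie_reviews')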
2 Filtering stop words, names, and numbers
The sample code is as follows:
import nltk

# Load the English stop word corpus
sw = set(nltk.corpus.stopwords.words('english'))
print('Stop words', list(sw)[:7])

# List part of the files in the Gutenberg corpus
gb = nltk.corpus.gutenberg
print('Gutenberg files', gb.fileids()[-5:])

# Take the first two sentences of milton-paradise.txt as the text to filter below
text_sent = gb.sents('milton-paradise.txt')[:2]
print('unfiltered', text_sent)

# Filter stop words, then drop proper nouns (NNP) and numbers (CD)
for sent in text_sent:
    filtered = [w for w in sent if w.lower() not in sw]
    print('filtered', filtered)
    # Get the part-of-speech tags of the remaining words
    tagged = nltk.pos_tag(filtered)
    print('Tagged', tagged)
    words = []
    for word in tagged:
        if word[1] != 'NNP' and word[1] != 'CD':
            words.append(word[0])
    print(words)

# POS tag set mapping (commented out)
# print(nltk.tag.tagset_mapping('ru-rnc', 'universal'))
The results of the operation are as follows:
Stop words ['he', 'only', 'because', 'and', 'each', 'myself', 'both']
Gutenberg files ['milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
unfiltered [['[', 'Paradise', 'Lost', 'by', 'John', 'Milton', '1667', ']'], ['Book', 'I']]
filtered ['[', 'Paradise', 'Lost', 'John', 'Milton', '1667', ']']
Tagged [('[', 'JJ'), ('Paradise', 'NNP'), ('Lost', 'NNP'), ('John', 'NNP'), ('Milton', 'NNP'), ('1667', 'CD'), (']', 'NN')]
['[', ']']
filtered ['Book']
Tagged [('Book', 'NN')]
['Book']
The set of tags used in this example:
{'PRP$',
 'PDT',
 'CD',
 'EX',
 '.',
 'NNS',
 'MD',
 'PRP',
 'RP',
 '(',
 'VBD',
 '``',
 "''",
 'NN',    # noun
 'LS',
 'VBN',
 'WRB',
 'IN',    # preposition
 'FW',
 'POS',
 'CC',    # conjunction
 ':',
 'DT',
 'VBZ',
 'RBS',
 'RBR',
 'WP$',
 'RB',
 'SYM',
 'JJS',
 'JJR',
 'UH',
 'WDT',
 '#',
 ',',
 ')',
 'VB',
 'NNPS',
 'VBP',   # verb
 'NNP',
 'JJ',    # adjective
 'WP',
 'VBG',
 '$',
 'TO'}    # the word "to"
These tags map roughly onto the following 12 universal part-of-speech categories (a mapping sketch follows this list):
'VERB',
'NOUN',
'PRON',
'ADJ',
'ADV',
'ADP',
'CONJ',
'DET',
'NUM',
'PRT',
'X',
'.'
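As a rough sketch of this mapping (the sample words are illustrative, not taken from the corpus), nltk.pos_tag accepts a tagset argument that returns the simplified universal tags directly:

import nltk

# Tag a few illustrative words with the simplified universal tag set
# (requires the 'universal_tagset' resource: nltk.download('universal_tagset'))
print(nltk.pos_tag(['Paradise', 'Lost', '1667'], tagset='universal'))
# Expected to give roughly: [('Paradise', 'NOUN'), ('Lost', 'NOUN'), ('1667', 'NUM')]

# The full Penn-Treebank-to-universal table can be inspected with
# nltk.tag.mapping.tagset_mapping('en-ptb', 'universal')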
3 The bag-of-words model
Installing scikit-learn (omitted)
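The installation is again omitted; a minimal sketch, assuming pip:

pip install scikit-learn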
The sample code is as follows:
import nltk
from sklearn.feature_extraction.text import CountVectorizer

# Load the following two files from the Gutenberg corpus
gb = nltk.corpus.gutenberg
hamlet = gb.raw('shakespeare-hamlet.txt')
macbeth = gb.raw('shakespeare-macbeth.txt')

# Remove English stop words while counting
cv = CountVectorizer(stop_words='english')

# Output part of the feature vectors
print('Feature vector', cv.fit_transform([hamlet, macbeth]).toarray())

# The features (words) are sorted in alphabetical order
print('Features', cv.get_feature_names()[:5])
The results of the operation are as follows:
Feature vector [[1 0 1 ..., 14 0 1]
[0 1 0 ..., 1 1 0]]
Features [' 1599 ', ' 1603 ', ' abhominably ', ' abhorred ', ' abide ']
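To make the bag-of-words idea concrete, here is a minimal sketch on two tiny made-up documents (the sentences are illustrative; newer scikit-learn versions expose get_feature_names_out instead of the get_feature_names used above):

from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents
docs = ['the cat sat on the mat', 'the dog sat on the log']

cv = CountVectorizer()
counts = cv.fit_transform(docs).toarray()

# Each row is a document, each column a word, each cell a word count
print(cv.get_feature_names_out())
print(counts)
# Roughly:
# ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
# [[1 0 0 1 1 1 2]
#  [0 1 1 0 1 1 2]]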
4 Frequency analysis
The sample code is as follows:
import nltk
import string

def print_line(values, num, key_or_value, tag):
    """Print the keys or values of the num most frequent elements of a
    frequency distribution.

    :param values: the frequency distribution (dict-like)
    :param num: number of elements to output
    :param key_or_value: 0 to print keys, 1 to print values
    :param tag: label for the output line
    """
    tmp_value = []
    for key in sorted(values.items(), key=lambda d: d[1], reverse=True)[:num]:
        tmp_value.append(key[key_or_value])
    print(tag, ':', tmp_value)

# Load the document
gb = nltk.corpus.gutenberg
words = gb.words('shakespeare-caesar.txt')

# Remove stop words and punctuation marks
sw = set(nltk.corpus.stopwords.words('english'))
punctuation = set(string.punctuation)
filtered = [w.lower() for w in words
            if w.lower() not in sw and w.lower() not in punctuation]

# Create a FreqDist object and output the most frequent words and counts
fd = nltk.FreqDist(filtered)
print_line(fd, 5, 0, 'Words')
print_line(fd, 5, 1, 'Counts')

# The most frequent word and its count
print('Max', fd.max())
print('Count', fd['caesar'])

# The most frequent bigrams and their counts
fd = nltk.FreqDist(nltk.bigrams(filtered))
print_line(fd, 5, 0, 'Bigrams')
print_line(fd, 5, 1, 'Counts')
print('Bigram Max', fd.max())
print('Bigram count', fd[('let', 'vs')])

# The most frequent trigrams and their counts
fd = nltk.FreqDist(nltk.trigrams(filtered))
print_line(fd, 5, 0, 'Trigrams')
print_line(fd, 5, 1, 'Counts')
print('Trigram Max', fd.max())
print('Trigram count', fd[('enter', 'lucius', 'luc')])
The results of the operation are as follows:
Words : ['caesar', 'brutus', 'bru', 'haue', 'shall']
Counts : [190, 161, 153, 148, 125]
Max caesar
Count 190
Bigrams : [('let', 'vs'), ('wee', 'l'), ('mark', 'antony'), ('marke', 'antony'), ('st', 'thou')]
Counts : [16, 15, 13, 12, 12]
Bigram Max ('let', 'vs')
Bigram count 16
Trigrams : [('enter', 'lucius', 'luc'), ('wee', 'l', 'heare'), ('thee', 'thou', 'st'), ('beware', 'ides', 'march'), ('let', 'vs', 'heare')]
Counts : [4, 4, 3, 3, 3]
Trigram Max ('enter', 'lucius', 'luc')
Trigram count 4
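As an aside, the same top-five lists can also be obtained with FreqDist's built-in most_common method, which pairs each word with its count (a minimal sketch reusing the filtered list and print_line helper above):

fd = nltk.FreqDist(filtered)
# Returns (word, count) pairs, e.g. [('caesar', 190), ('brutus', 161), ...]
print(fd.most_common(5))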
5 Naive Bayes classification
Naive Bayes is a probabilistic classification algorithm based on Bayes' theorem from probability theory and mathematical statistics.
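For reference, Bayes' theorem states that P(A|B) = P(B|A) * P(A) / P(B): the probability of class A given the observed evidence B is proportional to how often B occurs within class A, multiplied by the prior probability of A. The "naive" part of the name refers to the additional assumption that all features are conditionally independent of each other given the class.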
The sample code is as follows:
import nltk
import string
import random

# Stop word and punctuation sets
sw = set(nltk.corpus.stopwords.words('english'))
punctuation = set(string.punctuation)

# Use the word length as the feature
def word_features(word):
    return {'len': len(word)}

# Is the word a stop word or a punctuation mark?
def isstopword(word):
    return word in sw or word in punctuation

# Load the file
gb = nltk.corpus.gutenberg
words = gb.words('shakespeare-caesar.txt')

# Label each word according to whether it is a stop word
labeled_words = [(word.lower(), isstopword(word.lower())) for word in words]
random.seed()
random.shuffle(labeled_words)
print(labeled_words[:5])

# Use the length of each word as its feature value
# (n is the word itself, word is its stop-word label True/False)
featuresets = [(word_features(n), word) for (n, word) in labeled_words]

# Split 90% of the data into the training set, the rest into the test set
cutoff = int(.9 * len(featuresets))
train_set, test_set = featuresets[:cutoff], featuresets[cutoff:]

# Train a naive Bayes classifier and check how it classifies two sample words
classifier = nltk.NaiveBayesClassifier.train(train_set)
print("'behold' class", classifier.classify(word_features('behold')))
print("'the' class", classifier.classify(word_features('the')))

# Compute the classifier's accuracy on the test set
print('Accuracy', nltk.classify.accuracy(classifier, test_set))

# The features that contribute most to the decision
print(classifier.show_most_informative_features(5))
The results of the operation are as follows:
[('i', True), ('is', True), ('by', True), ('he', True), ('ambitious', False)]
'behold' class False
'the' class True
Accuracy 0.8521671826625387
Most Informative Features
    len = 7    False : True  =  77.8 : 1.0
    len = 6    False : True  =  52.2 : 1.0
    len = 1    True : False  =  51.8 : 1.0
    len = 2    True : False  =  10.9 : 1.0
    len = 5    False : True  =  10.9 : 1.0
None
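Besides the hard True/False decision, NLTK's NaiveBayesClassifier can also report per-class probabilities via prob_classify; a minimal sketch using the classifier trained above:

# Probability distribution over the two labels for one word
dist = classifier.prob_classify(word_features('behold'))
print(dist.prob(True), dist.prob(False))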
6 Sentiment analysis
The sample code is as follows:
import random
import string
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk import FreqDist
from nltk import NaiveBayesClassifier
from nltk.classify import accuracy

def get_elements_by_num(values, num, key_or_value):
    """Get the keys or values of the num most frequent elements of a
    frequency distribution.

    :param values: the frequency distribution (dict-like)
    :param num: number of elements
    :param key_or_value: 0 to return keys, 1 to return values
    :return: the list of keys or values
    """
    tmp_value = []
    for key in sorted(values.items(), key=lambda d: d[1], reverse=True)[:num]:
        tmp_value.append(key[key_or_value])
    return tmp_value

# Load the data
labeled_docs = [(