Python Data Analysis Learning Notes 9

Source: Internet
Author: User
Tags: nltk

Chapter 9: Analyzing Text Data and Social Media

1 Installing NLTK (omitted)
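The corpora and taggers used in the examples below have to be downloaded once. A minimal sketch; the exact list of packages is my assumption based on what this chapter uses:

import nltk

# One-time download of the corpora and models used in this chapter
for pkg in ['stopwords', 'gutenberg', 'punkt',
            'averaged_perceptron_tagger', 'universal_tagset', 'movie_reviews']:
    nltk.download(pkg)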

2 Filtering stop words, names, and numbers

The sample code is as follows:

import nltk

# Load the English stop-word corpus
sw = set(nltk.corpus.stopwords.words('english'))
print('Stop words', list(sw)[:7])

# List some of the files in the Gutenberg corpus
gb = nltk.corpus.gutenberg
print('Gutenberg files', gb.fileids()[-5:])

# Take the first two sentences of milton-paradise.txt as the text to filter below
text_sent = gb.sents('milton-paradise.txt')[:2]
print('Unfiltered', text_sent)

# Filter out the stop words
for sent in text_sent:
    filtered = [w for w in sent if w.lower() not in sw]
    print('Filtered', filtered)
    # Part-of-speech tag the remaining words
    tagged = nltk.pos_tag(filtered)
    print('Tagged', tagged)
    # Drop proper nouns (NNP) and cardinal numbers (CD)
    words = []
    for word in tagged:
        if word[1] != 'NNP' and word[1] != 'CD':
            words.append(word[0])
    print(words)

# Mapping between POS tag sets
# print(nltk.tag.tagset_mapping('ru-rnc', 'universal'))

The results of the operation are as follows:

Stop words ['he', 'only', 'because', 'and', 'each', 'myself', 'both']
Gutenberg files ['milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
Unfiltered [['[', 'Paradise', 'Lost', 'by', 'John', 'Milton', '1667', ']'], ['Book', 'I']]
Filtered ['[', 'Paradise', 'Lost', 'John', 'Milton', '1667', ']']
Tagged [('[', 'JJ'), ('Paradise', 'NNP'), ('Lost', 'NNP'), ('John', 'NNP'), ('Milton', 'NNP'), ('1667', 'CD'), (']', 'NN')]
['[', ']']
Filtered ['Book']
Tagged [('Book', 'NN')]
['Book']

The set of tags used in this example:

{'PRP$', 'PDT', 'CD', 'EX', '.', 'NNS', 'MD', 'PRP', 'RP', '(', 'VBD', '``', "''", 'NN', 'LS', 'VBN', 'WRB', 'IN', 'FW', 'POS', 'CC', ':', 'DT', 'VBZ', 'RBS', 'RBR', 'WP$', 'RB', 'SYM', 'JJS', 'JJR', 'UH', 'WDT', '#', ',', ')', 'VB', 'NNPS', 'VBP', 'NNP', 'JJ', 'WP', 'VBG', '$', 'TO'}

Among these, 'NN' is a noun, 'IN' a preposition, 'CC' a conjunction, 'VBP' a verb, 'JJ' an adjective, and 'TO' the word "to".
These map roughly onto the following 12 universal categories:

{'VERB', 'NOUN', 'PRON', 'ADJ', 'ADV', 'ADP', 'CONJ', 'DET', 'NUM', 'PRT', 'X', '.'}
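Instead of mapping tags by hand, pos_tag() can emit these universal tags directly via its tagset argument. A minimal sketch, reusing the filtered word list from the example above (the printed result is only indicative):

# Tag with the simplified universal tag set instead of the Penn Treebank tags
filtered = ['[', 'Paradise', 'Lost', 'John', 'Milton', '1667', ']']
print(nltk.pos_tag(filtered, tagset='universal'))
# e.g. [('[', 'ADJ'), ('Paradise', 'NOUN'), ('Lost', 'NOUN'), ...]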

3 Bag-of-words model

Installing scikit-learn (omitted)

The sample code is as follows:


import nltk
from sklearn.feature_extraction.text import CountVectorizer

# Load the following two files from the Gutenberg corpus
gb = nltk.corpus.gutenberg
hamlet = gb.raw('shakespeare-hamlet.txt')
macbeth = gb.raw('shakespeare-macbeth.txt')

# Remove English stop words
cv = CountVectorizer(stop_words='english')
# Print part of the feature vector
print('Feature vector', cv.fit_transform([hamlet, macbeth]).toarray())
# Feature names are sorted alphabetically
print('Features', cv.get_feature_names()[:5])

The results of the operation are as follows:

Feature vector [[ 1  0  1 ..., 14  0  1]
 [ 0  1  0 ...,  1  1  0]]
Features ['1599', '1603', 'abhominably', 'abhorred', 'abide']
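To see which word each column of the matrix corresponds to, the feature names can be zipped with a count row. A minimal sketch; note that newer scikit-learn versions replace get_feature_names() with get_feature_names_out():

counts = cv.fit_transform([hamlet, macbeth]).toarray()
names = cv.get_feature_names()  # get_feature_names_out() on newer scikit-learn
# The five most frequent non-stop words in Hamlet with their counts
print(sorted(zip(names, counts[0]), key=lambda p: p[1], reverse=True)[:5])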

4 Frequency analysis

The sample code is as follows:

import string
import nltk

def print_line(values, num, key_or_value, tag):
    """Print the keys or values of the num most frequent items in values.
    :param values: a frequency distribution (dict-like)
    :param num: number of items to print
    :param key_or_value: 0 prints the keys, 1 prints the values
    :param tag: label for the output line
    """
    tmp = []
    for item in sorted(values.items(), key=lambda d: d[1], reverse=True)[:num]:
        tmp.append(item[key_or_value])
    print(tag, ':', tmp)

# Load the document
gb = nltk.corpus.gutenberg
words = gb.words('shakespeare-caesar.txt')

# Filter out stop words and punctuation
sw = set(nltk.corpus.stopwords.words('english'))
punctuation = set(string.punctuation)
filtered = [w.lower() for w in words
            if w.lower() not in sw and w.lower() not in punctuation]

# Create a FreqDist object and print the most frequent words and their counts
fd = nltk.FreqDist(filtered)
print_line(fd, 5, 0, 'Words')
print_line(fd, 5, 1, 'Counts')
# The single most frequent word and its count
print('Max', fd.max())
print('Count', fd['caesar'])

# The most common bigrams and their counts
fd = nltk.FreqDist(nltk.bigrams(filtered))
print_line(fd, 5, 0, 'Bigrams')
print_line(fd, 5, 1, 'Counts')
print('Bigram max', fd.max())
print('Bigram count', fd[('let', 'vs')])

# The most common trigrams and their counts
fd = nltk.FreqDist(nltk.trigrams(filtered))
print_line(fd, 5, 0, 'Trigrams')
print_line(fd, 5, 1, 'Counts')
print('Trigram max', fd.max())
print('Trigram count', fd[('enter', 'lucius', 'luc')])

The results of the operation are as follows:

Words : ['caesar', 'brutus', 'bru', 'haue', 'shall']
Counts : [190, 161, 153, 148, 125]
Max caesar
Count 190
Bigrams : [('let', 'vs'), ('wee', 'l'), ('mark', 'antony'), ('marke', 'antony'), ('st', 'thou')]
Counts : [16, 15, 13, 12, 12]
Bigram max ('let', 'vs')
Bigram count 16
Trigrams : [('enter', 'lucius', 'luc'), ('wee', 'l', 'heare'), ('thee', 'thou', 'st'), ('beware', 'ides', 'march'), ('let', 'vs', 'heare')]
Counts : [4, 4, 3, 3, 3]
Trigram max ('enter', 'lucius', 'luc')
Trigram count 4
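FreqDist also provides a most_common() method that returns the word/count pairs directly, so the helper above is mainly for illustration. A minimal sketch using the same filtered list:

fd = nltk.FreqDist(filtered)
# [('caesar', 190), ('brutus', 161), ('bru', 153), ('haue', 148), ('shall', 125)]
print(fd.most_common(5))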

5 Naive Bayes classification

Naive Bayes is a probabilistic classification algorithm based on Bayes' theorem from probability theory and mathematical statistics; it is called "naive" because it assumes the features are independent of one another.
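For reference, Bayes' theorem relates the posterior probability of a class C given an observed feature F to quantities that can be estimated from the training data:

P(C \mid F) = \frac{P(F \mid C)\, P(C)}{P(F)}

The classifier simply picks the class with the largest posterior for the observed features.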

The sample code is as follows:

import nltk
import string
import random

# Stop-word and punctuation sets
sw = set(nltk.corpus.stopwords.words('english'))
punctuation = set(string.punctuation)

# Use the word length as the feature
def word_features(word):
    return {'len': len(word)}

# Is the word a stop word or a punctuation mark?
def isStopword(word):
    return word in sw or word in punctuation

# Load the file
gb = nltk.corpus.gutenberg
words = gb.words('shakespeare-caesar.txt')

# Label each word with whether it is a stop word
labeled_words = [(word.lower(), isStopword(word.lower())) for word in words]
random.seed()
random.shuffle(labeled_words)
print(labeled_words[:5])

# Use the length of each word as its feature value
featuresets = [(word_features(n), word) for (n, word) in labeled_words]

# Train a naive Bayes classifier
cutoff = int(.9 * len(featuresets))
# Split into training and test sets
train_set, test_set = featuresets[:cutoff], featuresets[cutoff:]

# Check the classifier on a couple of words
classifier = nltk.NaiveBayesClassifier.train(train_set)
print("'behold' class", classifier.classify(word_features('behold')))
print("'the' class", classifier.classify(word_features('the')))

# Compute the classifier's accuracy on the test set
print('Accuracy', nltk.classify.accuracy(classifier, test_set))
# Features that contribute the most
print(classifier.show_most_informative_features(5))

The results of the operation are as follows:

[('i', True), ('is', True), ('by', True), ('he', True), ('ambitious', False)]
'behold' class False
'the' class True
Accuracy 0.8521671826625387
Most Informative Features
    len = 7    False : True = 77.8 : 1.0
    len = 6    False : True = 52.2 : 1.0
    len = 1    True : False = 51.8 : 1.0
    len = 2    True : False = 10.9 : 1.0
    len = 5    False : True = 10.9 : 1.0

None

(The trailing None appears because show_most_informative_features() prints its table itself and returns None, which the enclosing print() then outputs.)

6 Sentiment analysis

The sample code is as follows:

import random
import string
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk import FreqDist
from nltk import NaiveBayesClassifier
from nltk.classify import accuracy

def getElementsByNum(values, num, key_or_value):
    """Get the keys or values of the num most frequent items in values.
    :param values: a frequency distribution (dict-like)
    :param num: number of items to return
    :param key_or_value: 0 returns the keys, 1 returns the values
    """
    tmp = []
    for item in sorted(values.items(), key=lambda d: d[1], reverse=True)[:num]:
        tmp.append(item[key_or_value])
    return tmp

# Load the data
labeled_docs = [(
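The listing above breaks off mid-statement in the source. For orientation only, here is a minimal, independent sketch of how a movie_reviews sentiment classifier along these lines is typically assembled; it is a reconstruction under that assumption, not the author's original code:

import random
import nltk
from nltk.corpus import movie_reviews

# Each document is (list of words, 'pos'/'neg' label)
labeled_docs = [(list(movie_reviews.words(fileid)), category)
                for category in movie_reviews.categories()
                for fileid in movie_reviews.fileids(category)]
random.seed(42)
random.shuffle(labeled_docs)

# Use the 2000 most frequent words in the corpus as candidate features
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
top_words = [w for w, _ in all_words.most_common(2000)]

# A document's features record which of the top words it contains
def doc_features(doc):
    doc_words = set(w.lower() for w in doc)
    return {'contains(%s)' % w: (w in doc_words) for w in top_words}

featuresets = [(doc_features(d), c) for (d, c) in labeled_docs]
cutoff = int(.9 * len(featuresets))
train_set, test_set = featuresets[:cutoff], featuresets[cutoff:]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print('Accuracy', nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)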