Chapter 9 Analyzing Text Data and Social Media
1 Installing NLTK (omitted)
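The installation itself is omitted here; a minimal sketch, assuming pip and a working Python environment, is to install the package with

pip install nltk

and then download the corpora and models this chapter relies on from a Python prompt:

import nltk
# Download the corpora and the POS tagger model used below (run once)
nltk.download('stopwords')
nltk.download('gutenberg')
nltk.download('averaged_perceptron_tagger')
nltk.download('movie_reviews')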
2 Filtering stop words, names, and numbers
The sample code is as follows:
import nltk

# Load the English stop word corpus
sw = set(nltk.corpus.stopwords.words('english'))
print('Stop words', list(sw)[:7])

# List part of the files in the Gutenberg corpus
gb = nltk.corpus.gutenberg
print('Gutenberg files', gb.fileids()[-5:])

# Take the first two sentences of milton-paradise.txt as the text to filter below
text_sent = gb.sents('milton-paradise.txt')[:2]
print('unfiltered', text_sent)

# Filter stop words, then drop proper nouns (NNP) and numbers (CD)
for sent in text_sent:
    filtered = [w for w in sent if w.lower() not in sw]
    print('filtered', filtered)
    # Get the part-of-speech tags of the remaining words
    tagged = nltk.pos_tag(filtered)
    print('Tagged', tagged)
    words = []
    for word in tagged:
        if word[1] != 'NNP' and word[1] != 'CD':
            words.append(word[0])
    print(words)

# POS tag set mapping (commented out)
# print(nltk.tag.tagset_mapping('ru-rnc', 'universal'))
The results of the operation are as follows:
Stop words ['he', 'only', 'because', 'and', 'each', 'myself', 'both']
Gutenberg files ['milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
unfiltered [['[', 'Paradise', 'Lost', 'by', 'John', 'Milton', '1667', ']'], ['Book', 'I']]
filtered ['[', 'Paradise', 'Lost', 'John', 'Milton', '1667', ']']
Tagged [('[', 'JJ'), ('Paradise', 'NNP'), ('Lost', 'NNP'), ('John', 'NNP'), ('Milton', 'NNP'), ('1667', 'CD'), (']', 'NN')]
['[', ']']
filtered ['Book']
Tagged [('Book', 'NN')]
['Book']
The set of tags used in this example:
{'PRP$',
 'PDT',
 'CD',
 'EX',
 '.',
 'NNS',
 'MD',
 'PRP',
 'RP',
 '(',
 'VBD',
 '``',
 "''",
 'NN',    # noun
 'LS',
 'VBN',
 'WRB',
 'IN',    # preposition
 'FW',
 'POS',
 'CC',    # conjunction
 ':',
 'DT',
 'VBZ',
 'RBS',
 'RBR',
 'WP$',
 'RB',
 'SYM',
 'JJS',
 'JJR',
 'UH',
 'WDT',
 '#',
 ',',
 ')',
 'VB',
 'NNPS',
 'VBP',   # verb
 'NNP',
 'JJ',    # adjective
 'WP',
 'VBG',
 '$',
 'TO'}    # the word "to"
These tags map roughly onto the following 12 universal part-of-speech categories (a mapping sketch follows this list):
'VERB',
'NOUN',
'PRON',
'ADJ',
'ADV',
'ADP',
'CONJ',
'DET',
'NUM',
'PRT',
'X',
'.'
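As a rough sketch of this mapping (the sample words are illustrative, not taken from the corpus), nltk.pos_tag accepts a tagset argument that returns the simplified universal tags directly:

import nltk

# Tag a few illustrative words with the simplified universal tag set
# (requires the 'universal_tagset' resource: nltk.download('universal_tagset'))
print(nltk.pos_tag(['Paradise', 'Lost', '1667'], tagset='universal'))
# Expected to give roughly: [('Paradise', 'NOUN'), ('Lost', 'NOUN'), ('1667', 'NUM')]

# The full Penn-Treebank-to-universal table can be inspected with
# nltk.tag.mapping.tagset_mapping('en-ptb', 'universal')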
3 The bag-of-words model
Installing scikit-learn (omitted)
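The installation is again omitted; a minimal sketch, assuming pip:

pip install scikit-learn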
The sample code is as follows:
import nltk
from sklearn.feature_extraction.text import CountVectorizer

# Load the following two files from the Gutenberg corpus
gb = nltk.corpus.gutenberg
hamlet = gb.raw('shakespeare-hamlet.txt')
macbeth = gb.raw('shakespeare-macbeth.txt')

# Remove English stop words while counting
cv = CountVectorizer(stop_words='english')

# Output part of the feature vectors
print('Feature vector', cv.fit_transform([hamlet, macbeth]).toarray())

# The features (words) are sorted in alphabetical order
print('Features', cv.get_feature_names()[:5])
The results of the operation are as follows:
Feature vector [[1 0 1 ..., 14 0 1]
[0 1 0 ..., 1 1 0]]
Features [' 1599 ', ' 1603 ', ' abhominably ', ' abhorred ', ' abide ']
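To make the bag-of-words idea concrete, here is a minimal sketch on two tiny made-up documents (the sentences are illustrative; newer scikit-learn versions expose get_feature_names_out instead of the get_feature_names used above):

from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents
docs = ['the cat sat on the mat', 'the dog sat on the log']

cv = CountVectorizer()
counts = cv.fit_transform(docs).toarray()

# Each row is a document, each column a word, each cell a word count
print(cv.get_feature_names_out())
print(counts)
# Roughly:
# ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
# [[1 0 0 1 1 1 2]
#  [0 1 1 0 1 1 2]]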
4 Frequency analysis
The sample code is as follows:
import nltk
import string

def print_line(values, num, key_or_value, tag):
    """Print the keys or values of the num most frequent elements of a
    frequency distribution.

    :param values: the frequency distribution (dict-like)
    :param num: number of elements to output
    :param key_or_value: 0 to print keys, 1 to print values
    :param tag: label for the output line
    """
    tmp_value = []
    for key in sorted(values.items(), key=lambda d: d[1], reverse=True)[:num]:
        tmp_value.append(key[key_or_value])
    print(tag, ':', tmp_value)

# Load the document
gb = nltk.corpus.gutenberg
words = gb.words('shakespeare-caesar.txt')

# Remove stop words and punctuation marks
sw = set(nltk.corpus.stopwords.words('english'))
punctuation = set(string.punctuation)
filtered = [w.lower() for w in words
            if w.lower() not in sw and w.lower() not in punctuation]

# Create a FreqDist object and output the most frequent words and counts
fd = nltk.FreqDist(filtered)
print_line(fd, 5, 0, 'Words')
print_line(fd, 5, 1, 'Counts')

# The most frequent word and its count
print('Max', fd.max())
print('Count', fd['caesar'])

# The most frequent bigrams and their counts
fd = nltk.FreqDist(nltk.bigrams(filtered))
print_line(fd, 5, 0, 'Bigrams')
print_line(fd, 5, 1, 'Counts')
print('Bigram Max', fd.max())
print('Bigram count', fd[('let', 'vs')])

# The most frequent trigrams and their counts
fd = nltk.FreqDist(nltk.trigrams(filtered))
print_line(fd, 5, 0, 'Trigrams')
print_line(fd, 5, 1, 'Counts')
print('Trigram Max', fd.max())
print('Trigram count', fd[('enter', 'lucius', 'luc')])
The results of the operation are as follows:
Words : ['caesar', 'brutus', 'bru', 'haue', 'shall']
Counts : [190, 161, 153, 148, 125]
Max caesar
Count 190
Bigrams : [('let', 'vs'), ('wee', 'l'), ('mark', 'antony'), ('marke', 'antony'), ('st', 'thou')]
Counts : [16, 15, 13, 12, 12]
Bigram Max ('let', 'vs')
Bigram count 16
Trigrams : [('enter', 'lucius', 'luc'), ('wee', 'l', 'heare'), ('thee', 'thou', 'st'), ('beware', 'ides', 'march'), ('let', 'vs', 'heare')]
Counts : [4, 4, 3, 3, 3]
Trigram Max ('enter', 'lucius', 'luc')
Trigram count 4
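As an aside, the same top-five lists can also be obtained with FreqDist's built-in most_common method, which pairs each word with its count (a minimal sketch reusing the filtered list and print_line helper above):

fd = nltk.FreqDist(filtered)
# Returns (word, count) pairs, e.g. [('caesar', 190), ('brutus', 161), ...]
print(fd.most_common(5))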
5 Naive Bayes classification
Naive Bayes is a probabilistic classification algorithm based on Bayes' theorem from probability theory and mathematical statistics.
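For reference, Bayes' theorem states that P(A|B) = P(B|A) * P(A) / P(B): the probability of class A given the observed evidence B is proportional to how often B occurs within class A, multiplied by the prior probability of A. The "naive" part of the name refers to the additional assumption that all features are conditionally independent of each other given the class.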
The sample code is as follows:
import nltk
import string
import random

# Stop word and punctuation sets
sw = set(nltk.corpus.stopwords.words('english'))
punctuation = set(string.punctuation)

# Use the word length as the feature
def word_features(word):
    return {'len': len(word)}

# Is the word a stop word or a punctuation mark?
def isstopword(word):
    return word in sw or word in punctuation

# Load the file
gb = nltk.corpus.gutenberg
words = gb.words('shakespeare-caesar.txt')

# Label each word according to whether it is a stop word
labeled_words = [(word.lower(), isstopword(word.lower())) for word in words]
random.seed()
random.shuffle(labeled_words)
print(labeled_words[:5])

# Use the length of each word as its feature value
# (n is the word itself, word is its stop-word label True/False)
featuresets = [(word_features(n), word) for (n, word) in labeled_words]

# Split 90% of the data into the training set, the rest into the test set
cutoff = int(.9 * len(featuresets))
train_set, test_set = featuresets[:cutoff], featuresets[cutoff:]

# Train a naive Bayes classifier and check how it classifies two sample words
classifier = nltk.NaiveBayesClassifier.train(train_set)
print("'behold' class", classifier.classify(word_features('behold')))
print("'the' class", classifier.classify(word_features('the')))

# Compute the classifier's accuracy on the test set
print('Accuracy', nltk.classify.accuracy(classifier, test_set))

# The features that contribute most to the decision
print(classifier.show_most_informative_features(5))
The results of the operation are as follows:
[('i', True), ('is', True), ('by', True), ('he', True), ('ambitious', False)]
'behold' class False
'the' class True
Accuracy 0.8521671826625387
Most Informative Features
    len = 7    False : True  =  77.8 : 1.0
    len = 6    False : True  =  52.2 : 1.0
    len = 1    True : False  =  51.8 : 1.0
    len = 2    True : False  =  10.9 : 1.0
    len = 5    False : True  =  10.9 : 1.0
None
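Besides the hard True/False decision, NLTK's NaiveBayesClassifier can also report per-class probabilities via prob_classify; a minimal sketch using the classifier trained above:

# Probability distribution over the two labels for one word
dist = classifier.prob_classify(word_features('behold'))
print(dist.prob(True), dist.prob(False))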
6 Sentiment analysis
The sample code is as follows:
import random
import string
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk import FreqDist
from nltk import NaiveBayesClassifier
from nltk.classify import accuracy

def get_elements_by_num(values, num, key_or_value):
    """Get the keys or values of the num most frequent elements of a
    frequency distribution.

    :param values: the frequency distribution (dict-like)
    :param num: number of elements
    :param key_or_value: 0 to return keys, 1 to return values
    :return: the list of keys or values
    """
    tmp_value = []
    for key in sorted(values.items(), key=lambda d: d[1], reverse=True)[:num]:
        tmp_value.append(key[key_or_value])
    return tmp_value

# Load the data
labeled_docs = [(