1. Get a text corpus
The NLTK library contains a large number of corpora, which are described in the following sections:
(1) Gutenberg Corpus: NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which currently hosts about 36,000 free e-books.
>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
To avoid typing the full path each time, import the corpus reader directly: from nltk.corpus import gutenberg
Write a short program that iterates through the Gutenberg file identifiers listed earlier and computes some statistics for each text:
import nltk
from nltk.corpus import gutenberg

for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))    # number of characters (including whitespace)
    num_words = len(gutenberg.words(fileid))  # number of word tokens
    num_sents = len(gutenberg.sents(fileid))  # number of sentences
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))  # number of distinct lowercased words
    print(int(num_chars / num_words), int(num_words / num_sents), int(num_words / num_vocab), fileid)
Results:
4 austen-emma.txt
4 austen-persuasion.txt
4 austen-sense.txt
4 bible-kjv.txt
4 5 blake-poems.txt
4 bryant-stories.txt
4 burgess-busterbrown.txt
4 carroll-alice.txt
4 chesterton-ball.txt
4 chesterton-brown.txt
4 chesterton-thursday.txt
4 edgeworth-parents.txt
4 melville-moby_dick.txt
4 milton-paradise.txt
4 8 shakespeare-caesar.txt
4 7 shakespeare-hamlet.txt
4 6 shakespeare-macbeth.txt
4 whitman-leaves.txt
The program computes three statistics for each text: the average word length, the average sentence length (in words), and the average number of times each vocabulary item occurs in the text.
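To make the three ratios concrete, here is a minimal sketch that computes them for a small in-memory string. It assumes a naive regular-expression tokenizer, and the helper name text_stats is hypothetical (it is not part of NLTK), so the numbers will differ from what gutenberg.words() and gutenberg.sents() produce; only the arithmetic is the same.

```python
import re

def text_stats(raw):
    """Return (avg word length, avg sentence length, avg occurrences per vocabulary item).

    Assumption: words are runs of letters/apostrophes and sentences end at
    '.', '!' or '?'. NLTK's corpus readers tokenize differently.
    """
    words = re.findall(r"[A-Za-z']+", raw)
    sents = [s for s in re.split(r"[.!?]+", raw) if s.strip()]
    vocab = set(w.lower() for w in words)
    return (
        int(len(raw) / len(words)),    # characters per word (raw length includes spaces)
        int(len(words) / len(sents)),  # words per sentence
        int(len(words) / len(vocab)),  # occurrences per distinct word
    )

sample = "The cat sat. The cat ran. The dog sat!"
print(text_stats(sample))  # (4, 3, 1)
```

As in the program above, the character count includes whitespace and punctuation, so the first value slightly overstates the true average word length.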
(2) Web and chat text:
This part of the collection represents less formal language.
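As a rough sketch of how this less formal material could be inspected, the helper below lists each file of a corpus reader together with the start of its raw text. It relies only on the fileids() and raw() methods; preview_corpus and width are hypothetical names, and the example assumes NLTK's webtext corpus has been installed (for instance via nltk.download('webtext')).

```python
def preview_corpus(reader, width=60):
    """Return (fileid, first `width` characters) for each file in a corpus reader."""
    return [(fileid, reader.raw(fileid)[:width]) for fileid in reader.fileids()]

if __name__ == "__main__":
    try:
        # Assumption: NLTK and its webtext data are installed locally.
        from nltk.corpus import webtext
        for fileid, start in preview_corpus(webtext):
            print(fileid, start, "...")
    except (ImportError, LookupError):
        # NLTK or the webtext data is missing; nothing to preview.
        print("webtext corpus not available")
```

Because the helper only calls fileids() and raw(), the same function should also work with the gutenberg reader.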