[TOC]
Corpus basic function table

| Example | Description |
| --- | --- |
| `fileids()` | Files in the corpus |
| `fileids([categories])` | Corpus files belonging to the given categories |
| `categories()` | Categories of the corpus |
| `categories([fileids])` | Categories covering the given files |
| `raw(fileids=[f1,f2,...], categories=[c1,c2,...])` | Raw content of the given files/categories; both parameters may be omitted |
| `words(fileids=[f1,f2,...], categories=[c1,c2,...])` | Words of the given files/categories; both parameters may be omitted |
| `sents()` | Sentences of the whole corpus |
| `sents(fileids=[f1,f2,...], categories=[c1,c2,...])` | Sentences of the given files/categories |
| `abspath(fileid)` | Location of the file on disk |
| `encoding(fileid)` | Encoding of the file |
| `open(fileid)` | Open a stream for reading the file |
| `root()` | Path to the root of the locally installed corpus |
| `readme()` | Contents of the README file |
Text corpus classification
- The simplest is an isolated collection of texts
- Texts grouped into categories by genre, such as the Brown Corpus
- Categories that are not strictly separated and may overlap, such as the Reuters Corpus
- Texts that change over time/language use, such as the Inaugural Address Corpus
Common corpora and their usage
Note: `nltk.Text(tokens)` takes a list of tokens and returns a Text object like the `text1` used in the NLTK book.
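For instance, wrapping a small made-up token list in `nltk.Text` gives list-like access plus methods such as `count()` (a minimal sketch; no corpus download needed):

```python
import nltk

# nltk.Text wraps a list of tokens, not a raw string
tokens = ['hello', 'world', 'hello', 'nltk']
t = nltk.Text(tokens)
print(t.count('hello'))  # how often a token occurs
print(t[:2])             # Text supports indexing/slicing like a list
```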
Gutenberg Corpus
Contains about 36,000 e-books, which can be downloaded from the Project Gutenberg site.
```python
import nltk
from nltk.corpus import gutenberg

print(gutenberg.fileids())
emma = gutenberg.words('austen-emma.txt')
print(gutenberg.raw('austen-emma.txt'))
emma = nltk.Text(emma)
#print(emma[:10])
```
Web and Chat Text
Web text is mostly informal: forum discussions, film scripts, reviews, and so on. The chat text is split into 15 files by chat room (each file name includes the date, the chat room, and the number of posts).
```python
# Web text: webtext
from nltk.corpus import webtext

for fileid in webtext.fileids():
    print(fileid, webtext.raw(fileid)[:50])
```
```
[out]
firefox.txt Cookie Manager: "Don't allow sites that set remove
grail.txt SCENE 1: [wind] [clop clop clop] KING ARTHUR: Who
overheard.txt White guy: So, do you have any plans for this even
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted
singles.txt 25 SEXY MALE, seeks attrac older single lady, for
wine.txt Lovely delicate, fragrant Rhone wine. Polished lea
```
```python
# Chat text: nps_chat
from nltk.corpus import nps_chat

chatroom = nps_chat.posts('10-19-20s_706posts.xml')
print(chatroom[123:125])
```
```
[out]
[['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.'], ['hi', 'U64']]
```
Brown Corpus
A million-word corpus, sorted by genre such as news, editorial, and so on.
```python
from nltk.corpus import brown

print(brown.categories())
print(brown.fileids())
```
Because this corpus was built as a resource for studying systematic differences between genres, we can use it to compare the use of modal verbs across genres.
```python
import nltk
from nltk.corpus import brown

news = brown.words(categories='news')
fdist = nltk.FreqDist([w.lower() for w in news])
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
    print(m, ':', fdist[m])
```
Reuters Corpus
The news documents are split into a "training" set and a "test" set, which makes training and testing convenient. File names have the form 'test/number' and 'training/number'.
```python
from nltk.corpus import reuters

print(reuters.fileids())
print(reuters.categories())
```
Inaugural Address Corpus
A collection of U.S. presidential inaugural addresses. Because each file name has the form 'year-name.txt', we can extract the time dimension and draw a line chart of how often particular words occur in different eras.
```python
import nltk
from nltk.corpus import inaugural

print(list(f[:4] for f in inaugural.fileids()))
# How the use of "america" and "citizen" changes over time
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))
cfd.plot()
```
(The resulting plot is attached in the original post.)
Loading a custom corpus
If you want to use the methods above on your own corpus, you need the `PlaintextCorpusReader` class to load it. It takes two parameters: the root directory and a pattern for the files under it (regular expressions can be used to match).
```python
from nltk.corpus import PlaintextCorpusReader

root = r'C:\Users\Asura-Dong\Desktop\tem\dict'
wordlist = PlaintextCorpusReader(root, '.*')  # match all files
print(wordlist.fileids())
print(wordlist.words('tem1.txt'))
```
Output:
```
['README', 'tem1.txt']
['hello', 'world']
```
Dictionary resources
A dictionary includes part-of-speech and gloss information for each word.
Stopwords Corpus
The stopwords corpus; unfortunately there is no Chinese stopword list.
```python
import nltk
from nltk.corpus import stopwords

# Compute the fraction of words that are not in the stopword list
def content(text):
    stopwords_eng = stopwords.words('english')
    content = [w for w in text if w.lower() not in stopwords_eng]
    return len(content) / len(text)

print(content(nltk.corpus.reuters.words()))
```
Names Corpus
Two files of English first names, male and female. Here we look at the relationship between the last letter of a name and gender.
```python
import nltk

names = nltk.corpus.names
print(names.fileids())
male = names.words('male.txt')
female = names.words('female.txt')
cfd = nltk.ConditionalFreqDist(
    (fileid, name[-1])
    for fileid in names.fileids()
    for name in names.words(fileid))
cfd.plot()
```
(The resulting plot is attached in the original post.)
Pronunciation Dictionary
Even more interesting: this one is designed for speech synthesis. After finishing the book, it is worth thinking about how this could be adapted to Chinese. The dictionary lives at `nltk.corpus.cmudict`.
Once imported, we can get each word's phoneme sequence, which lets us find rhyming words.
```python
import nltk

entries = nltk.corpus.cmudict.entries()
print('Example:', entries[0])
s = ['N', 'IH0', 'K', 'S']  # note: IH0 ends with the digit zero
word_list = [word for word, pron in entries if pron[-4:] == s]
print(word_list)
```
In the phoneme table we find the digits 1, 2, and 0. They mark primary stress, secondary stress, and no stress respectively.
Here we can define a function to find words with a particular stress pattern.
```python
def func(pron):
    return [char for phone in pron for char in phone if char.isdigit()]

word_list = [w for w, pron in entries if func(pron) == ['0', '1', '0', '2', '0']]
print(word_list)
```
WordNet: a semantically oriented English dictionary
This dictionary has to be saved for last. WordNet is an English dictionary grounded in cognitive linguistics, designed by psychologists, linguists, and computer engineers at Princeton University. Rather than simply arranging words in alphabetical order, it organizes them by meaning into a "network of words".
Introduction and synsets
Motorcar and automobile are synonyms, which we can study with the help of WordNet.
```python
from nltk.corpus import wordnet as wn

print(wn.synsets('motorcar'))
```
The result is `[Synset('car.n.01')]`, meaning motorcar has only one possible sense. car.n.01 is called a "synset" (synonym set). We can use `wn.synset('car.n.01').lemma_names()` to see the other words in this synset (car has many synonyms), and `wn.synset('car.n.01').definition()` and `wn.synset('car.n.01').examples()` to view the definition and example sentences. (In NLTK 3 under Python 3 these are methods, so the attribute form without parentheses no longer works.)
An entry at the next level, such as car.n.01.automobile, is called a lemma.
For lemma-level objects, the following operations are available:
```python
print(wn.synset('car.n.01').lemmas())
print(wn.lemma('car.n.01.automobile').name())
print(wn.lemma('car.n.01.automobile').synset())
```
Hypernyms, hyponyms, and antonyms
A hypernym is a word whose concept is broader than that of a given word. For example, "plant" is a hypernym of "flower", and "music" is a hypernym of "MP3". The converse relation is the hyponym.
Hyponyms and root hypernyms are accessed with `hyponyms()` and `root_hypernyms()`.
```python
motorcar = wn.synset('car.n.01').hyponyms()    # hyponyms
car = wn.synset('car.n.01').root_hypernyms()   # root hypernym
```
Antonyms are accessed with `antonyms()`, which is defined on lemmas rather than synsets.
Other synset relations
So far the relations have run from general to specific, or vice versa. Part-whole relations are just as important: the relation between a tree and its crown or trunk is given by `part_meronyms()`; a tree is a member of a forest, `member_holonyms()`; and the substance of a tree is its heartwood and sapwood, `substance_meronyms()`.
Semantic similarity
When two words share a hypernym (found by searching the word tree), and that hypernym sits low in the tree (i.e. is specific), the two words are closely related.
```python
right = wn.synset('right_whale.n.01')
orca = wn.synset('orca.n.01')
print(right.lowest_common_hypernyms(orca))
```
Of course, a tree structure always has depth, and `min_depth()` gives the minimum depth of a synset. Based on this, `path_similarity()` returns a similarity in the range 0-1. For the code above: `right.path_similarity(orca)`.
The numbers themselves mean little, but when a right whale is compared with ever more distant things, such as other whales or even a novel, the numbers shrink, so comparing their magnitudes is meaningful.
NLTK Study Notes (II): Text, Corpus Resources and WordNet Summary