NLTK Study Notes (II): Text Corpus Resources and WordNet Summary

Source: Internet
Author: User
Tags: nltk


Basic corpus functions

Example                                                 Description
fileids()                                               Files in the corpus
fileids([categories])                                   Files of the corpus that belong to the given categories
categories()                                            Categories of the corpus
categories([fileids])                                   Categories of the corpus that the given files belong to
raw(fileids=[f1, f2, ...], categories=[c1, c2, ...])    Raw content of the given files or categories; the arguments may be omitted
words(fileids=[f1, f2, ...], categories=[c1, c2, ...])  Words of the given files or categories; the arguments may be omitted
sents(fileids=[f1, f2, ...], categories=[c1, c2, ...])  Sentences of the given files or categories; the arguments may be omitted
abspath(fileid)                                         Location of the file on disk
encoding(fileid)                                        Encoding of the file
open(fileid)                                            Open a stream for reading the file
root                                                    Path to the root of the locally installed corpus (a property in NLTK 3)
readme()                                                Contents of the corpus README file
Text corpus structures
    1. Isolated: the simplest kind, just a collection of texts with no particular organization
    2. Categorized: texts are grouped by labels such as genre, for example the Brown Corpus
    3. Overlapping: categories are not strictly separated and may overlap, for example the Reuters Corpus
    4. Temporal: language use changes over time, for example the Inaugural Address Corpus
Common Corpora and Their Usage

Note: nltk.Text(tokens) takes a list of tokens (not a raw string) and returns a text object like text1.

Gutenberg Corpus

Project Gutenberg contains about 36,000 e-books available for download; NLTK includes a small selection of them.

    import nltk
    from nltk.corpus import gutenberg

    print(gutenberg.fileids())
    emma = gutenberg.words('austen-emma.txt')
    print(gutenberg.raw('austen-emma.txt'))
    emma = nltk.Text(emma)
    # print(emma[:10])

Web and Chat Text

Web text is mainly informal writing: forum discussions, scripts, reviews and so on. The chat text is split into 15 large files by chat room; each file name encodes the date, the chat room, and the number of posts.

    # Web text: webtext
    from nltk.corpus import webtext

    for fileid in webtext.fileids():
        print(fileid, webtext.raw(fileid)[:50])

[out]
    firefox.txt Cookie Manager: "Don't allow sites that set remove
    grail.txt SCENE 1: [wind] [clop clop clop] KING ARTHUR: Who
    overheard.txt White guy: So, do you have any plans for this even
    pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted
    singles.txt 25 SEXY MALE, seeks attrac older single lady, for
    wine.txt Lovely delicate, fragrant Rhone wine. Polished lea

    # Chat text: nps_chat
    from nltk.corpus import nps_chat

    chatroom = nps_chat.posts('10-19-20s_706posts.xml')
    print(chatroom[123:125])

[out]
    [['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.'], ['hi', 'U64']]

Brown Corpus

A corpus of about one million words. Its texts are categorized by genre, such as news, editorial, and so on.

    from nltk.corpus import brown

    print(brown.categories())
    print(brown.fileids())

Because this corpus is a resource for studying systematic differences between genres, we can use it to compare how modal verbs are used in different kinds of text.

    import nltk
    from nltk.corpus import brown

    news = brown.words(categories='news')
    fdist = nltk.FreqDist([w.lower() for w in news])
    modals = ['can', 'could', 'may', 'might', 'must', 'will']
    for m in modals:
        print(m, ':', fdist[m])
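
To compare genres rather than look at news alone, the same count can be repeated per category. The sketch below uses only the standard library and a tiny made-up sample so that it runs without any corpus data. `modal_counts` and `sample` are hypothetical names of my own; with NLTK one would presumably pass something like `{g: brown.words(categories=g) for g in brown.categories()}` instead of the toy dictionary.

```python
from collections import Counter

MODALS = ['can', 'could', 'may', 'might', 'must', 'will']

def modal_counts(words_by_genre, modals=MODALS):
    """Return {genre: {modal: count}} for a mapping of genre -> tokens."""
    table = {}
    for genre, words in words_by_genre.items():
        counts = Counter(w.lower() for w in words)
        table[genre] = {m: counts[m] for m in modals}
    return table

# Made-up toy data, NOT taken from the Brown Corpus.
sample = {
    'news': 'The board will meet and may vote ; it will not wait'.split(),
    'romance': 'She could not stay , but she might return if she could'.split(),
}

for genre, row in modal_counts(sample).items():
    print(genre, row)
```

Keeping the function independent of NLTK means any mapping from a label to a token sequence can be fed in.
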
Reuters Corpus

The news documents are split into two groups, "training" and "test", which makes it convenient to train and evaluate models. File names have the form 'test/number' and 'training/number'.

    from nltk.corpus import reuters

    print(reuters.fileids())
    print(reuters.categories())

Inaugural Address Corpus

A collection of U.S. presidential inaugural addresses. Because the file names follow the 'year-name.txt' format, we can extract the year and draw a line chart of how often particular words occur in different eras.

    import nltk
    from nltk.corpus import inaugural

    print([fileid[:4] for fileid in inaugural.fileids()])

    # Show how the use of "america" and "citizen" changes over time
    cfd = nltk.ConditionalFreqDist(
        (target, fileid[:4])
        for fileid in inaugural.fileids()
        for w in inaugural.words(fileid)
        for target in ['america', 'citizen']
        if w.lower().startswith(target))
    cfd.plot()

(Figure: line chart of the counts of words starting with "america" and "citizen", by inaugural year.)
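
As a rough illustration of what the generator expression feeds into ConditionalFreqDist, the sketch below builds the same (target, year) counts with the standard library. The file IDs and word lists are invented stand-ins that merely follow the 'year-name.txt' pattern, and `count_targets` is a hypothetical helper, not part of NLTK.

```python
from collections import Counter, defaultdict

def count_targets(words_by_fileid, targets=('america', 'citizen')):
    """Count, per target prefix, how many words in each year start with it."""
    counts = defaultdict(Counter)
    for fileid, words in words_by_fileid.items():
        year = fileid[:4]  # e.g. '1789-Example.txt' -> '1789'
        for w in words:
            for target in targets:
                if w.lower().startswith(target):
                    counts[target][year] += 1
    return counts

# Invented stand-in data; real addresses are far longer.
toy = {
    '1789-Example.txt': ['Fellow', 'Citizens', 'of', 'the', 'Senate'],
    '2009-Example.txt': ['America', ',', 'Americans', 'and', 'citizens'],
}

counts = count_targets(toy)
for target in sorted(counts):
    print(target, sorted(counts[target].items()))
```

Because the match is by prefix, "Americans" counts toward "america" and "citizens" toward "citizen", which is what the startswith test in the original code does as well.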

Loading a custom Corpus

To work with your own corpus using the methods above, load it with the PlaintextCorpusReader class. Its constructor takes two main arguments: the first is the root directory, and the second specifies the files under it (a regular expression can be used to match them).

    from nltk.corpus import PlaintextCorpusReader

    root = r'C:\Users\Asura-Dong\Desktop\tem\dict'
    wordlist = PlaintextCorpusReader(root, '.*')  # match all files
    print(wordlist.fileids())
    print(wordlist.words('tem1.txt'))

Output:
    ['README', 'tem1.txt']
    ['hello', 'world']

Dictionary Resources

A dictionary (lexicon) lists words together with information such as part of speech and definitions.

Stopwords Corpus

That is, stopwords. Unfortunately, when these notes were written it did not include a Chinese stop-word list.

    import nltk
    from nltk.corpus import stopwords

    # Compute the fraction of words in a text that are not in the stop-word list
    def content(text):
        stopwords_eng = set(stopwords.words('english'))
        content = [w for w in text if w.lower() not in stopwords_eng]
        return len(content) / len(text)

    print(content(nltk.corpus.reuters.words()))
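
The filter presumably needs to lowercase each token before looking it up, since the stop-word list is lowercase. Here is a minimal sketch with a hand-made stop list (an assumption for illustration only, far smaller than NLTK's English list); `content_fraction` is a hypothetical name.

```python
def content_fraction(tokens, stop_words):
    """Fraction of tokens that are not stop words, compared case-insensitively."""
    if not tokens:
        return 0.0
    kept = [w for w in tokens if w.lower() not in stop_words]
    return len(kept) / len(tokens)

# Hand-made stop list for illustration only.
toy_stop_words = {'the', 'a', 'of', 'is', 'on'}
tokens = ['The', 'price', 'of', 'oil', 'is', 'rising']

print(content_fraction(tokens, toy_stop_words))  # 0.5

# Without lowercasing, 'The' slips through and is counted as a content word:
print(len([w for w in tokens if w not in toy_stop_words]) / len(tokens))
```

Using a set for the stop words keeps each lookup cheap, which matters on a corpus the size of Reuters.
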
Names Corpus

It has two parts: male and female English first names. Here we look at the relationship between the last letter of a name and gender.

    import nltk

    names = nltk.corpus.names
    print(names.fileids())
    male = names.words('male.txt')
    female = names.words('female.txt')
    cfd = nltk.ConditionalFreqDist(
        (fileid, name[-1])
        for fileid in names.fileids()
        for name in names.words(fileid))
    cfd.plot()

(Figure: counts of final letters for the names in female.txt and male.txt.)
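
To make the shape of that conditional distribution concrete, here is a small standard-library sketch on a hand-picked sample (not the contents of the Names corpus). `last_letter_counts` is a hypothetical helper; unlike the code above it also lowercases the final letter, which is my own addition.

```python
from collections import Counter

def last_letter_counts(names_by_file):
    """Map each file ID to a Counter of the final letters of its names."""
    return {
        fileid: Counter(name[-1].lower() for name in names if name)
        for fileid, names in names_by_file.items()
    }

# Tiny hand-picked sample for illustration only.
toy = {
    'female.txt': ['Anna', 'Maria', 'Julia', 'Karen'],
    'male.txt': ['John', 'Kevin', 'Mark', 'Joshua'],
}

for fileid, counts in last_letter_counts(toy).items():
    print(fileid, counts.most_common())
```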

Pronunciation Dictionary

This one is even more remarkable: it was designed for speech synthesis. After finishing the book, it would be worth thinking about how to carry this over to Chinese.

After importing nltk.corpus.cmudict, we can get the phoneme sequence of each word, so that rhyming words can be found.

    import nltk

    s = ['N', 'IH0', 'K', 'S']
    entries = nltk.corpus.cmudict.entries()
    print('Example:', entries[0])
    word_list = [word for word, pron in entries if pron[-4:] == s]
    print(word_list)
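
Each entry is a (word, list of phones) pair, and the stress digit is part of the vowel symbol, so the ending to match has to be spelled with the digit zero ('IH0'), not the letter O. A small sketch on hand-typed entries in CMU style (treat the transcriptions as illustrative rather than as copies of the real dictionary; `rhymes_with` is a hypothetical helper):

```python
def rhymes_with(entries, ending):
    """Return words whose pronunciation ends with the given phone sequence."""
    n = len(ending)
    return [word for word, pron in entries if pron[-n:] == ending]

# Hand-typed entries in CMU style, for illustration only.
toy_entries = [
    ('avionics', ['EY2', 'V', 'IY0', 'AA1', 'N', 'IH0', 'K', 'S']),
    ('electronics', ['IH2', 'L', 'EH0', 'K', 'T', 'R', 'AA1', 'N', 'IH0', 'K', 'S']),
    ('tonic', ['T', 'AA1', 'N', 'IH0', 'K']),
    ('fire', ['F', 'AY1', 'ER0']),
]

print(rhymes_with(toy_entries, ['N', 'IH0', 'K', 'S']))
```

Taking the slice length from the ending itself avoids hard-coding the 4 used in the original snippet.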

In the phoneme lists we find the digits 1, 2, and 0. They denote primary stress, secondary stress, and no stress, respectively.
Here we can define a function to find words with a particular stress pattern.

    import nltk

    entries = nltk.corpus.cmudict.entries()

    # Collect the stress digits of a pronunciation, in order
    def func(pron):
        return [char for phone in pron for char in phone if char.isdigit()]

    word_list = [w for w, pron in entries if func(pron) == ['0', '1', '0', '2', '0']]
    print(word_list)
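
To see what the stress-extraction function returns for a single word, here is the same idea applied to two hand-typed pronunciations (CMU style, illustrative only). The name `stress` is my own choice for the helper.

```python
def stress(pron):
    """Extract the stress digits from a list of phones, in order."""
    return [char for phone in pron for char in phone if char.isdigit()]

# Hand-typed entries for illustration only.
toy_entries = [
    ('abbreviated', ['AH0', 'B', 'R', 'IY1', 'V', 'IY0', 'EY2', 'T', 'AH0', 'D']),
    ('fire', ['F', 'AY1', 'ER0']),
]

for word, pron in toy_entries:
    print(word, stress(pron))
```

Only vowels carry a digit, so the length of the returned list is also the number of syllables in the transcription.
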
WordNet: A Semantically Oriented English Dictionary

Finally, a dictionary that has to be mentioned. WordNet is an English dictionary grounded in cognitive linguistics, designed jointly by psychologists, linguists, and computer engineers at Princeton University. Rather than merely listing words in alphabetical order, it organizes them into a "network of words" according to their meanings.

Introduction and synonyms

Motorcar and automobile are synonyms, which can be studied with the help of WordNet.

    from nltk.corpus import wordnet as wn

    wn.synsets('motorcar')

The result is [Synset('car.n.01')], which shows that motorcar has only one possible meaning. car.n.01 is called a "synset" (a set of synonyms). wn.synset('car.n.01').lemma_names() lists the other words in this synset (the word car itself belongs to several synsets). wn.synset('car.n.01').definition() and wn.synset('car.n.01').examples() return its definition and example sentences. Note that in NLTK 3 these are methods and need the parentheses; the attribute-style access used in older versions no longer works, which is presumably why it failed under Python 3.

One level below the synset is the lemma, the pairing of a synset with a word, such as car.n.01.car.
Lemma objects support the following operations.

    from nltk.corpus import wordnet as wn

    print(wn.synset('car.n.01').lemmas())
    wn.lemma('car.n.01.automobile').name()
    wn.lemma('car.n.01.automobile').synset()

Hypernyms, Hyponyms, and Antonyms

A hypernym is a word whose concept covers a broader range than the given word. For example, "flower" is a hypernym of "fresh flower", "plant" is a hypernym of "flower", and "music" is a hypernym of "MP3". The converse relation is the hyponym.

Hyponyms are accessed through hyponyms(). The immediate hypernyms are accessed through hypernyms(), and the most general (root) hypernyms through root_hypernyms().

    from nltk.corpus import wordnet as wn

    motorcar = wn.synset('car.n.01').hyponyms()   # hyponyms
    car = wn.synset('car.n.01').root_hypernyms()  # root hypernyms

Antonyms are accessed through antonyms(), which is defined on lemmas rather than on synsets.

Other Synset Relations

The relations above run from general to specific, or the reverse. Another important kind runs from whole to part, or the reverse. For example, the parts of a tree, such as its crown and trunk, are given by part_meronyms(). A tree is in turn a member of a forest, given by member_holonyms(). The substance a tree is made of, heartwood and sapwood, is given by substance_meronyms().

Semantic similarity

When two synsets share a hypernym (found by walking up the hypernym tree), and that shared hypernym sits low in the hierarchy, that is, it is quite specific, the two must be closely related.

    from nltk.corpus import wordnet as wn

    right = wn.synset('right_whale.n.01')
    orca = wn.synset('orca.n.01')
    print(right.lowest_common_hypernyms(orca))

A tree structure always has depth, of course, and min_depth() returns the minimum depth of a synset in the hierarchy. Building on this, path_similarity() returns a similarity score in the range 0 to 1. For the code above, the similarity is right.path_similarity(orca).

The numbers themselves mean little in isolation. But comparing a whale with another whale and then a whale with a novel, the score decreases, so the relative sizes are still informative.
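
To get a feel for why the score shrinks as concepts move apart, here is a sketch on an invented mini-taxonomy (child -> parent). It is not WordNet's real hierarchy, and all function names are hypothetical. As far as I understand, NLTK's path_similarity is based on the shortest path connecting the two synsets in the hypernym taxonomy and scores it as 1 / (1 + path length); the sketch follows that assumption for a simple tree.

```python
# Invented mini-taxonomy (child -> parent); NOT WordNet's actual hierarchy.
PARENT = {
    'right_whale': 'baleen_whale',
    'minke_whale': 'baleen_whale',
    'baleen_whale': 'whale',
    'orca': 'toothed_whale',
    'toothed_whale': 'whale',
    'whale': 'entity',
    'novel': 'book',
    'book': 'entity',
}

def path_to_root(node):
    """Return [node, parent, grandparent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def lowest_common_hypernym(a, b):
    """First ancestor of a (walking upward) that is also an ancestor of b."""
    ancestors_b = set(path_to_root(b))
    for node in path_to_root(a):
        if node in ancestors_b:
            return node
    return None

def path_similarity(a, b):
    """1 / (1 + number of edges on the shortest path between a and b)."""
    lch = lowest_common_hypernym(a, b)
    if lch is None:
        return None
    distance = path_to_root(a).index(lch) + path_to_root(b).index(lch)
    return 1 / (1 + distance)

for other in ['minke_whale', 'orca', 'novel']:
    print(other,
          lowest_common_hypernym('right_whale', other),
          path_similarity('right_whale', other))
```

In this toy tree the score falls from whale-versus-whale to whale-versus-novel, which mirrors the ordering described above; the absolute values depend entirely on how the taxonomy is drawn.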
