[TOC]
Corpus basic function table

| Example | Description |
| --- | --- |
| `fileids()` | Files in the corpus |
| `fileids([categories])` | Corpus files belonging to the given categories |
| `categories()` | Categories of the corpus |
| `categories([fileids])` | Categories covering the given files |
| `raw(fileids=[f1,f2,...], categories=[c1,c2,...])` | Raw content of the given files/categories; both parameters may be omitted |
| `words(fileids=[f1,f2,...], categories=[c1,c2,...])` | Words of the given files/categories; both parameters may be omitted |
| `sents()` | Sentences of the whole corpus |
| `sents(fileids=[f1,f2,...], categories=[c1,c2,...])` | Sentences of the given files/categories |
| `abspath(fileid)` | Location of the file on disk |
| `encoding(fileid)` | Encoding of the file |
| `open(fileid)` | Open a stream for reading the file |
| `root()` | Path to the root of the locally installed corpus |
| `readme()` | Contents of the README file |
Text corpus classification
- The simplest is an isolated collection of texts
- Texts grouped into categories by genre, such as the Brown Corpus
- Categories that are not strictly separated and may overlap, such as the Reuters Corpus
- Texts that change over time/language use, such as the Inaugural Address Corpus
Common corpora and their usage
Note: `nltk.Text(tokens)` takes a list of tokens and returns a Text object like the `text1` used in the NLTK book.
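For instance, wrapping a small made-up token list in `nltk.Text` gives list-like access plus methods such as `count()` (a minimal sketch; no corpus download needed):

```python
import nltk

# nltk.Text wraps a list of tokens, not a raw string
tokens = ['hello', 'world', 'hello', 'nltk']
t = nltk.Text(tokens)
print(t.count('hello'))  # how often a token occurs
print(t[:2])             # Text supports indexing/slicing like a list
```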
Gutenberg Corpus
Contains about 36,000 e-books, which can be downloaded from the Project Gutenberg site.
```python
import nltk
from nltk.corpus import gutenberg

print(gutenberg.fileids())
emma = gutenberg.words('austen-emma.txt')
print(gutenberg.raw('austen-emma.txt'))
emma = nltk.Text(emma)
#print(emma[:10])
```
Web and Chat Text
Web text is mostly informal: forum discussions, film scripts, reviews, and so on. The chat text is split into 15 files by chat room (each file name includes the date, the chat room, and the number of posts).
```python
# Web text: webtext
from nltk.corpus import webtext

for fileid in webtext.fileids():
    print(fileid, webtext.raw(fileid)[:50])
```
```
[out]
firefox.txt Cookie Manager: "Don't allow sites that set remove
grail.txt SCENE 1: [wind] [clop clop clop] KING ARTHUR: Who
overheard.txt White guy: So, do you have any plans for this even
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted
singles.txt 25 SEXY MALE, seeks attrac older single lady, for
wine.txt Lovely delicate, fragrant Rhone wine. Polished lea
```
```python
# Chat text: nps_chat
from nltk.corpus import nps_chat

chatroom = nps_chat.posts('10-19-20s_706posts.xml')
print(chatroom[123:125])
```
```
[out]
[['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.'], ['hi', 'U64']]
```
Brown Corpus
A million-word corpus, sorted by genre such as news, editorial, and so on.
```python
from nltk.corpus import brown

print(brown.categories())
print(brown.fileids())
```
Because this corpus was built as a resource for studying systematic differences between genres, we can use it to compare the use of modal verbs across genres.
```python
import nltk
from nltk.corpus import brown

news = brown.words(categories='news')
fdist = nltk.FreqDist([w.lower() for w in news])
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
    print(m, ':', fdist[m])
```
Reuters Corpus
The news documents are split into a "training" set and a "test" set, which makes training and testing convenient. File names have the form 'test/number' and 'training/number'.
```python
from nltk.corpus import reuters

print(reuters.fileids())
print(reuters.categories())
```
Inaugural Address Corpus
A collection of U.S. presidential inaugural addresses. Because each file name has the form 'year-name.txt', we can extract the time dimension and draw a line chart of how often particular words occur in different eras.
```python
import nltk
from nltk.corpus import inaugural

print(list(f[:4] for f in inaugural.fileids()))
# How the use of "america" and "citizen" changes over time
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))
cfd.plot()
```
(The resulting plot is attached in the original post.)
Loading a custom corpus
If you want to use the methods above on your own corpus, you need the `PlaintextCorpusReader` class to load it. It takes two parameters: the root directory and a pattern for the files under it (regular expressions can be used to match).
```python
from nltk.corpus import PlaintextCorpusReader

root = r'C:\Users\Asura-Dong\Desktop\tem\dict'
wordlist = PlaintextCorpusReader(root, '.*')  # match all files
print(wordlist.fileids())
print(wordlist.words('tem1.txt'))
```
Output:
```
['README', 'tem1.txt']
['hello', 'world']
```
Dictionary resources
A dictionary includes part-of-speech and gloss information for each word.
Stopwords Corpus
The stopwords corpus; unfortunately there is no Chinese stopword list.
```python
import nltk
from nltk.corpus import stopwords

# Compute the fraction of words that are not in the stopword list
def content(text):
    stopwords_eng = stopwords.words('english')
    content = [w for w in text if w.lower() not in stopwords_eng]
    return len(content) / len(text)

print(content(nltk.corpus.reuters.words()))
```
Names Corpus
Two files of English first names, male and female. Here we look at the relationship between the last letter of a name and gender.
```python
import nltk

names = nltk.corpus.names
print(names.fileids())
male = names.words('male.txt')
female = names.words('female.txt')
cfd = nltk.ConditionalFreqDist(
    (fileid, name[-1])
    for fileid in names.fileids()
    for name in names.words(fileid))
cfd.plot()
```
(The resulting plot is attached in the original post.)
Pronunciation Dictionary
Even more interesting: this one is designed for speech synthesis. After finishing the book, it is worth thinking about how this could be adapted to Chinese. The dictionary lives at `nltk.corpus.cmudict`.
Once imported, we can get each word's phoneme sequence, which lets us find rhyming words.
```python
import nltk

entries = nltk.corpus.cmudict.entries()
print('Example:', entries[0])
s = ['N', 'IH0', 'K', 'S']  # note: IH0 ends with the digit zero
word_list = [word for word, pron in entries if pron[-4:] == s]
print(word_list)
```
In the phoneme table we find the digits 1, 2, and 0. They mark primary stress, secondary stress, and no stress respectively.
Here we can define a function to find words with a particular stress pattern.
```python
def func(pron):
    return [char for phone in pron for char in phone if char.isdigit()]

word_list = [w for w, pron in entries if func(pron) == ['0', '1', '0', '2', '0']]
print(word_list)
```
WordNet: a semantically oriented English dictionary
This dictionary has to be saved for last. WordNet is an English dictionary grounded in cognitive linguistics, designed by psychologists, linguists, and computer engineers at Princeton University. Rather than simply arranging words in alphabetical order, it organizes them by meaning into a "network of words".
Introduction and synsets
Motorcar and automobile are synonyms, which we can study with the help of WordNet.
```python
from nltk.corpus import wordnet as wn

print(wn.synsets('motorcar'))
```
The result is `[Synset('car.n.01')]`, meaning motorcar has only one possible sense. car.n.01 is called a "synset" (synonym set). We can use `wn.synset('car.n.01').lemma_names()` to see the other words in this synset (car has many synonyms), and `wn.synset('car.n.01').definition()` and `wn.synset('car.n.01').examples()` to view the definition and example sentences. (In NLTK 3 under Python 3 these are methods, so the attribute form without parentheses no longer works.)
An entry at the next level, such as car.n.01.automobile, is called a lemma.
For lemma-level objects, the following operations are available:
```python
print(wn.synset('car.n.01').lemmas())
print(wn.lemma('car.n.01.automobile').name())
print(wn.lemma('car.n.01.automobile').synset())
```
Hypernyms, hyponyms, and antonyms
A hypernym is a word whose concept is broader than that of a given word. For example, "plant" is a hypernym of "flower", and "music" is a hypernym of "MP3". The converse relation is the hyponym.
Hyponyms and root hypernyms are accessed with `hyponyms()` and `root_hypernyms()`.
```python
motorcar = wn.synset('car.n.01').hyponyms()    # hyponyms
car = wn.synset('car.n.01').root_hypernyms()   # root hypernym
```
Antonyms are accessed with `antonyms()`, which is defined on lemmas rather than synsets.
Other synset relations
So far the relations have run from general to specific, or vice versa. Part-whole relations are just as important: the relation between a tree and its crown or trunk is given by `part_meronyms()`; a tree is a member of a forest, `member_holonyms()`; and the substance of a tree is its heartwood and sapwood, `substance_meronyms()`.
Semantic similarity
When two words share a hypernym (found by searching the word tree), and that hypernym sits low in the tree (i.e. is specific), the two words are closely related.
```python
right = wn.synset('right_whale.n.01')
orca = wn.synset('orca.n.01')
print(right.lowest_common_hypernyms(orca))
```
Of course, a tree structure always has depth, and `min_depth()` gives the minimum depth of a synset. Based on this, `path_similarity()` returns a similarity in the range 0-1. For the code above: `right.path_similarity(orca)`.
The numbers themselves mean little, but when a right whale is compared with ever more distant things, such as other whales or even a novel, the numbers shrink, so comparing their magnitudes is meaningful.
NLTK Study Notes (II): Text, Corpus Resources and WordNet Summary