Natural Language Processing 2.3 -- Dictionary Resources

Tags: nltk

A dictionary, or lexical resource, is a collection of words and/or phrases together with associated information such as part of speech and word meaning. Lexical resources are secondary to text: they are created and enriched from text. For example, given a text my_text, vocab = sorted(set(my_text)) builds the vocabulary of the text, and word_freq = FreqDist(my_text) counts the frequency of each word in it. Both vocab and word_freq are simple lexical resources.
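For instance, a minimal sketch (using one of the Gutenberg texts bundled with NLTK as my_text):

>>> import nltk
>>> my_text = nltk.corpus.gutenberg.words('austen-sense.txt')
>>> vocab = sorted(set(my_text))         # the sorted vocabulary of the text
>>> word_freq = nltk.FreqDist(my_text)   # maps each word to its frequency
>>> word_freq.most_common(5)             # the five most frequent tokens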

"term" includes the word (entry) and other additional information. For example: Part of speech and word meaning

1. Vocabulary List Corpus

1.1 NLTK includes a corpus that contains nothing but wordlists. We can use it to find unusual or misspelled words in a text.

Example: filtering a text. The program below computes the vocabulary of a text and then removes every element that appears in the standard wordlist, leaving only the rare or misspelled words.

def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab.difference(english_vocab)
    return sorted(unusual)

>>> unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt'))

Output: ['abbeyland', 'abhorred', 'abilities', 'abounded', 'abridgement', 'abused', 'abuses', 'accents', 'accepting', 'accommodations', 'accompanied', 'accounted', 'accounts', ...] (about 1,600 words in all).

1.2 There is also a stopwords corpus. Stop words are high-frequency words such as the, a, and and; they often need to be filtered out before further processing.

>>> from nltk.corpus import stopwords
>>> stopwords.words('english')

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']
We can define a function to compute the fraction of words in a text that are not in the stopword list:

from nltk.corpus import stopwords

def content_fraction(text):
    spwords = stopwords.words('english')
    content = [w for w in text if w.lower() not in spwords]
    return len(content) / len(text)

>>> print(content_fraction(nltk.corpus.reuters.words()))
0.735240435097661

In other words, stop words account for just over a quarter of the words in this corpus (1 - 0.735 ≈ 26.5%).

Word puzzle. For example: using the letters e, g, i, v, r, v, o, n, l, find all words of six or more letters that contain the letter r, using each letter no more often than it appears in the puzzle.

A wordlist is useful for solving word puzzles like this one. The program iterates over every word in the wordlist and checks whether it meets the conditions. Checking the obligatory letter and the length restriction is simple; the tricky part is allowing a letter to appear at most as often as it does in the puzzle (v occurs twice). The FreqDist comparison method lets us check that the frequency of each letter in a candidate word is no greater than its frequency in the puzzle letters.

def puzzle(text):
    puzzle_letters = nltk.FreqDist(text)
    obligatory = 'r'
    wordlist = nltk.corpus.words.words()
    res = [w for w in wordlist
           if len(w) >= 6
           and obligatory in w
           and nltk.FreqDist(w) <= puzzle_letters]
    print(res)

puzzle('egivrvonl')

The results are: ['glover', 'gorlin', 'govern', 'grovel', 'ignore', 'involver', 'lienor', 'linger', 'longer', 'lovering', 'noiler', 'overling', 'region', 'renvoi', 'revolving', 'ringle', 'roving', 'violer', 'virole']
1.3 There is also a names corpus, a wordlist containing about 8,000 first names categorized by gender; male and female names are stored in separate files.


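The list below appears to be part of the output from finding names that occur in both files, i.e., names that are ambiguous for gender; a sketch that would produce it:

>>> names = nltk.corpus.names
>>> names.fileids()
['female.txt', 'male.txt']
>>> male_names = names.words('male.txt')
>>> female_names = names.words('female.txt')
>>> [w for w in male_names if w in female_names]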
[..., 'Andie', 'Andrea', 'Andy', 'Angel', 'Angie', 'Ariel', 'Ashley', 'Aubrey', 'Augustine', ...]

Let's look at the last letter of each name: how does it differ between male and female names?

>>> from nltk.corpus import names
>>> cfd = nltk.ConditionalFreqDist(
...     (fileid, name[-1])
...     for fileid in names.fileids()
...     for name in names.words(fileid))
>>> cfd.plot()

The plot shows that most names ending in a, e, or i are female, while names ending in k, o, r, s, or t are mostly male.
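To check this without a plot, the same conditional frequency distribution can be tabulated for a chosen set of final letters (a quick sketch; the samples argument restricts the columns shown):

>>> cfd.tabulate(samples=['a', 'e', 'i', 'k', 'o', 'r', 's', 't'])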

2. Pronouncing Dictionary

2.1 NLTK includes the CMU Pronouncing Dictionary of American English, which was designed for use by speech synthesizers.

>>> entries = nltk.corpus.cmudict.entries()
>>> print(len(entries))
133737
>>> for entry in entries[39943:39948]:
...     print(entry)
('explorer', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'ER0'])
('explorers', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'ER0', 'Z'])
('explores', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'Z'])
('exploring', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'IH0', 'NG'])
('explosion', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'ZH', 'AH0', 'N'])
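Besides entries(), the corpus also provides a dict() method that returns a plain mapping from each word to its list of pronunciations, which is convenient for direct lookup:

>>> prondict = nltk.corpus.cmudict.dict()
>>> prondict['fire']
[['F', 'AY1', 'ER0'], ['F', 'AY1', 'R']]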

Example: find entries whose pronunciation consists of three phonemes, where the first phoneme is 'P' and the third is 'T'; print each matching word together with its second phoneme.

>>> entries = nltk.corpus.cmudict.entries()
>>> for word, pron in entries:
...     if len(pron) == 3:
...         ph1, ph2, ph3 = pron
...         if ph1 == 'P' and ph3 == 'T':
...             print(word, ph2)

Results: pait EY1  pat AE1  pate EY1  patt AE1  peart ER1  peat IY1  peet IY1  peete IY1  pert ER1  pet EH1  pete IY1  pett EH1  piet IY1  piette IY1  pit IH1  pitt IH1  pot AA1  pote OW1  pott AA1  pout AW1  puett UW1  purt ER1  put UH1  putt AH1

The phonemes contain digits that denote primary stress (1), secondary stress (2), and no stress (0). We can define a function to extract the stress digits and then look for words with a particular stress pattern.

>>> entries = nltk.corpus.cmudict.entries()
>>> def stress(pron):
...     return [char for phone in pron for char in phone if char.isdigit()]
>>> # find words whose stress pattern is 0-1-0-2-0
>>> res = [w for w, pron in entries if stress(pron) == ['0', '1', '0', '2', '0']]
>>> print(res)
['abbreviated', 'abbreviated', 'abbreviating', 'accelerated', 'accelerating', 'accelerator', 'accelerators', 'accentuated', 'accentuating', 'accommodated', 'accommodating', 'accommodative', 'accumulated', 'accumulating', 'accumulative', 'accumulator', 'accumulators', ...]

3. Comparative Wordlists

NLTK includes the so-called Swadesh wordlists: lists of about 200 common words in a number of languages.

>>> from nltk.corpus import swadesh
>>> swadesh.fileids()
['be', 'bg', 'bs', 'ca', 'cs', 'cu', 'de', 'en', 'es', 'fr', 'hr', 'it', 'la', 'mk', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sr', 'sw', 'uk']

You can use the entries() method to specify a list of languages and access the cognate words across them. Moreover, the result can be converted into a simple dictionary:

>>> fr2en = swadesh.entries(['fr', 'en'])   # French and English
>>> translate = dict(fr2en)
>>> translate['chien']   # translate a French word
'dog'
>>> translate['jeter']
'throw'
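The same pattern extends to other language pairs: entries for additional languages can be merged into the same lookup dictionary with update() (a sketch in the spirit of the example above):

>>> de2en = swadesh.entries(['de', 'en'])   # German-English pairs
>>> es2en = swadesh.entries(['es', 'en'])   # Spanish-English pairs
>>> translate.update(dict(de2en))
>>> translate.update(dict(es2en))
>>> translate['Hund']
'dog'
>>> translate['perro']
'dog'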
