[TOC]
Part-of-speech tagger
Much of the later work requires the words to be tagged first. NLTK ships with an English part-of-speech tagger, `pos_tag`:
```python
import nltk

text = nltk.word_tokenize("And now for something completely different")
print(text)
print(nltk.pos_tag(text))
```
Tagged corpora
A tagged token is represented as a tuple; `nltk.tag.str2tuple('word/TAG')` builds one from the standard string form:
Text = "The/at grand/jj is/vbd." Print ([Nltk.tag.str2tuple (t) for T in Text.split ()])
Reading tagged corpora
NLTK's corpus readers provide a uniform interface, so you can ignore the different file formats. The pattern is `corpus.tagged_words()` / `corpus.tagged_sents()`, and parameters can specify categories and fileids:
```python
print(nltk.corpus.brown.tagged_words())
```
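The parameters just mentioned can be sketched like this (assuming the Brown corpus; in NLTK 3, `tagset='universal'` maps the Brown tags onto the simplified universal tagset):

```python
# Restrict to a category, and optionally map to the universal tagset
print(nltk.corpus.brown.tagged_words(categories='news'))
print(nltk.corpus.brown.tagged_sents(categories='news')[0])
print(nltk.corpus.brown.tagged_words(tagset='universal'))
```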
Nouns, verbs, adjectives, etc.
Here is an example that lists the words tagged as verbs in the news category, and then looks up the different tags given to the word *money*.
```python
from nltk.corpus import brown

word_tag = nltk.FreqDist(brown.tagged_words(categories="news"))
print([word + '/' + tag for (word, tag) in word_tag if tag.startswith('V')])

# Look up the different tags given to "money"
wsj = brown.tagged_words(categories="news")
cfd = nltk.ConditionalFreqDist(wsj)
print(cfd['money'].keys())
```
Next, find the most frequent words for each noun tag type:
```python
def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.startswith(tag_prefix))
    # most_common() returns (word, count) pairs sorted by frequency;
    # plain keys() would have to be converted to a list before slicing
    return dict((tag, [word for (word, _) in cfd[tag].most_common(5)])
                for tag in cfd.conditions())

tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories="news"))
for tag in sorted(tagdict):
    print(tag, tagdict[tag])
```
Exploring tagged corpora
This uses `nltk.bigrams()` and `nltk.trigrams()`, which correspond to the 2-gram and 3-gram models respectively. For example, to see which parts of speech follow the word *often*:
```python
brown_tagged = brown.tagged_words(categories="learned")
tags = [b[1] for (a, b) in nltk.bigrams(brown_tagged) if a[0] == "often"]
fd = nltk.FreqDist(tags)
fd.tabulate()
```
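`nltk.trigrams()` works the same way; a small sketch (reusing `brown_tagged` from above) that looks two tokens past *often* instead of one:

```python
# Tags of the word two positions after "often"
tags2 = [c[1] for (a, b, c) in nltk.trigrams(brown_tagged) if a[0] == "often"]
nltk.FreqDist(tags2).tabulate()
```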
Automatic tagging
The default tagger
The simplest tagger assigns the same tag to every token. Below is a tagger that turns every word into NN, tested with `evaluate()`. Since many words are nouns, this helps a first-pass analysis and gives a stable baseline.
```python
brown_tagged_sents = brown.tagged_sents(categories="news")
raw = "I do not like eggs and ham, I don't like them Sam I am"
tokens = nltk.word_tokenize(raw)
default_tagger = nltk.DefaultTagger('NN')       # create the tagger
print(default_tagger.tag(tokens))               # call tag() to tag the tokens
print(default_tagger.evaluate(brown_tagged_sents))
```
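Why NN? A quick sanity check, counting which tag is most frequent in the news category, shows why a noun is the sensible default:

```python
# The most frequent tag in the Brown news category is 'NN'
tags = [tag for (word, tag) in brown.tagged_words(categories="news")]
print(nltk.FreqDist(tags).max())
```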
The regular expression tagger
Note that the rules here are fixed and chosen by yourself. As the rules become more complete, accuracy rises.
```python
patterns = [
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*es$', 'VBZ'),
    (r'.*', 'NN')        # for convenience, only a few rules are given
]
regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.evaluate(brown_tagged_sents)
```
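A quick sketch of the rules in action on a single Brown sentence (sentence 3 is an arbitrary pick):

```python
# Apply the hand-written rules to one sentence
print(regexp_tagger.tag(brown.sents(categories="news")[3]))
```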
The lookup tagger
This differs from the book because Python 3 differs from Python 2, so take care when debugging. A lookup tagger stores the most likely tag for each word, and you can set a `backoff` parameter: whenever the lookup tagger cannot tag a word, the backoff tagger is consulted (this process is called backoff).
```python
fd = nltk.FreqDist(brown.words(categories="news"))
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories="news"))
# Python 2 sliced fd.keys(); in Python 3 use most_common() instead
most_freq_words = fd.most_common(100)
likely_tags = dict((word, cfd[word].max()) for (word, times) in most_freq_words)
baseline_tagger = nltk.UnigramTagger(model=likely_tags,
                                     backoff=nltk.DefaultTagger('NN'))
baseline_tagger.evaluate(brown_tagged_sents)
```
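To see how accuracy grows with the size of the lookup model, the book's `performance` helper can be adapted as a sketch (reusing `fd` and `cfd` from above):

```python
def performance(cfd, wordlist):
    lt = dict((word, cfd[word].max()) for word in wordlist)
    baseline_tagger = nltk.UnigramTagger(model=lt,
                                         backoff=nltk.DefaultTagger('NN'))
    return baseline_tagger.evaluate(brown.tagged_sents(categories="news"))

# Compare a 100-word model with a 1000-word model
words_by_freq = [word for (word, times) in fd.most_common(1000)]
print(performance(cfd, words_by_freq[:100]))
print(performance(cfd, words_by_freq))
```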
N-gram tagging
The unigram tagger
A unigram tagger behaves like a lookup tagger; the new technique is that a unigram tagger is built by training. Here our tagger simply memorizes the training set instead of building a general model: it matches the training data well but cannot generalize to new text, so we split the data into a training set and a test set.
```python
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
unigram_tagger = nltk.UnigramTagger(train_sents)
unigram_tagger.evaluate(test_sents)
```
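To see what the trained unigram tagger does with a single sentence (sentence 2007 is the book's arbitrary pick):

```python
# Tag one Brown sentence with the trained unigram tagger
brown_sents = brown.sents(categories="news")
print(unigram_tagger.tag(brown_sents[2007]))
```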
General N-gram tagging
An n-gram tagger picks the tag for the word at index i by looking at that word together with the tags of the n-1 preceding words (positions i-(n-1) through i-1). Using the tags of the preceding words helps pin down the tag of the current word. Analogous to `nltk.UnigramTagger()`, the built-in bigram tagger is `nltk.BigramTagger()`, used the same way (see the sketch below).
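As a sketch of that usage (reusing `train_sents`/`test_sents` from the previous section): a bigram tagger trained on its own suffers from data sparseness, so its standalone accuracy is low.

```python
bigram_tagger = nltk.BigramTagger(train_sents)
# On an unseen sentence, any unseen context yields None, and every
# tag after the first None is None as well
unseen_sent = [word for (word, tag) in test_sents[0]]
print(bigram_tagger.tag(unseen_sent))
print(bigram_tagger.evaluate(test_sents))
```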
Combining taggers
Often an algorithm with wider coverage is more useful than one with higher precision. Specifying a fallback tagger with the `backoff` parameter lets taggers be combined. When the `cutoff` parameter is explicitly declared as an int n, contexts that appear only 1 to n times are automatically discarded (a sketch follows the evaluation below).
```python
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t2.evaluate(test_sents)
```
Compared with the individual taggers above, the accuracy improves significantly.
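A sketch of the `cutoff` parameter mentioned above: passing `cutoff=2` makes the bigram tagger discard contexts seen only once or twice.

```python
# Discard bigram contexts that occur no more than twice in the training data
t2_cut = nltk.BigramTagger(train_sents, cutoff=2, backoff=t1)
print(t2_cut.evaluate(test_sents))
```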
Tagging across sentence boundaries
The first word of a sentence has no preceding words to condition on. The workaround: train the tagger on lists of tagged sentences (`tagged_sents`), as we already did above, so sentence boundaries are respected.
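Because `t2` above was trained on `tagged_sents`, it already learns within sentence boundaries; at tagging time, keep new text split into sentences too, e.g. via `tag_sents()` from NLTK's tagger interface (the sample text is made up):

```python
# Keep the text as a list of sentences so no bigram crosses a boundary
raw = "Sentences are tagged one at a time. No bigram crosses a boundary."
sents = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(raw)]
print(t2.tag_sents(sents))
```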
Transformation-based tagging: the Brill tagger
Potentially better than everything above. The idea: start with broad brush strokes, then fix the details with successively smaller changes. It is not only small in memory but also uses context, correcting errors with increasingly specific rules rather than a static table. Of course, the calls differ between Python 3 and Python 2.
```python
from nltk.tag import brill

# Demo template sets shipped with NLTK
brill.nltkdemo18plus()
brill.nltkdemo18()
```
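The demo functions above only return rule templates. A minimal sketch of actually training a Brill tagger on top of the combined tagger, assuming NLTK 3's `nltk.tag.brill_trainer` module and reusing `t2`, `train_sents`, and `test_sents` from the n-gram section:

```python
from nltk.tag import brill
from nltk.tag.brill_trainer import BrillTaggerTrainer

# The Brill tagger starts from an initial tagger (here t2) and learns
# transformation rules that correct its errors on the training data
trainer = BrillTaggerTrainer(initial_tagger=t2,
                             templates=brill.nltkdemo18(),
                             trace=1)
brill_tagger = trainer.train(train_sents, max_rules=10)
print(brill_tagger.evaluate(test_sents))
```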