NLTK Learning: Classifying and Tagging Words

Tags: nltk
[TOC]

The part-of-speech tagger

Much of the work that follows requires words to be tagged with their part of speech. NLTK comes with an English tagger, pos_tag.

```python
import nltk

text = nltk.word_tokenize("And now for something completely different")
print(text)
print(nltk.pos_tag(text))
```

Tagged corpora

A tagged token is represented with nltk.tag.str2tuple('word/TAG').

Text = "The/at grand/jj is/vbd." Print ([Nltk.tag.str2tuple (t) for T in Text.split ()])

Reading tagged corpora

The NLTK corpus readers provide a uniform interface that lets you ignore the different file formats. The format is corpus.tagged_words() / tagged_sents(), and the parameters can specify categories or fileids.

```python
print(nltk.corpus.brown.tagged_words())
```
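Both methods accept the parameters mentioned above. A small sketch, assuming an NLTK 3 install with the universal_tagset mapping downloaded:

```python
import nltk

# Restrict to one category and map Brown tags to the simplified
# universal tagset
print(nltk.corpus.brown.tagged_words(categories="news", tagset="universal"))
# The sentence-level version of the same interface
print(nltk.corpus.brown.tagged_sents(categories="news")[0])
```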

Nouns, verbs, adjectives, etc.

Here is an example that lists verbs, followed by a lookup of the different tags given to the word "money".

```python
from nltk.corpus import brown

word_tag = nltk.FreqDist(brown.tagged_words(categories="news"))
print([word + '/' + tag for (word, tag) in word_tag if tag.startswith('V')])

# Look up the different tags given to "money"
wsj = brown.tagged_words(categories="news")
cfd = nltk.ConditionalFreqDist(wsj)
print(cfd['money'].keys())
```

Try to find the most frequent words for each noun tag.

```python
def findtag(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.startswith(tag_prefix))
    # In Python 3, use most_common() to get the five most frequent words;
    # dict keys cannot be sliced directly
    return dict((tag, [w for (w, _) in cfd[tag].most_common(5)])
                for tag in cfd.conditions())

tagdict = findtag('NN', nltk.corpus.brown.tagged_words(categories="news"))
for tag in sorted(tagdict):
    print(tag, tagdict[tag])
```

Exploring tagged corpora

Use nltk.bigrams() and nltk.trigrams(), which correspond to the 2-gram and 3-gram models respectively.

```python
brown_tagged = brown.tagged_words(categories="learned")
tags = [b[1] for (a, b) in nltk.bigrams(brown_tagged) if a[0] == "often"]
fd = nltk.FreqDist(tags)
fd.tabulate()
```
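For nltk.trigrams(), a hedged sketch adapted from the classic "verb to verb" pattern search; the helper name process is our own:

```python
import nltk
from nltk.corpus import brown

def process(sentence):
    # Scan adjacent (word, tag) triples for a verb + TO + verb pattern,
    # e.g. "combined to achieve"
    for (w1, t1), (w2, t2), (w3, t3) in nltk.trigrams(sentence):
        if t1.startswith('V') and t2 == 'TO' and t3.startswith('V'):
            print(w1, w2, w3)

for tagged_sent in brown.tagged_sents(categories="learned")[:100]:
    process(tagged_sent)
```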

Automatic tagging

The default tagger

The simplest tagger assigns the same tag to every token. Below we build a tagger that turns all words into NN, and use evaluate() to test it. Since many words really are nouns, this eases first-pass analysis and improves stability.

```python
brown_tagged_sents = brown.tagged_sents(categories="news")
raw = "I do not like green eggs and ham, I do not like them Sam I am!"
tokens = nltk.word_tokenize(raw)
default_tagger = nltk.DefaultTagger('NN')      # create the tagger
print(default_tagger.tag(tokens))              # call tag() to tag the tokens
print(default_tagger.evaluate(brown_tagged_sents))
```

The regular expression tagger

Note that the rules here are fixed (chosen by yourself). The more complete the rules become, the higher the accuracy.

```python
patterns = [
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*es$', 'VBZ'),
    (r'.*', 'NN')        # for convenience, only a few rules
]
regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.evaluate(brown_tagged_sents)
```

The lookup tagger

The code here differs from the book, which uses Python 2, so take care when debugging. A lookup tagger stores the most likely tag for each word, and you can set the backoff parameter: when a word cannot be tagged, the backoff tagger is used instead (this process is called backoff).

```python
fd = nltk.FreqDist(brown.words(categories="news"))
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories="news"))
# Python 2/3 difference: use most_common() instead of slicing fd.keys().
# The count of 100 follows the NLTK book's example.
most_freq_words = fd.most_common(100)
likely_tags = dict((word, cfd[word].max()) for (word, times) in most_freq_words)
baseline_tagger = nltk.UnigramTagger(model=likely_tags,
                                     backoff=nltk.DefaultTagger('NN'))
baseline_tagger.evaluate(brown_tagged_sents)
```

N-gram tagging

The basic unigram tagger

A unigram tagger behaves much like a lookup tagger; the technique for building one is training on a tagged corpus.

If we simply train and test on the same data, the tagger just memorizes the training set instead of building a general model: it matches that data well but cannot generalize to new text. So we split off a held-out test set (the sketch after the next code block shows the optimistic score you get without the split).

```python
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]   # [size:], not [size+1:], or one sentence is lost
unigram_tagger = nltk.UnigramTagger(train_sents)
unigram_tagger.evaluate(test_sents)
```
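To see the memorization point concretely, a minimal sketch: evaluating on the very sentences the tagger was trained on gives an optimistic score, because every word has already been seen.

```python
# Training and testing on the same data overstates accuracy
unigram_all = nltk.UnigramTagger(brown_tagged_sents)
print(unigram_all.evaluate(brown_tagged_sents))  # higher than on held-out test_sents
```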

General n-gram tagging

An n-gram tagger chooses the tag for the word at index i by looking at that word together with the tags of the n-1 preceding words (indices i-n+1 through i-1): the tags of the preceding words help determine the tag of the current one. Like nltk.UnigramTagger(), the built-in bigram tagger nltk.BigramTagger() has the same usage.
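A hedged sketch of a standalone bigram tagger, reusing the train/test split from the previous section. On its own it suffers from sparse data: any unseen (previous tag, word) context is tagged None, and every tag after it in the sentence then fails too, so held-out accuracy is low.

```python
bigram_tagger = nltk.BigramTagger(train_sents)
# A sentence from the training data is tagged well...
print(bigram_tagger.tag([w for (w, _) in train_sents[0]]))
# ...but accuracy on unseen sentences collapses
print(bigram_tagger.evaluate(test_sents))
```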

Combining taggers

Often, an algorithm with wider coverage is more useful than one with higher precision. Use the backoff parameter to specify a fallback tagger and combine taggers. When the cutoff parameter is explicitly given as an int n, contexts that occur no more than n times are automatically discarded.

```python
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t2.evaluate(test_sents)
```
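A sketch of the cutoff parameter described above (the value 2 is just an illustration): contexts seen no more than twice are discarded, trading a little coverage for reliability.

```python
# Discard bigram contexts that occur at most twice in the training data
t2_cut = nltk.BigramTagger(train_sents, cutoff=2, backoff=t1)
print(t2_cut.evaluate(test_sents))
```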

Compared with the earlier taggers, the accuracy improves noticeably.

Tagging across sentence boundaries

The first word of a sentence has no preceding words to condition on. The workaround: train the tagger on lists of tagged sentences (tagged_sents), so each sentence keeps its own context.
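A minimal sketch of this point, assuming the imports from earlier: because the training data is a list of tagged sentences, the tagger's context is reset at every sentence start, so the first word never inherits context from the previous sentence.

```python
# tagged_sents is a list of sentences, so sentence boundaries survive training
tagger = nltk.BigramTagger(brown.tagged_sents(categories="news"),
                           backoff=nltk.DefaultTagger('NN'))
print(tagger.tag(['The', 'dog', 'barked', '.']))
```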

Transformation-based tagging: the Brill tagger

This performs better than all of the taggers above. The idea behind it: start with broad brush strokes, then fix the details, making smaller and smaller changes. It is not only memory-efficient but also context-aware, correcting errors dynamically as they are found rather than applying a static table. As before, the calls differ between Python 2 and Python 3.

```python
from nltk.tag import brill

brill.nltkdemo18plus()
brill.nltkdemo18()
```
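The demos above only return rule templates. A hedged sketch of actually training a Brill tagger with nltk.tag.brill_trainer.BrillTaggerTrainer, assuming the combined tagger t2 and the train/test split from the earlier sections:

```python
from nltk.tag import brill, brill_trainer

templates = brill.nltkdemo18()                     # rule templates from the demo
trainer = brill_trainer.BrillTaggerTrainer(initial_tagger=t2,
                                           templates=templates, trace=0)
brill_tagger = trainer.train(train_sents, max_rules=10)
print(brill_tagger.evaluate(test_sents))
```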
