[TOC]
Part-of-speech tagger
Much of the later work requires the words to be tagged first. NLTK ships with an English part-of-speech tagger, `pos_tag`:
```python
import nltk

text = nltk.word_tokenize("And now for something completely different")
print(text)
print(nltk.pos_tag(text))
```
Tagged corpora
A tagged token is represented as a tuple; `nltk.tag.str2tuple('word/TAG')` builds one from the standard string form:
Text = "The/at grand/jj is/vbd." Print ([Nltk.tag.str2tuple (t) for T in Text.split ()])
Reading tagged corpora
NLTK's corpus readers provide a uniform interface, so you can ignore the different file formats. The pattern is `corpus.tagged_words()` / `corpus.tagged_sents()`, and parameters can specify categories and fileids:
```python
print(nltk.corpus.brown.tagged_words())
```
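The parameters just mentioned can be sketched like this (assuming the Brown corpus; in NLTK 3, `tagset='universal'` maps the Brown tags onto the simplified universal tagset):

```python
# Restrict to a category, and optionally map to the universal tagset
print(nltk.corpus.brown.tagged_words(categories='news'))
print(nltk.corpus.brown.tagged_sents(categories='news')[0])
print(nltk.corpus.brown.tagged_words(tagset='universal'))
```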
Nouns, verbs, adjectives, etc.
Here is an example that lists the words tagged as verbs in the news category, and then looks up the different tags given to the word *money*.
```python
from nltk.corpus import brown

word_tag = nltk.FreqDist(brown.tagged_words(categories="news"))
print([word + '/' + tag for (word, tag) in word_tag if tag.startswith('V')])

# Look up the different tags given to "money"
wsj = brown.tagged_words(categories="news")
cfd = nltk.ConditionalFreqDist(wsj)
print(cfd['money'].keys())
```
Next, find the most frequent words for each noun tag type:
```python
def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.startswith(tag_prefix))
    # most_common() returns (word, count) pairs sorted by frequency;
    # plain keys() would have to be converted to a list before slicing
    return dict((tag, [word for (word, _) in cfd[tag].most_common(5)])
                for tag in cfd.conditions())

tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories="news"))
for tag in sorted(tagdict):
    print(tag, tagdict[tag])
```
Exploring tagged corpora
This uses `nltk.bigrams()` and `nltk.trigrams()`, which correspond to the 2-gram and 3-gram models respectively. For example, to see which parts of speech follow the word *often*:
```python
brown_tagged = brown.tagged_words(categories="learned")
tags = [b[1] for (a, b) in nltk.bigrams(brown_tagged) if a[0] == "often"]
fd = nltk.FreqDist(tags)
fd.tabulate()
```
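`nltk.trigrams()` works the same way; a small sketch (reusing `brown_tagged` from above) that looks two tokens past *often* instead of one:

```python
# Tags of the word two positions after "often"
tags2 = [c[1] for (a, b, c) in nltk.trigrams(brown_tagged) if a[0] == "often"]
nltk.FreqDist(tags2).tabulate()
```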
Automatic tagging
The default tagger
The simplest tagger assigns the same tag to every token. Below is a tagger that turns every word into NN, tested with `evaluate()`. Since many words are nouns, this helps a first-pass analysis and gives a stable baseline.
```python
brown_tagged_sents = brown.tagged_sents(categories="news")
raw = "I do not like eggs and ham, I don't like them Sam I am"
tokens = nltk.word_tokenize(raw)
default_tagger = nltk.DefaultTagger('NN')       # create the tagger
print(default_tagger.tag(tokens))               # call tag() to tag the tokens
print(default_tagger.evaluate(brown_tagged_sents))
```
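Why NN? A quick sanity check, counting which tag is most frequent in the news category, shows why a noun is the sensible default:

```python
# The most frequent tag in the Brown news category is 'NN'
tags = [tag for (word, tag) in brown.tagged_words(categories="news")]
print(nltk.FreqDist(tags).max())
```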
The regular expression tagger
Note that the rules here are fixed and chosen by yourself. As the rules become more complete, accuracy rises.
```python
patterns = [
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*es$', 'VBZ'),
    (r'.*', 'NN')        # for convenience, only a few rules are given
]
regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.evaluate(brown_tagged_sents)
```
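A quick sketch of the rules in action on a single Brown sentence (sentence 3 is an arbitrary pick):

```python
# Apply the hand-written rules to one sentence
print(regexp_tagger.tag(brown.sents(categories="news")[3]))
```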
The lookup tagger
This differs from the book because Python 3 differs from Python 2, so take care when debugging. A lookup tagger stores the most likely tag for each word, and you can set a `backoff` parameter: whenever the lookup tagger cannot tag a word, the backoff tagger is consulted (this process is called backoff).
```python
fd = nltk.FreqDist(brown.words(categories="news"))
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories="news"))
# Python 2 sliced fd.keys(); in Python 3 use most_common() instead
most_freq_words = fd.most_common(100)
likely_tags = dict((word, cfd[word].max()) for (word, times) in most_freq_words)
baseline_tagger = nltk.UnigramTagger(model=likely_tags,
                                     backoff=nltk.DefaultTagger('NN'))
baseline_tagger.evaluate(brown_tagged_sents)
```
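To see how accuracy grows with the size of the lookup model, the book's `performance` helper can be adapted as a sketch (reusing `fd` and `cfd` from above):

```python
def performance(cfd, wordlist):
    lt = dict((word, cfd[word].max()) for word in wordlist)
    baseline_tagger = nltk.UnigramTagger(model=lt,
                                         backoff=nltk.DefaultTagger('NN'))
    return baseline_tagger.evaluate(brown.tagged_sents(categories="news"))

# Compare a 100-word model with a 1000-word model
words_by_freq = [word for (word, times) in fd.most_common(1000)]
print(performance(cfd, words_by_freq[:100]))
print(performance(cfd, words_by_freq))
```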
N-gram tagging
The unigram tagger
A unigram tagger behaves like a lookup tagger; the new technique is that a unigram tagger is built by training. Here our tagger simply memorizes the training set instead of building a general model: it matches the training data well but cannot generalize to new text, so we split the data into a training set and a test set.
```python
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
unigram_tagger = nltk.UnigramTagger(train_sents)
unigram_tagger.evaluate(test_sents)
```
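To see what the trained unigram tagger does with a single sentence (sentence 2007 is the book's arbitrary pick):

```python
# Tag one Brown sentence with the trained unigram tagger
brown_sents = brown.sents(categories="news")
print(unigram_tagger.tag(brown_sents[2007]))
```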
General N-gram tagging
An n-gram tagger picks the tag for the word at index i by looking at that word together with the tags of the n-1 preceding words (positions i-(n-1) through i-1). Using the tags of the preceding words helps pin down the tag of the current word. Analogous to `nltk.UnigramTagger()`, the built-in bigram tagger is `nltk.BigramTagger()`, used the same way (see the sketch below).
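As a sketch of that usage (reusing `train_sents`/`test_sents` from the previous section): a bigram tagger trained on its own suffers from data sparseness, so its standalone accuracy is low.

```python
bigram_tagger = nltk.BigramTagger(train_sents)
# On an unseen sentence, any unseen context yields None, and every
# tag after the first None is None as well
unseen_sent = [word for (word, tag) in test_sents[0]]
print(bigram_tagger.tag(unseen_sent))
print(bigram_tagger.evaluate(test_sents))
```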
Combining taggers
Often an algorithm with wider coverage is more useful than one with higher precision. Specifying a fallback tagger with the `backoff` parameter lets taggers be combined. When the `cutoff` parameter is explicitly declared as an int n, contexts that appear only 1 to n times are automatically discarded (a sketch follows the evaluation below).
```python
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t2.evaluate(test_sents)
```
Compared with the individual taggers above, the accuracy improves significantly.
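A sketch of the `cutoff` parameter mentioned above: passing `cutoff=2` makes the bigram tagger discard contexts seen only once or twice.

```python
# Discard bigram contexts that occur no more than twice in the training data
t2_cut = nltk.BigramTagger(train_sents, cutoff=2, backoff=t1)
print(t2_cut.evaluate(test_sents))
```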
Tagging across sentence boundaries
The first word of a sentence has no preceding words to condition on. The workaround: train the tagger on lists of tagged sentences (`tagged_sents`), as we already did above, so sentence boundaries are respected.
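Because `t2` above was trained on `tagged_sents`, it already learns within sentence boundaries; at tagging time, keep new text split into sentences too, e.g. via `tag_sents()` from NLTK's tagger interface (the sample text is made up):

```python
# Keep the text as a list of sentences so no bigram crosses a boundary
raw = "Sentences are tagged one at a time. No bigram crosses a boundary."
sents = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(raw)]
print(t2.tag_sents(sents))
```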
Transformation-based tagging: the Brill tagger
Potentially better than everything above. The idea: start with broad brush strokes, then fix the details with successively smaller changes. It is not only small in memory but also uses context, correcting errors with increasingly specific rules rather than a static table. Of course, the calls differ between Python 3 and Python 2.
```python
from nltk.tag import brill

# Demo template sets shipped with NLTK
brill.nltkdemo18plus()
brill.nltkdemo18()
```
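The demo functions above only return rule templates. A minimal sketch of actually training a Brill tagger on top of the combined tagger, assuming NLTK 3's `nltk.tag.brill_trainer` module and reusing `t2`, `train_sents`, and `test_sents` from the n-gram section:

```python
from nltk.tag import brill
from nltk.tag.brill_trainer import BrillTaggerTrainer

# The Brill tagger starts from an initial tagger (here t2) and learns
# transformation rules that correct its errors on the training data
trainer = BrillTaggerTrainer(initial_tagger=t2,
                             templates=brill.nltkdemo18(),
                             trace=1)
brill_tagger = trainer.train(train_sents, max_rules=10)
print(brill_tagger.evaluate(test_sents))
```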