Natural Language 16: Chunking with NLTK

Source: Internet
Author: User
Tags: nltk

Chunking with NLTK




Now that we know the parts of speech, we can do what is called chunking: grouping words into hopefully meaningful chunks. One of the main goals of chunking is to group words into what are known as "noun phrases." These are phrases of one or more words that contain a noun, maybe some descriptive words, maybe a verb, and maybe something like an adverb. The idea is to group nouns together with the words that are in relation to them.

In order to chunk, we combine the part-of-speech tags with regular expressions. From regular expressions, we are mainly going to utilize the following:

+ = match 1 or more repetitions
? = match 0 or 1 repetitions
* = match 0 or more repetitions
. = any character except a new line

See the tutorial linked above if you need help with regular expressions. The last things to note are that the part-of-speech tags are denoted with '<' and '>', and that we can also place regular expressions within the tags themselves, to account for things like "all nouns" (<N.*>).
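As a quick sanity check of that <N.*> idea, here is a minimal sketch on a hand-tagged sentence (the tokens and tags are invented for illustration, so no corpora or trained taggers are needed):

```python
import nltk

# Hand-tagged tokens; tags are supplied directly for illustration
tagged = [("Two", "CD"), ("dogs", "NNS"), ("chased", "VBD"), ("Rex", "NNP")]

# <N.*> matches any noun tag: NN, NNS, NNP, NNPS...
grammar = r"""Nouns: {<N.*>+}"""
result = nltk.RegexpParser(grammar).parse(tagged)
print(result)
```

Here "dogs" (NNS) and "Rex" (NNP) each come out wrapped in a Nouns subtree, while the cardinal number and the verb are left unchunked.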

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            chunked.draw()

    except Exception as e:
        print(str(e))

process_content()

The result of this is a tree diagram for each sentence, drawn in a new window by chunked.draw().

The main line here in question is:

chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""

This line, broken down:

<RB.?>* = "0 or more of any tense of adverb," followed by:

<VB.?>* = "0 or more of any tense of verb," followed by:

<NNP>+ = "one or more proper nouns," followed by:

<NN>? = "zero or one singular noun."
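To see those four pieces working together, here is a minimal sketch on a hand-tagged sentence (the words and tags are invented for illustration, so nothing needs to be downloaded):

```python
import nltk

# Hand-tagged toy sentence: adverb, verb, three proper nouns, singular noun
tagged = [("quickly", "RB"), ("meet", "VB"),
          ("President", "NNP"), ("George", "NNP"), ("Bush", "NNP"),
          ("today", "NN")]

chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
chunkParser = nltk.RegexpParser(chunkGram)
chunked = chunkParser.parse(tagged)
print(chunked)
```

All six tokens satisfy the pattern in order, so they are grouped into a single Chunk subtree.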

Try playing around with combinations to group various instances until you feel comfortable with chunking.

Not covered in the video, but also a reasonable task, is to actually access the chunks specifically. This is something rarely talked about, but it can be an essential step depending on what you're doing. Say you print the chunks out; you are going to see output like:

(S
  (Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
  'S/POS
  (Chunk
    ADDRESS/NNP
    BEFORE/NNP
    A/NNP
    JOINT/NNP
    SESSION/NNP
    OF/NNP
    THE/NNP
    CONGRESS/NNP
    ON/NNP
    THE/NNP
    STATE/NNP
    OF/NNP
    THE/NNP
    UNION/NNP
    JANUARY/NNP)
  31/CD
  ,/,
  2006/CD
  THE/DT
  (Chunk PRESIDENT/NNP)
  :/:
  (Chunk THANK/NNP)
  YOU/PRP
  ALL/DT
  ./.)

Cool, that helps us visually, but what if we want to access this data via our program? Well, what is happening here is that our "chunked" variable is an NLTK tree. Each "chunk" and "non chunk" is a "subtree" of the tree. We can reference these with chunked.subtrees, and then iterate through the subtrees like so:

for subtree in chunked.subtrees():
    print(subtree)

Next, we might be interested in getting just the chunks, ignoring the rest. We can use the filter parameter in the chunked.subtrees() call:

for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
    print(subtree)

Now we're filtering to show only the subtrees with the label of "Chunk." Keep in mind, this isn't "Chunk" as in some NLTK attribute; it is "Chunk" literally because that's the label we gave it here: chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""

Had we written instead something like chunkGram = r"""Pythons: {<RB.?>*<VB.?>*<NNP>+<NN>?}""", then we would filter by the label of "Pythons." The result here should be something like:

(Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
(Chunk
  ADDRESS/NNP
  BEFORE/NNP
  A/NNP
  JOINT/NNP
  SESSION/NNP
  OF/NNP
  THE/NNP
  CONGRESS/NNP
  ON/NNP
  THE/NNP
  STATE/NNP
  OF/NNP
  THE/NNP
  UNION/NNP
  JANUARY/NNP)
(Chunk PRESIDENT/NNP)
(Chunk THANK/NNP)
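If you want the chunk text itself rather than printed subtrees, each subtree's leaves are (word, tag) pairs that you can join back into a string. A minimal sketch, using a small hand-tagged sentence in place of the State of the Union data (the tokens and tags are invented for illustration):

```python
import nltk

# Hand-tagged stand-in sentence, chunked with the same grammar as above
tagged = [("PRESIDENT", "NNP"), ("GEORGE", "NNP"), ("W.", "NNP"), ("BUSH", "NNP"),
          ("spoke", "VBD"), ("to", "TO"), ("the", "DT"), ("CONGRESS", "NNP")]
chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
chunked = nltk.RegexpParser(chunkGram).parse(tagged)

# Each subtree's leaves are (word, tag) pairs; join the words to get the phrase text
phrases = [" ".join(word for word, tag in subtree.leaves())
           for subtree in chunked.subtrees(filter=lambda t: t.label() == "Chunk")]
print(phrases)  # → ['PRESIDENT GEORGE W. BUSH', 'CONGRESS']
```

This is often the form you actually want downstream, since plain strings are easier to count, deduplicate, or feed into further processing than Tree objects.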

Full code for this would be:

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            print(chunked)
            for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                print(subtree)
            chunked.draw()

    except Exception as e:
        print(str(e))

process_content()

If you get particular enough, you may find that you are better off chunking everything, with the exception of some stuff. This process is known as chinking, and that's what we're going to be covering next.
