Natural Language 16: Chunking with NLTK

Source: Internet
Author: User
Tags: nltk

Chunking with NLTK




Now that we know the parts of speech, we can do what is called chunking: grouping words into hopefully meaningful chunks. One of the main goals of chunking is to group words into what are known as "noun phrases." These are phrases of one or more words that contain a noun, maybe some descriptive words, maybe a verb, and maybe something like an adverb. The idea is to group nouns together with the words that are in relation to them.

In order to chunk, we combine the part-of-speech tags with regular expressions. From regular expressions, we are mainly going to utilize the following:

+ = match 1 or more repetitions
? = match 0 or 1 repetitions
* = match 0 or more repetitions
. = any character except a new line

See the tutorial linked above if you need help with regular expressions. The last things to note are that the part-of-speech tags are denoted with '<' and '>', and that we can also place regular expressions within the tags themselves, to account for things like "all nouns" (<N.*>).
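As a quick sanity check of that <N.*> idea, here is a minimal sketch on a hand-tagged sentence (the tokens and tags are invented for illustration, so no corpora or trained taggers are needed):

```python
import nltk

# Hand-tagged tokens; tags are supplied directly for illustration
tagged = [("Two", "CD"), ("dogs", "NNS"), ("chased", "VBD"), ("Rex", "NNP")]

# <N.*> matches any noun tag: NN, NNS, NNP, NNPS...
grammar = r"""Nouns: {<N.*>+}"""
result = nltk.RegexpParser(grammar).parse(tagged)
print(result)
```

Here "dogs" (NNS) and "Rex" (NNP) each come out wrapped in a Nouns subtree, while the cardinal number and the verb are left unchunked.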

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            chunked.draw()

    except Exception as e:
        print(str(e))

process_content()

The result of this is a tree diagram for each sentence, drawn in a new window by chunked.draw().

The main line here in question is:

chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""

This line, broken down:

<RB.?>* = "0 or more of any tense of adverb," followed by:

<VB.?>* = "0 or more of any tense of verb," followed by:

<NNP>+ = "one or more proper nouns," followed by:

<NN>? = "zero or one singular noun."
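To see those four pieces working together, here is a minimal sketch on a hand-tagged sentence (the words and tags are invented for illustration, so nothing needs to be downloaded):

```python
import nltk

# Hand-tagged toy sentence: adverb, verb, three proper nouns, singular noun
tagged = [("quickly", "RB"), ("meet", "VB"),
          ("President", "NNP"), ("George", "NNP"), ("Bush", "NNP"),
          ("today", "NN")]

chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
chunkParser = nltk.RegexpParser(chunkGram)
chunked = chunkParser.parse(tagged)
print(chunked)
```

All six tokens satisfy the pattern in order, so they are grouped into a single Chunk subtree.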

Try playing around with combinations to group various instances until you feel comfortable with chunking.

Not covered in the video, but also a reasonable task, is to actually access the chunks specifically. This is something rarely talked about, but it can be an essential step depending on what you're doing. Say you print the chunks out; you are going to see output like:

(S
  (Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
  'S/POS
  (Chunk
    ADDRESS/NNP
    BEFORE/NNP
    A/NNP
    JOINT/NNP
    SESSION/NNP
    OF/NNP
    THE/NNP
    CONGRESS/NNP
    ON/NNP
    THE/NNP
    STATE/NNP
    OF/NNP
    THE/NNP
    UNION/NNP
    JANUARY/NNP)
  31/CD
  ,/,
  2006/CD
  THE/DT
  (Chunk PRESIDENT/NNP)
  :/:
  (Chunk THANK/NNP)
  YOU/PRP
  ALL/DT
  ./.)

Cool, that helps us visually, but what if we want to access this data via our program? Well, what is happening here is that our "chunked" variable is an NLTK tree. Each "chunk" and "non chunk" is a "subtree" of the tree. We can reference these with chunked.subtrees, and then iterate through the subtrees like so:

for subtree in chunked.subtrees():
    print(subtree)

Next, we might be interested in getting just the chunks, ignoring the rest. We can use the filter parameter in the chunked.subtrees() call:

for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
    print(subtree)

Now we're filtering to show only the subtrees with the label of "Chunk." Keep in mind, this isn't "Chunk" as in some NLTK attribute; it is "Chunk" literally because that's the label we gave it here: chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""

Had we written instead something like chunkGram = r"""Pythons: {<RB.?>*<VB.?>*<NNP>+<NN>?}""", then we would filter by the label of "Pythons." The result here should be something like:

(Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
(Chunk
  ADDRESS/NNP
  BEFORE/NNP
  A/NNP
  JOINT/NNP
  SESSION/NNP
  OF/NNP
  THE/NNP
  CONGRESS/NNP
  ON/NNP
  THE/NNP
  STATE/NNP
  OF/NNP
  THE/NNP
  UNION/NNP
  JANUARY/NNP)
(Chunk PRESIDENT/NNP)
(Chunk THANK/NNP)
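If you want the chunk text itself rather than printed subtrees, each subtree's leaves are (word, tag) pairs that you can join back into a string. A minimal sketch, using a small hand-tagged sentence in place of the State of the Union data (the tokens and tags are invented for illustration):

```python
import nltk

# Hand-tagged stand-in sentence, chunked with the same grammar as above
tagged = [("PRESIDENT", "NNP"), ("GEORGE", "NNP"), ("W.", "NNP"), ("BUSH", "NNP"),
          ("spoke", "VBD"), ("to", "TO"), ("the", "DT"), ("CONGRESS", "NNP")]
chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
chunked = nltk.RegexpParser(chunkGram).parse(tagged)

# Each subtree's leaves are (word, tag) pairs; join the words to get the phrase text
phrases = [" ".join(word for word, tag in subtree.leaves())
           for subtree in chunked.subtrees(filter=lambda t: t.label() == "Chunk")]
print(phrases)  # → ['PRESIDENT GEORGE W. BUSH', 'CONGRESS']
```

This is often the form you actually want downstream, since plain strings are easier to count, deduplicate, or feed into further processing than Tree objects.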

Full code for this would be:

import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            print(chunked)
            for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                print(subtree)
            chunked.draw()

    except Exception as e:
        print(str(e))

process_content()

If you get particular enough, you may find that you are better off chunking everything, with the exception of some stuff. This process is known as chinking, and that's what we're going to be covering next.
