An introductory tutorial on the use of some natural language tools in Python


NLTK is an excellent tool both for teaching computational linguistics with Python and for doing practical work in the field. Computational linguistics, in turn, is closely related to artificial intelligence, language and speech recognition, translation, and grammar checking.
What does NLTK include?

NLTK is naturally thought of as a series of layers, each built on top of the ones below it. For readers familiar with lexing and parsing artificial languages (such as Python), it is not too hard to understand the analogous, if more esoteric, layers of a natural language model.
Terminology List

Corpus (plural corpora): a collection of related texts. For example, the works of Shakespeare might collectively be called a corpus, while the works of several authors would be corpora.

Histogram: the statistical distribution of the frequency of different words, letters, or other items in a data set.

Syntagmatics: the study of syntagma, that is, the statistical relationships in the sequential occurrence of letters, words, or phrases in a corpus.

Context-free grammar: the second category in Noam Chomsky's hierarchy of four classes of formal grammars. See Resources for a detailed description.

Although NLTK ships with many corpora that have already been preprocessed (often manually) to varying degrees, conceptually each layer relies on the processing in the adjacent, lower layer. Tokenization comes first; then words are tagged; then groups of words are parsed into grammatical elements such as noun phrases or sentences (according to one of several techniques, each with advantages and drawbacks); and finally sentences or other grammatical units are classified. Along the way, NLTK lets you generate statistics about the occurrences of the various elements, and draw graphs that describe either the processing itself or the statistical aggregates.
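
As a rough orientation, here is a minimal sketch of that layered pipeline under a modern NLTK (3.x) install; the function names below differ from the older API used in the listings that follow, and the example assumes the 'punkt' and 'averaged_perceptron_tagger' data packages are available.

import nltk

text = "The little cat sat on the mat."
tokens = nltk.word_tokenize(text)                      # tokenization
tagged = nltk.pos_tag(tokens)                          # part-of-speech tagging
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>}")  # a one-rule chunker
print(chunker.parse(tagged))                           # noun-phrase chunks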

In this article you will see some fairly complete examples of the low-level capabilities, while most of the high-level capabilities will simply be described abstractly. Let's now look in detail at the first steps of text processing.

Tokenization

Much of what you can do with NLTK, particularly at the low level, is not very different from what you can do with Python's basic data structures. But NLTK provides a set of systematic interfaces that the higher layers depend on and use, rather than simply providing utility classes for handling tokenized or tagged text.

Specifically, the nltk.tokenizer.Token class is used widely to store annotated fragments of text; the annotations can mark many different features, including parts of speech, subtoken structure, a token's offset position within a larger text, morphological stems, grammatical sentence constituents, and so on. In fact, a Token is a special kind of dictionary, and is accessed like one, so it can hold whatever keys you like. A few special keys are used within NLTK, with different keys used by the different subpackages.

Let's briefly look at how to create a token and break it into subtokens:
Listing 1. A first look at the nltk.tokenizer.Token class

>>> from nltk.tokenizer import *
>>> t = Token(TEXT='This is my first test sentence')
>>> WSTokenizer().tokenize(t, addlocs=True)  # break on whitespace
>>> print t['TEXT']
This is my first test sentence
>>> print t['SUBTOKENS']
[<This>@[0:4c], <is>@[5:7c], <my>@[8:10c], <first>@[11:16c],
<test>@[17:21c], <sentence>@[22:30c]]
>>> t['foo'] = 'bar'
>>> t
<TEXT='This is my first test sentence', foo='bar',
SUBTOKENS=[<This>@[0:4c], <is>@[5:7c], <my>@[8:10c], <first>@[11:16c],
<test>@[17:21c], <sentence>@[22:30c]]>
>>> print t['SUBTOKENS'][0]
<This>@[0:4c]
>>> print type(t['SUBTOKENS'][0])
<class 'nltk.token.SafeToken'>
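
For comparison, here is a minimal sketch of the same idea under a modern NLTK (3.x) install, where plain lists of strings replace the old Token/SUBTOKENS objects:

from nltk.tokenize import WhitespaceTokenizer

t = "This is my first test sentence"
tokenizer = WhitespaceTokenizer()
print(tokenizer.tokenize(t))             # ['This', 'is', 'my', ...]
print(list(tokenizer.span_tokenize(t)))  # (start, end) offsets, like @[0:4c] above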

Probability

One fairly simple thing you might want to do with a corpus is analyze the frequency distributions of various events in it, and make probability predictions based on those known frequency distributions. NLTK supports a variety of methods for probability prediction based on natural frequency-distribution data. I will not introduce those methods here (see the probability tutorial listed in Resources), other than to say that the relationship between what you would naively expect and what those methods do is somewhat fuzzier than mere scaling/normalization.

Basically, NLTK supports two kinds of frequency distribution: histograms and conditional frequency distributions. The nltk.probability.FreqDist class is used to create histograms; for example, a word histogram can be created like this:
Listing 2. Using nltk.probability.FreqDist to create a basic histogram

>>> from nltk.probability import *
>>> article = Token(TEXT=open('cp-b17.txt').read())
>>> WSTokenizer().tokenize(article)
>>> freq = FreqDist()
>>> for word in article['SUBTOKENS']:
...     freq.inc(word['TEXT'])
...
>>> freq.B()
1194
>>> freq.count('Python')
12
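
For comparison, a rough equivalent under a modern NLTK (3.x) install, where FreqDist behaves much like a Counter; 'cp-b17.txt' is the article's sample file, so substitute any text file you have at hand.

import nltk

words = nltk.word_tokenize(open('cp-b17.txt').read())
freq = nltk.FreqDist(words)
print(freq.B())              # number of distinct samples (bins)
print(freq['Python'])        # count of one word
print(freq.most_common(10))  # the ten most frequent tokens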

The probability tutorial discusses creating histograms over more complex features, such as "the length of words following a word that ends in a vowel." The nltk.draw.plot.Plot class can be used to display histograms visually. Of course, you can equally well analyze frequency distributions of higher-level grammatical features, or even of data sets unrelated to NLTK.

A conditional frequency distribution can be more interesting than a plain histogram. A conditional frequency distribution is a kind of two-dimensional histogram: it gives you one histogram for each initial condition, or "context." For example, the tutorial raises the question of the distribution of word lengths for each initial letter. We can analyze that like this:
Listing 3. Conditional frequency distribution: word lengths for each initial letter

>>> cf = ConditionalFreqDist()
>>> for word in article['SUBTOKENS']:
...     cf[word['TEXT'][0]].inc(len(word['TEXT']))
...
>>> init_letters = cf.conditions()
>>> init_letters.sort()
>>> for c in init_letters[44:50]:
...     print "Init %s:" % c,
...     for length in range(1, 6):
...         print "len %d/%.2f," % (length, cf[c].freq(length)),
...     print
...
Init a: len 1/0.03, len 2/0.03, len 3/0.03, len 4/0.03, len 5/0.03,
Init b: len 1/0.12, len 2/0.12, len 3/0.12, len 4/0.12, len 5/0.12,
Init c: len 1/0.06, len 2/0.06, len 3/0.06, len 4/0.06, len 5/0.06,
Init d: len 1/0.06, len 2/0.06, len 3/0.06, len 4/0.06, len 5/0.06,
Init e: len 1/0.18, len 2/0.18, len 3/0.18, len 4/0.18, len 5/0.18,
Init f: len 1/0.25, len 2/0.25, len 3/0.25, len 4/0.25, len 5/0.25,

An excellent application of conditional frequency distributions to language is analyzing syntagmatic distributions in a corpus: for example, given a particular word, which word is most likely to come next. Grammar imposes some constraints here, of course; still, the study of how syntactic options are chosen belongs to the fields of semantics, pragmatics, and terminology.
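
As a small sketch of that "which word comes next" idea, assuming a modern NLTK (3.x) install, you can condition on the current word and count the word that follows it:

import nltk

words = nltk.word_tokenize(open('cp-b17.txt').read().lower())
cfd = nltk.ConditionalFreqDist(nltk.bigrams(words))
print(cfd['the'].most_common(5))   # the most likely successors of "the"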

Stemming

The nltk.stemmer.porter.PorterStemmer class is an extremely convenient tool for obtaining the morphological stems of English words. This capability particularly excites me, because I previously used Python to create a general, full-text indexed search tool/library (described in Developing a full-text indexer in Python, and since used in quite a few other projects).

Although the ability to search a large collection of documents for a specific set of words is very useful (and is what gnosis.indexer does), a little fuzziness helps for many search purposes. You may not be entirely sure whether the e-mail you are looking for used the word "complicated," "complications," "complicating," or "complicates," but you remember that something along those lines was involved (probably together with some other words, to make a worthwhile search).

NLTK includes an excellent algorithm for extracting word stems, and it lets you customize the stemming algorithm to your liking:
Listing 4. Extracting the morphological roots of words

>>> from nltk.stemmer.porter import PorterStemmer
>>> PorterStemmer().stem_word('complications')
'complic'
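
The same call under a modern NLTK (3.x) install, where the class lives in nltk.stem and the method is stem() rather than stem_word():

from nltk.stem import PorterStemmer

print(PorterStemmer().stem('complications'))   # 'complic'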

Exactly how you might take advantage of the stemming capability within gnosis.indexer, its derivatives, or a completely different indexing tool depends on your usage scenario. Fortunately, gnosis.indexer has an open interface that is easy to customize. Do you need an index composed entirely of stems? Or do you want both full words and stems in the index? Do you need to separate stemmed matches from exact matches in the results? I will introduce some kind of stemming capability in a future version of gnosis.indexer, but end users may still want to customize it differently.

In any case, adding stemming is in general quite simple: first, derive stems from the document by specializing gnosis.indexer.TextSplitter; then, when performing a search, (optionally) stem the search terms before using them for the index lookup, perhaps by customizing your MyIndexer.find() method, as sketched below.
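
Here is a purely hypothetical sketch of that arrangement; build_index() and find() below are illustrative stand-ins and not gnosis.indexer's actual API.

from collections import defaultdict
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

_stem = PorterStemmer().stem
_words = RegexpTokenizer(r"[A-Za-z]+").tokenize   # a crude word splitter

def build_index(docs):
    """Map each stem to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in _words(text.lower()):
            index[_stem(word)].add(doc_id)
    return index

def find(index, query):
    """Return ids of documents containing all (stemmed) query terms."""
    hits = [index.get(_stem(w), set()) for w in _words(query.lower())]
    return set.intersection(*hits) if hits else set()

docs = {1: "The complications were complicated.",
        2: "Nothing complicated here."}
index = build_index(docs)
print(find(index, "complicates"))   # {1, 2}, thanks to stemming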

In using PorterStemmer, I found that the nltk.tokenizer.WSTokenizer class really is as poor as the tutorial warns. It is fine in a conceptual role, but for real-world texts you need a better idea of what a "word" is. Fortunately, gnosis.indexer.TextSplitter is a robust tokenizer. For example:
Listing 5. Stemming based on NLTK's poor tokenizer

>>> from nltk.tokenizer import *
>>> article = Token(TEXT=open('cp-b17.txt').read())
>>> WSTokenizer().tokenize(article)
>>> from nltk.probability import *
>>> from nltk.stemmer.porter import *
>>> stemmer = PorterStemmer()
>>> stems = FreqDist()
>>> for word in article['SUBTOKENS']:
...     stemmer.stem(word)
...     stems.inc(word['STEM'].lower())
...
>>> word_stems = stems.samples()
>>> word_stems.sort()
>>> word_stems[20:40]
['"generator-bas', '"implement', '"lazili', '"magic', '"partial',
 '"pluggable"', '"primitives"', '"repres', '"secur', '"semi-coroutines."',
 '"state', '"understand', '"weightless', '"whatev', '#', '#-----',
 '#----------', '#-------------', '#---------------', '#b17:']

Looking over a few of the stems, not all of them look usable for indexing. Many are not real words at all; others are runs of dashes, and some words have stray punctuation attached. Let's try again with a better tokenizer:
Listing 6. Stemming based on the tokenizer's clever heuristics

>>> article = TS().text_splitter(open('cp-b17.txt').read())  # TS is gnosis.indexer.TextSplitter
>>> stems = FreqDist()
>>> for word in article:
...     stems.inc(stemmer.stem_word(word.lower()))
...
>>> word_stems = stems.samples()
>>> word_stems.sort()
>>> word_stems[60:80]
['bool', 'both', 'boundari', 'brain', 'bring', 'built', 'but', 'byte',
 'call', 'can', 'cannot', 'capabl', 'capit', 'carri', 'case', 'cast',
 'certain', 'certainli', 'chang', 'charm']

Here you can see that several words have multiple possible expansions, and that all of the entries look like words or morphemes. The choice of tokenizer matters a great deal for arbitrary text collections. To be fair, the corpora bundled with NLTK have been packaged so that WSTokenizer() serves as an easy and accurate tokenizer; but to build a robust, practically usable indexer, you need a robust tokenizer.
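
A quick way to see the difference on messy text, assuming a modern NLTK (3.x) install: naive whitespace splitting leaves punctuation glued to the words, while a regexp-based tokenizer keeps only clean alphabetic runs.

from nltk.tokenize import RegexpTokenizer

text = '"Generator-based state machines," he said. #b17:'
print(text.split())                                  # punctuation stays glued to words
print(RegexpTokenizer(r"[A-Za-z]+").tokenize(text))  # clean alphabetic tokens only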

Tagging, chunking, and parsing

The largest portion of NLTK consists of parsers of varying levels of sophistication. For the most part this introduction will not explain their details, but I would like to give a rough idea of what they are meant to accomplish.

Remember the earlier point that tokens are special dictionaries, in particular dictionaries that can contain a TAG key to indicate the grammatical role of a word. NLTK corpus documents often come with a portion already tagged for parts of speech, but you can certainly add your own tags to untagged documents.
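
Under a modern NLTK (3.x) install, a tagger is bundled with the library (the 'averaged_perceptron_tagger' data package), so you can tag fresh text directly rather than relying on a pre-tagged corpus; a minimal sketch:

import nltk

tokens = nltk.word_tokenize("The little cat sat on the mat")
print(nltk.pos_tag(tokens))   # e.g. [('The', 'DT'), ('little', 'JJ'), ('cat', 'NN'), ...]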

Chunking is something like "parsing lite." That is, chunking works either from existing markup of grammatical components, or from markup you add manually or semi-automatically using regular expressions and program logic. But it is not really parsing, properly speaking (there are no production rules as such). For example:
Listing 7. Chunk parsing/tagging: words and larger units

>>> from nltk.parser.chunk import ChunkedTaggedTokenizer
>>> chunked = "[ the/DT little/JJ cat/NN ] sat/VBD on/IN [ the/DT mat/NN ]"
>>> sentence = Token(TEXT=chunked)
>>> tokenizer = ChunkedTaggedTokenizer(chunk_node='NP')
>>> tokenizer.tokenize(sentence)
>>> sentence['SUBTOKENS'][0]
(NP: <the/DT> <little/JJ> <cat/NN>)
>>> sentence['SUBTOKENS'][0]['NODE']
'NP'
>>> sentence['SUBTOKENS'][0]['CHILDREN'][0]
<the/DT>
>>> sentence['SUBTOKENS'][0]['CHILDREN'][0]['TAG']
'DT'
>>> chunk_structure = TreeToken(NODE='S', CHILDREN=sentence['SUBTOKENS'])
>>> chunk_structure
(S:
 (NP: <the/DT> <little/JJ> <cat/NN>)
 <sat/VBD>
 <on/IN>
 (NP: <the/DT> <mat/NN>))

The chunking just described can be done with the RegexpChunkParser class, using pseudo regular expressions that describe a series of tags making up a grammatical element. Here is an example from the tutorial:
Listing 8. Chunking with regular expressions over tags

>>> rule1 = ChunkRule('<DT>?<JJ.*>*<NN.*>',
...                   'Chunk optional det, zero or more adj, and a noun')
>>> chunkparser = RegexpChunkParser([rule1], chunk_node='NP', top_node='S')
>>> chunkparser.parse(sentence)
>>> print sentence['TREE']
(S: (NP: <the/DT> <little/JJ> <cat/NN>)
 <sat/VBD> <on/IN>
 (NP: <the/DT> <mat/NN>))
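
The same rule can be expressed under a modern NLTK (3.x) install, where nltk.RegexpParser plays this role; a minimal sketch:

import nltk

tagged = [('the', 'DT'), ('little', 'JJ'), ('cat', 'NN'),
          ('sat', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]
chunker = nltk.RegexpParser("NP: {<DT>?<JJ.*>*<NN.*>}")
print(chunker.parse(tagged))
# roughly: (S (NP the/DT little/JJ cat/NN) sat/VBD on/IN (NP the/DT mat/NN))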

Real parsing leads us into many theoretical thickets. For example, a top-down parser is guaranteed to find every possible production, but it can be very slow because it backtracks frequently (at many levels). Shift-reduce parsing is much more efficient, but can miss some productions. In either case, grammar rules are declared in much the same way as grammars for parsing artificial languages. This column has covered some of these: SimpleParse, mx.TextTools, Spark, and gnosis.xml.validity (see Resources).

Beyond the top-down and shift-reduce parsers, NLTK also provides "chart parsers," which build partial hypotheses about how a given sequence might go on to satisfy a rule. This approach can be both efficient and complete. A quick (toy-sized) example:
Listing 9. Defining basic productions for a context-free grammar

>>> from nltk.parser.chart import *
>>> grammar = cfg.parse('''
...    S -> NP VP
...    VP -> V NP | VP PP
...    V -> "saw" | "ate"
...    NP -> "John" | "Mary" | "Bob" | Det N | NP PP
...    Det -> "a" | "an" | "the" | "my"
...    N -> "dog" | "cat" | "cookie"
...    PP -> P NP
...    P -> "on" | "by" | "with"
...    ''')
>>> sentence = Token(TEXT='John saw a cat with my cookie')
>>> WSTokenizer().tokenize(sentence)
>>> parser = ChartParser(grammar, BU_STRATEGY, LEAF='TEXT')
>>> parser.parse_n(sentence)
>>> for tree in sentence['TREES']: print tree
(S:
 (NP: <John>)
 (VP:
  (VP: (V: <saw>) (NP: (Det: <a>) (N: <cat>)))
  (PP: (P: <with>) (NP: (Det: <my>) (N: <cookie>)))))
(S:
 (NP: <John>)
 (VP:
  (V: <saw>)
  (NP:
   (NP: (Det: <a>) (N: <cat>))
   (PP: (P: <with>) (NP: (Det: <my>) (N: <cookie>))))))
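
The same toy grammar under a modern NLTK (3.x) install, where CFG.fromstring and ChartParser replace cfg.parse and the BU_STRATEGY argument; a minimal sketch:

import nltk

grammar = nltk.CFG.fromstring('''
    S -> NP VP
    VP -> V NP | VP PP
    V -> "saw" | "ate"
    NP -> "John" | "Mary" | "Bob" | Det N | NP PP
    Det -> "a" | "an" | "the" | "my"
    N -> "dog" | "cat" | "cookie"
    PP -> P NP
    P -> "on" | "by" | "with"
    ''')
parser = nltk.ChartParser(grammar)
for tree in parser.parse("John saw a cat with my cookie".split()):
    print(tree)   # both parse trees are reported, as above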

A probabilistic context-free grammar (PCFG) is a context-free grammar that associates a probability with each of its productions. Parsers for probabilistic parsing are likewise bundled with NLTK.
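
A tiny PCFG sketch, assuming a modern NLTK (3.x) install; the grammar and its probabilities are invented for illustration, and the probabilities of each nonterminal's alternatives must sum to 1.0.

import nltk

pcfg = nltk.PCFG.fromstring('''
    S -> NP VP         [1.0]
    NP -> "John" [0.5] | Det N [0.5]
    Det -> "the"       [1.0]
    N -> "cat"         [1.0]
    VP -> V NP         [1.0]
    V -> "saw"         [1.0]
    ''')
parser = nltk.ViterbiParser(pcfg)
for tree in parser.parse("John saw the cat".split()):
    print(tree, tree.prob())   # the most probable parse and its probability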

What are you waiting for?

NLTK has other important capabilities that this brief introduction could not get into. For example, NLTK has a whole framework for text classification, using statistical techniques such as "naive Bayes" and "maximum entropy." Even if I had the space, I could not yet explain its essence. But I believe that even NLTK's lower layers make it a practical framework for both instructional and practical applications.
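
As a small taste of that framework, here is a toy classification sketch assuming a modern NLTK (3.x) install; the feature dictionaries and labels below are invented for illustration:

import nltk

train = [({'contains_python': True,  'contains_cat': False}, 'programming'),
         ({'contains_python': False, 'contains_cat': True},  'pets'),
         ({'contains_python': True,  'contains_cat': False}, 'programming')]
classifier = nltk.NaiveBayesClassifier.train(train)
print(classifier.classify({'contains_python': True, 'contains_cat': False}))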
