Getting started with natural language tools in Python

NLTK is an excellent tool for teaching, and for practicing, computational linguistics with Python. Computational linguistics, in turn, is closely related to artificial intelligence, language and speech recognition, machine translation, and grammar checking.
What does NLTK include?

NLTK is naturally viewed as a series of layers stacked on top of one another. Readers familiar with the grammar and parsing of artificial languages such as Python should have little difficulty understanding the similar, though more esoteric, layers of a natural language model.
List of terms

Corpora: collections of related texts. For example, Shakespeare's works might collectively be called a corpus, while the works of a number of authors together form corpora.

Histogram: the statistical distribution of the frequency of different words, letters, or other items in a data set.

Syntagmatic: the study of syntagma, that is, of the statistical relationships among letters, words, or phrases that appear consecutively in a corpus.

Context-free grammar: the second class in Noam Chomsky's hierarchy of four types of formal grammars. See Resources for a detailed description.

Although NLTK comes with many corpora that have been preprocessed (often manually) to varying degrees, conceptually each layer relies on the processing in the adjacent, lower layer. Text is first tokenized; tokens are then tagged; groups of tokens are then parsed into grammatical elements such as noun phrases or sentences (using one of several techniques, each with its own advantages and drawbacks); and finally sentences or other grammatical units are classified. Along the way, NLTK lets you generate statistics about the occurrence of the various elements, and draw graphs that describe either the processing itself or the aggregated statistical results.
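To make the layering concrete, here is a minimal sketch of the first few layers as they look in current NLTK releases (an assumption about your install; the tokenizer and tagger models must first be fetched with nltk.download(), and the listings later in this article use the older 1.x API):

import nltk   # assumes a current NLTK 3.x install with tokenizer/tagger data downloaded

text = "The little cat sat on the mat while the big dog slept"
words = nltk.word_tokenize(text)      # lower layer: break the text into word tokens
tagged = nltk.pos_tag(words)          # next layer: attach a part-of-speech tag to each token
print(tagged)
print(nltk.FreqDist(tag for word, tag in tagged).most_common(3))   # statistics over the tags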

In this article you will see some fairly complete examples of the low-level capabilities, while most of the higher-level capabilities will be described only abstractly. Let's now look closely at the first steps of text processing.

Tokenization

Much of what you can do with NLTK, particularly at the low level, is not very different from what you could do with Python's basic data structures. But NLTK provides a set of systematic interfaces that the higher layers depend on and use, rather than merely handy classes for handling tokenized or tagged text.

Specifically, the nltk.tokenizer.Token class is widely used to store annotated fragments of text; these annotations can mark many different features, including parts of speech, subtoken structure, a token's offset within the larger text, morphological stems, grammatical sentence constituents, and so on. A Token is in fact a special kind of dictionary, and it is accessed as a dictionary, so it can hold whatever keys you like. A few special keys are used in NLTK, and different keys are used by the different subpackages.

Let's briefly examine how to create a token and break it into subtokens:
Listing 1. A first look at the nltk.tokenizer.Token class

>>> from nltk.tokenizer import *
>>> t = Token(TEXT='This is my first test sentence')
>>> WSTokenizer().tokenize(t, addlocs=True)  # break on whitespace
>>> print t['TEXT']
This is my first test sentence
>>> print t['SUBTOKENS']
[<This>@[0:4c], <is>@[5:7c], <my>@[8:10c], <first>@[11:16c],
 <test>@[17:21c], <sentence>@[22:30c]]
>>> t['foo'] = 'bar'
>>> t
<TEXT='This is my first test sentence', foo='bar',
 SUBTOKENS=[<This>@[0:4c], <is>@[5:7c], <my>@[8:10c], <first>@[11:16c],
 <test>@[17:21c], <sentence>@[22:30c]]>
>>> print t['SUBTOKENS'][0]
<This>@[0:4c]
>>> print type(t['SUBTOKENS'][0])
<class 'nltk.token.SafeToken'>
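The nltk.tokenizer interface shown above comes from the old NLTK 1.x series and no longer exists in current releases of the toolkit. As a hedged sketch only, assuming a modern NLTK 3.x installation, roughly the same whitespace tokenization with character offsets looks like this:

from nltk.tokenize import WhitespaceTokenizer   # current-API counterpart of WSTokenizer

text = 'This is my first test sentence'
tok = WhitespaceTokenizer()
print(tok.tokenize(text))             # ['This', 'is', 'my', 'first', 'test', 'sentence']
print(list(tok.span_tokenize(text)))  # character offsets: [(0, 4), (5, 7), (8, 10), ...]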

Probability

One fairly simple thing you might want to do with a linguistic corpus is analyze the frequency distributions of events in it, and make probability predictions based on those known frequency distributions. NLTK supports a variety of methods for making probability predictions from raw frequency data. I will not cover them here (see the probability tutorial listed in Resources), other than to note that the relationship between what you would confidently expect and what you already know involves more than the obvious scaling and normalization.

Basically, NLTK supports two types of frequency distribution: histograms and conditional frequency distributions. The nltk.probability.FreqDist class is used to create histograms; for example, a word histogram can be created like this:
Listing 2. Creating a basic histogram with nltk.probability.FreqDist

>>> from nltk.probability import *
>>> article = Token(TEXT=open('cp-b17.txt').read())
>>> WSTokenizer().tokenize(article)
>>> freq = FreqDist()
>>> for word in article['SUBTOKENS']:
...     freq.inc(word['TEXT'])
>>> freq.B()
1194
>>> freq.count('Python')
12
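Readers on a current NLTK release will not find FreqDist.inc() or Token; as a hedged sketch, assuming NLTK 3.x and the same cp-b17.txt sample file, the equivalent histogram is:

from nltk import FreqDist
from nltk.tokenize import WhitespaceTokenizer   # assumes a current NLTK 3.x install

words = WhitespaceTokenizer().tokenize(open('cp-b17.txt').read())
freq = FreqDist(words)        # counting happens in the constructor; there is no .inc()
print(freq.B())               # number of distinct samples (bins)
print(freq['Python'])         # count for a single word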

The probability tutorial discusses creating histograms over more complex features, such as "the length of words following a word that ends in a vowel." The nltk.draw.plot.Plot class is useful for visualizing histograms. Of course, you can just as well analyze the frequency distribution of higher-level grammatical features, or even of data sets unrelated to NLTK.

A conditional frequency distribution may be more interesting than an ordinary histogram. It is a kind of two-dimensional histogram: it gives you one histogram per initial condition, or "context." For example, the tutorial poses the question of the word-length distribution for each initial letter. Here is how we analyze that:
Listing 3. Conditional frequency distributions: word lengths for each initial letter

>>> cf = ConditionalFreqDist()
>>> for word in article['SUBTOKENS']:
...     cf[word['TEXT'][0]].inc(len(word['TEXT']))
...
>>> init_letters = cf.conditions()
>>> init_letters.sort()
>>> for c in init_letters[44:50]:
...     print "Init %s:" % c,
...     for length in range(1,6):
...         print "len %d/%.2f," % (length, cf[c].freq(length)),
...     print
...
Init a: len 1/0.03, len 2/0.03, len 3/0.03, len 4/0.03, len 5/0.03,
Init b: len 1/0.12, len 2/0.12, len 3/0.12, len 4/0.12, len 5/0.12,
Init c: len 1/0.06, len 2/0.06, len 3/0.06, len 4/0.06, len 5/0.06,
Init d: len 1/0.06, len 2/0.06, len 3/0.06, len 4/0.06, len 5/0.06,
Init e: len 1/0.18, len 2/0.18, len 3/0.18, len 4/0.18, len 5/0.18,
Init f: len 1/0.25, len 2/0.25, len 3/0.25, len 4/0.25, len 5/0.25,
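Under the current API (again an assumption that you are on NLTK 3.x rather than the 1.x interface used above), the same analysis can be written by feeding (condition, sample) pairs straight to the ConditionalFreqDist constructor:

from nltk import ConditionalFreqDist
from nltk.tokenize import WhitespaceTokenizer

words = WhitespaceTokenizer().tokenize(open('cp-b17.txt').read())
cfd = ConditionalFreqDist((w[0], len(w)) for w in words)   # condition = initial letter, sample = word length
for letter in sorted(cfd.conditions())[:6]:
    print(letter, [(n, round(cfd[letter].freq(n), 2)) for n in range(1, 6)])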

An excellent linguistic application of conditional frequency distributions is analyzing syntagmatic distributions in a corpus; for example, given the occurrence of a particular word, which word is most likely to come next. Grammar imposes some constraints here, of course; but the study of selection among syntactic options belongs to semantics, pragmatics, and register.
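As a small, hedged sketch of that syntagmatic idea (assuming a current NLTK 3.x install and the cp-b17.txt sample used earlier), you can condition on a word and look at the distribution of the word that follows it:

from nltk import ConditionalFreqDist, bigrams
from nltk.tokenize import WhitespaceTokenizer

words = WhitespaceTokenizer().tokenize(open('cp-b17.txt').read())
cfd = ConditionalFreqDist(bigrams(words))     # condition = a word, sample = the next word
print(cfd['Python'].most_common(3))           # the words most likely to follow 'Python'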

Stemming

The nltk.stemmer.porter.PorterStemmer class is an extremely handy tool for deriving grammatical (prefix) stems from English words. This capability is especially exciting to me, because I previously created a general-purpose full-text indexed search tool/library in Python (described in Developing a full-text indexer in Python, and used in quite a few other projects).

While the ability to search a large collection of documents for a set of exact words is very useful (this is what gnosis.indexer does), a little fuzziness helps for many searches. You may not be quite sure whether the e-mail you are looking for used the word "complicated", "complications", "complicating", or "complicates", but you remember it dealt with that concept (probably in combination with some other words that would make for a worthwhile search).

NLTK includes an excellent algorithm for word stemming, and lets you customize stemming algorithms to your liking:
Listing 4. Stemming words down to their morphological roots

>>> from nltk.stemmer.porter import PorterStemmer
>>> PorterStemmer().stem_word('complications')
'complic'
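In current NLTK releases (an assumption about your install) the stemmer lives in nltk.stem.porter and the method is stem() rather than stem_word(), but the result is the same:

from nltk.stem.porter import PorterStemmer

print(PorterStemmer().stem('complications'))   # 'complic'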

How you might take advantage of stemming within gnosis.indexer, a derivative of it, or an entirely different indexing tool depends on your usage scenario. Fortunately, gnosis.indexer has an open interface that is easy to customize. Do you need an index composed entirely of stems? Should both full words and stems go into the index? Do you need to separate stemmed matches in the results from exact matches? In a future version of gnosis.indexer I will introduce some kinds of stemming capability, but end users may still want to customize it differently.

In any case, adding stemming is generally quite simple: first, derive stems from a document by specializing gnosis.indexer.TextSplitter; then, when you search, (optionally) stem the search terms before using them for index lookups, perhaps by customizing your MyIndexer.find() method.
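The gnosis.indexer specifics are beyond this article, but the shape of the idea is easy to sketch. The snippet below is purely illustrative (add_document() and find() are invented names, not gnosis.indexer's API): it keys a toy index on Porter stems and stems the query the same way.

from collections import defaultdict
from nltk.stem.porter import PorterStemmer   # assumes a current NLTK install

stemmer = PorterStemmer()
index = defaultdict(set)                     # stem -> ids of documents containing it

def add_document(doc_id, text):
    for word in text.split():
        index[stemmer.stem(word.lower())].add(doc_id)

def find(term):
    # stem the search term so 'complicates' matches 'complications', etc.
    return index[stemmer.stem(term.lower())]

add_document('mail-42', 'all those complications were complicating the search')
print(find('complicates'))                   # {'mail-42'}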

While using PorterStemmer I found that the nltk.tokenizer.WSTokenizer class really is as rough as the tutorial warns. It is fine in a conceptual role, but for real-world text you can do much better at identifying what counts as a "word." Fortunately, gnosis.indexer.TextSplitter is such a robust word-breaking tool. For example:
Listing 5. Stemming based on NLTK's crude tokenizer

>>> from nltk.tokenizer import *
>>> article = Token(TEXT=open('cp-b17.txt').read())
>>> WSTokenizer().tokenize(article)
>>> from nltk.probability import *
>>> from nltk.stemmer.porter import *
>>> stemmer = PorterStemmer()
>>> stems = FreqDist()
>>> for word in article['SUBTOKENS']:
...     stemmer.stem(word)
...     stems.inc(word['STEM'].lower())
...
>>> word_stems = stems.samples()
>>> word_stems.sort()
>>> word_stems[20:40]
['"generator-bas', '"implement', '"lazili', '"magic"', '"partial',
 '"pluggable"', '"primitives"', '"repres', '"secur', '"semi-coroutines."',
 '"state', '"understand', '"weightless', '"whatev', '#', '#-----',
 '#----------', '#-------------', '#---------------', '#b17:']

Looking at a few of these stems, the collection does not appear very useful for indexing. Many are not actual words at all; others are run together with dashes, and extraneous punctuation has been attached to the words. Let's try a better word-breaking tool:
Listing 6. Stemming based on a tokenizer with smarter heuristics

>>> from gnosis.indexer import TextSplitter as TS
>>> article = TS().text_splitter(open('cp-b17.txt').read())
>>> stems = FreqDist()
>>> for word in article:
...     stems.inc(stemmer.stem_word(word.lower()))
...
>>> word_stems = stems.samples()
>>> word_stems.sort()
>>> word_stems[60:80]
['bool', 'both', 'boundari', 'brain', 'bring', 'built', 'but', 'byte',
 'call', 'can', 'cannot', 'capabl', 'capit', 'carri', 'case', 'cast',
 'certain', 'certainli', 'chang', 'charm']

Here you can see several stems that have more than one possible expansion, and everything looks like a word or a morpheme. The method of word breaking matters a great deal for arbitrary text collections; in fairness to NLTK, the corpora it bundles have been packaged so that WSTokenizer() serves as an easy and accurate tokenizer for them. To get a robust, practically usable indexer, though, you need a robust word-breaking tool.
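If pulling in gnosis.indexer is not an option, current NLTK releases also bundle regular-expression tokenizers that do much better than naive whitespace splitting; this is a hedged sketch (assuming NLTK 3.x) that keeps only alphabetic runs before stemming:

from nltk import FreqDist
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer

words = RegexpTokenizer(r'[A-Za-z]+').tokenize(open('cp-b17.txt').read())
stems = FreqDist(PorterStemmer().stem(w.lower()) for w in words)
print(sorted(stems)[60:80])                  # stems that look like words and morphemes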

Tagging, chunking, and parsing

The largest part of NLTK consists of various parsers of varying levels of sophistication. For the most part, this introduction will not explain their details, but I would like to give a sense of what they are meant to accomplish.

Recall from the background above that tokens are a special kind of dictionary; in particular, they can contain a TAG key that indicates the grammatical role of a word. The corpus documents bundled with NLTK frequently come pre-tagged, but of course you can add your own tags to documents that are not tagged.

Chunking is something like "rough parsing." That is, chunking either relies on existing markup of grammatical components, or is something you add manually, or semi-automatically using regular expressions and program logic. Strictly speaking, though, it is not true parsing (there are no production rules as such). For example:
Listing 7. Chunk parsing/tagging: words and larger units

>>> from nltk.parser.chunk import ChunkedTaggedTokenizer
>>> chunked = "[ the/DT little/JJ cat/NN ] sat/VBD on/IN [ the/DT mat/NN ]"
>>> sentence = Token(TEXT=chunked)
>>> tokenizer = ChunkedTaggedTokenizer(chunk_node='NP')
>>> tokenizer.tokenize(sentence)
>>> sentence['SUBTOKENS'][0]
(NP: <the/DT> <little/JJ> <cat/NN>)
>>> sentence['SUBTOKENS'][0]['NODE']
'NP'
>>> sentence['SUBTOKENS'][0]['CHILDREN'][0]
<the/DT>
>>> sentence['SUBTOKENS'][0]['CHILDREN'][0]['TAG']
'DT'
>>> chunk_structure = TreeToken(NODE='S', CHILDREN=sentence['SUBTOKENS'])
>>> print chunk_structure
(S:
  (NP: <the/DT> <little/JJ> <cat/NN>)
  <sat/VBD>
  <on/IN>
  (NP: <the/DT> <mat/NN>))
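ChunkedTaggedTokenizer and TreeToken belong to the old 1.x API; in current NLTK releases (an assumption about your install, with the tagger model downloaded via nltk.download()) you would normally produce the tags yourself with nltk.pos_tag and get back a plain list of (word, tag) pairs:

import nltk

tokens = nltk.word_tokenize("The little cat sat on the mat")
print(nltk.pos_tag(tokens))
# roughly: [('The', 'DT'), ('little', 'JJ'), ('cat', 'NN'), ('sat', 'VBD'), ...]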
The chunking just described can be done with the RegexpChunkParser class, using pseudo-regular expressions to describe a sequence of tags that make up a grammatical element. Here is an example from the tutorial:
Listing 8. Chunking with regular expressions over tags

>>> rule1 = ChunkRule('<DT>?<JJ>*<NN>',
...     'Chunk optional det, zero or more adj, and a noun')
>>> chunkparser = RegexpChunkParser([rule1], chunk_node='NP', top_node='S')
>>> chunkparser.parse(sentence)
>>> print sentence['TREE']
(S: (NP: <the/DT> <little/JJ> <cat/NN>)
  <sat/VBD> <on/IN>
  (NP: <the/DT> <mat/NN>))
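The same rule carries over almost verbatim to the current API (assumed NLTK 3.x), where nltk.RegexpParser takes a small grammar string and chunks a list of (word, tag) pairs:

import nltk

tagged = [('the', 'DT'), ('little', 'JJ'), ('cat', 'NN'),
          ('sat', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]
chunker = nltk.RegexpParser('NP: {<DT>?<JJ>*<NN>}')   # optional det, any adjectives, a noun
print(chunker.parse(tagged))
# (S (NP the/DT little/JJ cat/NN) sat/VBD on/IN (NP the/DT mat/NN))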

True parsing takes us into many theoretical areas. For example, a top-down parser is guaranteed to find every possible production, but it can be very slow because of frequent (exponential) backtracking. Shift-reduce parsing is much more efficient, but can miss some productions. In either case, a grammar is declared in a way similar to grammar declarations used for parsing artificial languages. This column has looked at some of those: SimpleParse, mx.TextTools, Spark, and gnosis.xml.validity (see Resources).

Beyond top-down and shift-reduce parsers, NLTK also offers a "chart parser," which creates partial hypotheses that a given sequence may later be completed to fulfill a rule. This approach can be both efficient and complete. A quick (toy-grade) example:
Listing 9. Defining basic productions for a context-free grammar

>>> from nltk.parser.chart import *
>>> grammar = CFG.parse('''
...   S -> NP VP
...   VP -> V NP | VP PP
...   V -> "saw" | "ate"
...   NP -> "John" | "Mary" | "Bob" | Det N | NP PP
...   Det -> "a" | "an" | "the" | "my"
...   N -> "dog" | "cat" | "cookie"
...   PP -> P NP
...   P -> "on" | "by" | "with"
...   ''')
>>> sentence = Token(TEXT='John saw a cat with my cookie')
>>> WSTokenizer().tokenize(sentence)
>>> parser = ChartParser(grammar, BU_STRATEGY, LEAF='TEXT')
>>> parser.parse_n(sentence)
>>> for tree in sentence['TREES']: print tree
(S:
  (NP: <John>)
  (VP:
    (VP: (V: <saw>) (NP: (Det: <a>) (N: <cat>)))
    (PP: (P: <with>) (NP: (Det: <my>) (N: <cookie>)))))
(S:
  (NP: <John>)
  (VP:
    (V: <saw>)
    (NP:
      (NP: (Det: <a>) (N: <cat>))
      (PP: (P: <with>) (NP: (Det: <my>) (N: <cookie>))))))
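With a current NLTK release (assumed 3.x), the same toy grammar is written with nltk.CFG.fromstring, and nltk.ChartParser yields every parse of the token list:

import nltk

grammar = nltk.CFG.fromstring("""
  S -> NP VP
  VP -> V NP | VP PP
  V -> "saw" | "ate"
  NP -> "John" | "Mary" | "Bob" | Det N | NP PP
  Det -> "a" | "an" | "the" | "my"
  N -> "dog" | "cat" | "cookie"
  PP -> P NP
  P -> "on" | "by" | "with"
  """)
parser = nltk.ChartParser(grammar)
for tree in parser.parse("John saw a cat with my cookie".split()):
    print(tree)          # two parses: the PP attaches to the VP or to the NP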

A probabilistic context-free grammar (PCFG) is a context-free grammar that associates a probability with each of its productions. Likewise, parsers for probabilistic parsing are bundled with NLTK.
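As a minimal, hedged sketch of what that looks like in current NLTK releases (assumed 3.x; the tiny grammar here is invented for illustration): each production carries a bracketed probability, the probabilities for each left-hand side must sum to one, and nltk.ViterbiParser returns the most probable parse.

import nltk

pgrammar = nltk.PCFG.fromstring("""
  S -> NP VP [1.0]
  NP -> "John" [0.5] | Det N [0.5]
  Det -> "a" [1.0]
  N -> "cookie" [1.0]
  VP -> V NP [1.0]
  V -> "ate" [1.0]
  """)
parser = nltk.ViterbiParser(pgrammar)
for tree in parser.parse("John ate a cookie".split()):
    print(tree)          # the most probable tree, annotated with its probability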

What are you waiting for?

NLTK has other important capabilities that this short introduction cannot cover. For example, NLTK has a whole framework for text classification using statistical techniques such as "naive Bayes" and "maximum entropy" models. Even if I had the space, I could not explain their essence yet. But I believe that even NLTK's lower layers make it a useful framework for both teaching and practical applications.
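As a small taste of that classification layer (a hedged sketch against current NLTK releases; the feature function and the four-line training set are invented for illustration), nltk.NaiveBayesClassifier trains on (feature-dictionary, label) pairs:

import nltk

def features(text):
    # bag-of-words features: "does the text contain this word?"
    return {'has(%s)' % w: True for w in set(text.lower().split())}

train = [(features('the parser found a noun phrase'), 'linguistics'),
         (features('tagging and chunking of sentences'), 'linguistics'),
         (features('buy cheap watches now'), 'spam'),
         (features('cheap offer buy now'), 'spam')]
classifier = nltk.NaiveBayesClassifier.train(train)
print(classifier.classify(features('chunking a noun phrase')))   # expected: 'linguistics'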
