This article introduces the Natural Language Toolkit (NLTK) for working with natural language in Python; it is adapted from IBM developerWorks technical documentation (see the references for more on NLTK). NLTK is an excellent tool both for teaching with Python and for practicing computational linguistics. Computational linguistics, in turn, is closely related to artificial intelligence, language and dialect identification, translation, grammar checking, and other fields.
What does NLTK include?
NLTK is naturally thought of as a series of layers, each stacked on the one below it. Readers familiar with grammars and parsing of artificial languages (such as Python) will not find it much of a stretch to understand the analogous, but deeper, layers of natural language modeling.
Glossary
Corpora: collections of related texts. For example, the works of Shakespeare might collectively be called a corpus; the works of several authors, corpora.
Histogram: the statistical distribution of the frequency of different words, letters, or other items in a data set.
Syntagmatic: the study of syntagma, that is, the statistical relations of letters, words, or phrases occurring contiguously in corpora.
Context-free grammar: type 2 in Noam Chomsky's hierarchy of four types of formal grammars. See the references for a thorough description.
Although NLTK comes with many corpora that have already been pre-processed (often by hand) to various degrees, conceptually each layer relies on the processing of the adjacent lower layer. Tokenization comes first; then words are tagged; then groups of words are parsed into grammatical elements such as noun phrases or sentences (according to one of several techniques, each with its own advantages and drawbacks); and finally sentences, or other grammatical units, are classified. Along the way, NLTK lets you generate statistics about the occurrence of the various elements, and draw graphs that represent either the processing itself or aggregate statistical results.
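To make the layering concrete, here is a minimal sketch of the tokenize, tag, and chunk steps using the current NLTK 3.x API. This is my own illustration, not part of the original tutorial: the sample sentence is invented, the nltk.download() data packages are noted as assumptions in the comments, and the older nltk.tokenizer interface used in the listings below looks quite different.

# Minimal sketch of the tokenize -> tag -> chunk pipeline with modern NLTK (3.x).
# Assumes the tokenizer and tagger data have been installed, e.g. via
# nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
# (exact resource names vary slightly between NLTK versions).
import nltk

text = "The little cat sat on the mat."        # invented sample text
tokens = nltk.word_tokenize(text)              # step 1: break the text into word tokens
tagged = nltk.pos_tag(tokens)                  # step 2: attach a part-of-speech tag to each token
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>}")
chunks = chunker.parse(tagged)                 # step 3: group tagged words into noun-phrase chunks

print(tokens)   # ['The', 'little', 'cat', 'sat', 'on', 'the', 'mat', '.']
print(tagged)   # [('The', 'DT'), ('little', 'JJ'), ('cat', 'NN'), ...]
print(chunks)   # a Tree with (NP The/DT little/JJ cat/NN) and (NP the/DT mat/NN) chunks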
In this article you will see fairly complete examples of the lower-level capabilities, while most of the higher-level capabilities will be described only abstractly. Let us now look at the first steps of text processing in detail.
Tokenization
Much of what you can do with NLTK, particularly at its lower levels, is not very different from what you can do with Python's basic data structures. But NLTK provides a set of systematic interfaces that the higher levels depend on and use, rather than simply providing convenience classes for handling marked-up or tagged text.
Specifically, the nltk.tokenizer.Token class is widely used to store annotated pieces of text. These annotations can mark many different features, including parts of speech, subtoken structures, the offset position of a token within a larger text, morphological stems, grammatical sentence components, and so on. In fact, a Token is a special kind of dictionary, and is accessed like a dictionary, so it can hold whatever keys you like. A few special keys are used in NLTK, with different keys used by the different subpackages.
Let us look briefly at creating a token and splitting it into subtokens:
Listing 1. A first look at the nltk.tokenizer.Token class
>>> from nltk.tokenizer import *
>>> t = Token(TEXT='This is my first test sentence')
>>> WSTokenizer().tokenize(t, addlocs=True) # break on whitespace
>>> print t['TEXT']
This is my first test sentence
>>> print t['SUBTOKENS']
[<This>@[0:4c], <is>@[5:7c], <my>@[8:10c], <first>@[11:16c],
 <test>@[17:21c], <sentence>@[22:30c]]
>>> t['foo'] = 'bar'
>>> t
<TEXT='This is my first test sentence', foo='bar',
 SUBTOKENS=[<This>@[0:4c], <is>@[5:7c], <my>@[8:10c], <first>@[11:16c],
 <test>@[17:21c], <sentence>@[22:30c]]>
>>> print t['SUBTOKENS'][0]
<This>@[0:4c]
>>> print type(t['SUBTOKENS'][0])
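As an aside of my own (not part of the original article): current NLTK releases no longer use the Token dictionary at all. A rough modern equivalent of whitespace tokenization with character offsets looks like this sketch:

# Rough modern (NLTK 3.x) equivalent of WSTokenizer with addlocs=True.
from nltk.tokenize import WhitespaceTokenizer

text = 'This is my first test sentence'
tokenizer = WhitespaceTokenizer()
words = tokenizer.tokenize(text)               # ['This', 'is', 'my', 'first', 'test', 'sentence']
spans = list(tokenizer.span_tokenize(text))    # [(0, 4), (5, 7), (8, 10), (11, 16), (17, 21), (22, 30)]

for word, (start, end) in zip(words, spans):
    print(word, start, end)                    # e.g. "This 0 4", matching @[0:4c] above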
Probability
For linguistic corpora, one of the first things you are likely to want to do is analyze the frequency distribution of various events, and make probability predictions based on those known frequency distributions. NLTK supports a variety of methods for probability prediction based on natural frequency-distribution data. I will not cover those methods here (see the probability tutorial listed in the references); suffice it to say that what you would rightly expect is related in a somewhat fuzzy way to what you already know (beyond the obvious scaling/normalization).
Basically, NLTK supports two types of frequency distribution: histograms and conditional frequency distributions. The nltk.probability.FreqDist class is used to create histograms; for example, a word histogram can be created like this:
Listing 2. Creating a basic histogram with nltk.probability.FreqDist
>>> from nltk.probability import *
>>> article = Token(TEXT=open('cp-b17.txt').read())
>>> WSTokenizer().tokenize(article)
>>> freq = FreqDist()
>>> for word in article['SUBTOKENS']:
...     freq.inc(word['TEXT'])
>>> freq.B()
1194
>>> freq.count('Python')
12
The probability tutorial discusses the creation of histograms over more complex features, such as "the length of words following a word ending in a vowel." The nltk.draw.plot.Plot class can be used to display histograms visually. And of course you can just as well analyze the frequency distribution of higher-level grammatical features, or even of data sets unrelated to NLTK.
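As an illustration of that kind of derived feature, here is a small sketch of my own (not taken from the tutorial) that counts the lengths of words following a word that ends in a vowel; the word list is just a stand-in for a real corpus:

# Sketch: histogram of "length of the word following a word that ends in a vowel".
# FreqDist is imported from its modern location; the word list is a toy stand-in.
from nltk.probability import FreqDist

words = "this example sentence is only a stand in for a real corpus".split()
fd = FreqDist()
for prev, nxt in zip(words, words[1:]):
    if prev[-1].lower() in 'aeiou':      # previous word ends in a vowel
        fd[len(nxt)] += 1                # count the length of the following word

for length in sorted(fd):
    print(length, fd[length], fd.freq(length))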
Conditional frequency distributions can be even more interesting than plain histograms. A conditional frequency distribution is a kind of two-dimensional histogram: it shows you one histogram for each initial condition, or "context." For example, the tutorial raises the question of how word length is distributed for each initial letter. That is what we analyze here:
Listing 3. Conditional frequency distributions: word lengths for each initial letter
>>> cf = ConditionalFreqDist()
>>> for word in article['SUBTOKENS']:
...     cf[word['TEXT'][0]].inc(len(word['TEXT']))
...
>>> init_letters = cf.conditions()
>>> init_letters.sort()
>>> for c in init_letters[44:50]:
...     print "Init %s:" % c,
...     for length in range(1,6):
...         print "len %d/%.2f," % (length, cf[c].freq(length)),
...     print
...
Init a: len 1/0.03, len 2/0.03, len 3/0.03, len 4/0.03, len 5/0.03,
Init b: len 1/0.12, len 2/0.12, len 3/0.12, len 4/0.12, len 5/0.12,
Init c: len 1/0.06, len 2/0.06, len 3/0.06, len 4/0.06, len 5/0.06,
Init d: len 1/0.06, len 2/0.06, len 3/0.06, len 4/0.06, len 5/0.06,
Init e: len 1/0.18, len 2/0.18, len 3/0.18, len 4/0.18, len 5/0.18,
Init f: len 1/0.25, len 2/0.25, len 3/0.25, len 4/0.25, len 5/0.25,
A nice application of conditional frequency distributions to language is the analysis of syntagmatic distributions in corpora: for example, given the occurrence of a particular word, which word is most likely to come next. Grammar, of course, imposes some constraints here; but the study of selection among syntactic options falls within the fields of semantics, pragmatics, and register.
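Here is a small sketch of that idea using a conditional frequency distribution over word bigrams; the toy word list is my own stand-in for a real corpus:

# Sketch: given a word, which word most often comes next?
# Uses the modern ConditionalFreqDist and bigrams helpers; the text is a toy stand-in.
from nltk import ConditionalFreqDist, bigrams

words = ("the cat sat on the mat and the cat saw the dog "
         "and the dog sat on the cat").split()

cfd = ConditionalFreqDist(bigrams(words))   # condition = current word, sample = next word
print(cfd['the'].most_common(3))            # the most likely successors of 'the'
print(cfd['cat'].max())                     # the single most likely successor of 'cat'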
Stemming
The nltk.stemmer.porter.PorterStemmer class is a wonderfully handy tool for deriving grammatical (prefix) stems from English words. This capability is especially exciting to me, because I previously created a public-domain, full-text indexing search tool/library in Python (described in Developing a full-text indexer in Python; see the references), and it has been used in quite a number of other projects.
While the ability to search a large collection of documents for a specific set of words is very practical (the job gnosis.indexer does), for many search purposes a little fuzziness helps. Perhaps you are not quite sure whether the old e-mail you are looking for used the word "complicated", "complications", "complicating", or "complicates", but you remember that it was something along those lines (probably together with a few other words, making for a worthwhile search).
NLTK includes an excellent algorithm for word stemming, and lets you customize the stemming algorithm to your own preferences:
Listing 4. Stemming words to their morphological roots
>>> from nltk.stemmer.porter import PorterStemmer
>>> PorterStemmer().stem_word('complications')
'complic'
Exactly how you might use the stemming capability, whether within gnosis.indexer, something derived from it, or an entirely different indexing tool, depends on your usage scenario. Fortunately, gnosis.indexer has an open interface that is easy to customize. Do you need an index composed entirely of stems? Or do you want to include both the full words and the stems in the index? Do you need to separate stemmed matches in the results from exact matches? In a future version of gnosis.indexer I will introduce some kinds of stemming capability, but end users may still want to customize it in different ways.
In any case, adding stemming is generally quite simple: first, derive the stems of the words in a document by customizing gnosis.indexer.TextSplitter; second, when you perform a search, (optionally) stem the search terms before using them for the index lookup, perhaps by customizing your MyIndexer.find() method.
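To make that arrangement concrete, here is a toy sketch of my own; it is not the gnosis.indexer API, just an illustration of indexing stems and stemming the query terms before lookup (the modern nltk.stem.porter import path is assumed):

# Toy stem-based index: index the stems of document words, then stem the query
# terms before matching. Illustration only, not the gnosis.indexer interface.
import re
from collections import defaultdict
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
index = defaultdict(set)                     # stem -> set of document names

def add_document(name, text):
    for word in re.findall(r"[A-Za-z]+", text):        # crude word splitting
        index[stemmer.stem(word.lower())].add(name)    # store the stem, not the surface form

def find(query):
    stems = [stemmer.stem(w.lower()) for w in query.split()]
    hits = [index.get(s, set()) for s in stems]
    return set.intersection(*hits) if hits else set()

add_document('mail1.txt', "That matter turned out to be quite complicated.")
add_document('mail2.txt', "Nothing complex here at all.")
print(find('complications'))   # {'mail1.txt'}, matched via the shared stem 'complic'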
While playing with the PorterStemmer, I found that the nltk.tokenizer.WSTokenizer class really is as unsatisfactory as the tutorial warns. It is fine for a conceptual role, but for real-life texts you can do much better at identifying what counts as a "word". Fortunately, gnosis.indexer.TextSplitter is a robust tokenizer. For example:
Listing 5. Stemming based on NLTK's poor tokenization
>>> from nltk.tokenizer import *
>>> article = Token(TEXT=open('cp-b17.txt').read())
>>> WSTokenizer().tokenize(article)
>>> from nltk.probability import *
>>> from nltk.stemmer.porter import *
>>> stemmer = PorterStemmer()
>>> stems = FreqDist()
>>> for word in article['SUBTOKENS']:
...     stemmer.stem(word)
...     stems.inc(word['STEM'].lower())
...
>>> word_stems = stems.samples()
>>> word_stems.sort()
>>> word_stems[20:40]
['"generator-bas', '"implement', '"lazili', '"magic"', '"partial',
 '"pluggable"', '"primitives"', '"repres', '"secur', '"semi-coroutines."',
 '"state', '"understand', '"weightless', '"whatev', '#', '#-----',
 '#----------', '#-------------', '#---------------', '#b17:']
Looking at a few of the stems, the ones in this collection do not all look usable for indexing. Many are not really words at all; others are compounds run together with dashes, and extraneous punctuation has found its way into the words. Let us try again with a better tokenizer:
Listing 6. Stemming using the smarter heuristics of a better tokenizer
>>> from gnosis.indexer import TextSplitter as TS   # assumed import: TS is taken to be gnosis.indexer.TextSplitter
>>> article = TS().text_splitter(open('cp-b17.txt').read())
>>> stems = FreqDist()
>>> for word in article:
...     stems.inc(stemmer.stem_word(word.lower()))
...
>>> word_stems = stems.samples()
>>> word_stems.sort()
>>> word_stems[60:80]
['bool', 'both', 'boundari', 'brain', 'bring', 'built', 'but', 'byte',
 'call', 'can', 'cannot', 'capabl', 'capit', 'carri', 'case', 'cast',
 'certain', 'certainli', 'chang', 'charm']
Here you can see that several of the words have multiple possible inflections, and that every entry looks like a word or at least a morpheme. Tokenization matters a great deal for arbitrary collections of text. To be fair, the corpora bundled with NLTK have been packaged so that WSTokenizer handles them easily and accurately. But to get a robust, real-world indexer, you need a robust tokenizer.
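For a rough idea of the kind of heuristics involved, here is a small sketch built around a single regular expression; this is my own illustration, not the actual gnosis.indexer.TextSplitter logic:

# Sketch of heuristic word splitting: keep alphabetic words (allowing internal
# apostrophes and hyphens), drop bare punctuation and comment markers.
# Illustration only, not the real gnosis.indexer.TextSplitter.
import re

WORD_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")

def split_words(text):
    return WORD_RE.findall(text)

sample = '#----- "generator-based" semi-coroutines aren\'t magic'
print(split_words(sample))
# ['generator-based', 'semi-coroutines', "aren't", 'magic']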
Tagging, chunking, and parsing
The largest portion of NLTK consists of various parsers of differing levels of sophistication. For the most part this introduction will not explain their details, but I would like to give a rough idea of what they are meant to do.
Remember the background fact that tokens are special dictionaries, in particular ones that can contain a TAG key to indicate the grammatical role of a word. NLTK corpus documents frequently come pre-tagged for parts of speech, but you can certainly add your own tags to untagged documents.
Chunking is something like "parsing lite". That is, chunking works either from existing markup of grammatical components, or from markup that you add manually, or semi-automatically using regular expressions and program logic. But it is not really parsing, properly speaking (there are no production rules as such). For example:
Listing 7. Chunk parsing/tagging: words and larger units
>>> from nltk.parser.chunk import ChunkedTaggedTokenizer
>>> chunked = "[ the/DT little/JJ cat/NN ] sat/VBD on/IN [ the/DT mat/NN ]"
>>> sentence = Token(TEXT=chunked)
>>> tokenizer = ChunkedTaggedTokenizer(chunk_node='NP')
>>> tokenizer.tokenize(sentence)
>>> sentence['SUBTOKENS'][0]
(NP: <the/DT> <little/JJ> <cat/NN>)
>>> sentence['SUBTOKENS'][0]['NODE']
'NP'
>>> sentence['SUBTOKENS'][0]['CHILDREN'][0]
<the/DT>
>>> sentence['SUBTOKENS'][0]['CHILDREN'][0]['TAG']
'DT'
>>> chunk_structure = TreeToken(NODE='S', CHILDREN=sentence['SUBTOKENS'])
(S: (NP: <the/DT> <little/JJ> <cat/NN>)
  <sat/VBD>
  <on/IN>
  (NP: <the/DT> <mat/NN>))
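In current NLTK releases the same chunk structure is represented by the nltk.Tree class rather than by token dictionaries with NODE and CHILDREN keys; the following sketch of the corresponding access is my own mapping, not something from the original article:

# Sketch: the modern nltk.Tree equivalent of the NODE/CHILDREN/TAG access above.
from nltk import Tree

np = Tree('NP', [('the', 'DT'), ('little', 'JJ'), ('cat', 'NN')])
sentence = Tree('S', [np, ('sat', 'VBD'), ('on', 'IN'),
                      Tree('NP', [('the', 'DT'), ('mat', 'NN')])])

print(np.label())    # 'NP'            (the old 'NODE' key)
print(np[0])         # ('the', 'DT')   (the old 'CHILDREN'[0])
print(np[0][1])      # 'DT'            (the old 'TAG' key)
print(sentence)      # (S (NP the/DT little/JJ cat/NN) sat/VBD on/IN (NP the/DT mat/NN))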
The chunking just shown can be performed with the nltk.parser.chunk.RegexpChunkParser class, using pseudo regular expressions to describe a series of tags that make up a grammatical element. Here is an example from the NLTK tutorials:
Listing 8. Chunking with regular expressions over tags
>>> rule1 = ChunkRule('<DT>?<JJ>*<NN.*>',
...     'Chunk optional det, zero or more adj, and a noun')
>>> chunkparser = RegexpChunkParser([rule1], chunk_node='NP', top_node='S')
>>> chunkparser.parse(sentence)
>>> print sentence['TREE']
(S: (NP: <the/DT> <little/JJ> <cat/NN>)
  <sat/VBD>
  <on/IN>
  (NP: <the/DT> <mat/NN>))
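For comparison, the modern counterpart is nltk.RegexpParser, which accepts the same kind of tag pattern; the sketch below is mine, assuming an already-tagged sentence:

# Sketch: the modern nltk.RegexpParser equivalent of the chunk rule above.
import nltk

tagged = [('the', 'DT'), ('little', 'JJ'), ('cat', 'NN'),
          ('sat', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]

# Same idea as ChunkRule('<DT>?<JJ>*<NN.*>', ...): an optional determiner,
# any number of adjectives, then a noun.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>}")
print(chunker.parse(tagged))
# (S (NP the/DT little/JJ cat/NN) sat/VBD on/IN (NP the/DT mat/NN))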
True parsing leads us into many theoretical areas. For example, top-down parsers are guaranteed to find every possible production, but they can be extremely slow because of frequent (exponential-order) backtracking. Shift-reduce parsing is much more efficient, but it can miss some productions. In either case, grammar rules are declared in a manner fairly similar to the declarations used for parsing artificial languages. This column has looked at some of those: SimpleParse, mx.TextTools, Spark, and gnosis.xml.validity (see the references).
Beyond top-down and shift-reduce parsers, NLTK also offers "chart parsers", which create partial hypotheses that a given sequence may later be completed to fulfill a rule. This approach can be both efficient and complete. A quick (toy-level) example:
Listing 9. Defining basic productions for a context-free grammar
>>> from nltk.parser.chart import *
>>> grammar = CFG.parse('''
...    S -> NP VP
...    VP -> V NP | VP PP
...    V -> "saw" | "ate"
...    NP -> "John" | "Mary" | "Bob" | Det N | NP PP
...    Det -> "a" | "an" | "the" | "my"
...    N -> "dog" | "cat" | "cookie"
...    PP -> P NP
...    P -> "on" | "by" | "with"
...    ''')
>>> sentence = Token(TEXT='John saw a cat with my cookie')
>>> WSTokenizer().tokenize(sentence)
>>> parser = ChartParser(grammar, BU_STRATEGY, LEAF='TEXT')
>>> parser.parse_n(sentence)
>>> for tree in sentence['TREES']: print tree
(S: (NP: <John>)
  (VP: (VP: (V: <saw>) (NP: (Det: <a>) (N: <cat>)))
    (PP: (P: <with>) (NP: (Det: <my>) (N: <cookie>)))))
(S: (NP: <John>)
  (VP: (V: <saw>)
    (NP: (NP: (Det: <a>) (N: <cat>))
      (PP: (P: <with>) (NP: (Det: <my>) (N: <cookie>))))))
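For what it is worth, the same toy grammar works nearly verbatim with the modern API; the sketch below is mine, with the caveat that the current ChartParser chooses its own default strategy rather than taking a BU_STRATEGY argument:

# Sketch: the same toy grammar with modern NLTK's CFG and ChartParser.
import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    VP -> V NP | VP PP
    V -> "saw" | "ate"
    NP -> "John" | "Mary" | "Bob" | Det N | NP PP
    Det -> "a" | "an" | "the" | "my"
    N -> "dog" | "cat" | "cookie"
    PP -> P NP
    P -> "on" | "by" | "with"
""")

parser = nltk.ChartParser(grammar)
tokens = "John saw a cat with my cookie".split()
for tree in parser.parse(tokens):   # yields both readings of the ambiguous PP attachment
    print(tree)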
A probabilistic context-free grammar (PCFG) is a context-free grammar that associates a probability with each of its productions. Likewise, parsers for probabilistic parsing are bundled with NLTK.
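As a minimal taste of what that looks like in the modern API, here is a sketch with a toy grammar of my own and invented probabilities attached to the productions in square brackets:

# Sketch: a tiny probabilistic context-free grammar (PCFG) parsed with the
# Viterbi parser bundled with modern NLTK. Grammar and probabilities are invented.
import nltk

pcfg = nltk.PCFG.fromstring("""
    S -> NP VP      [1.0]
    NP -> "John"    [0.5]
    NP -> Det N     [0.5]
    VP -> V NP      [1.0]
    Det -> "a"      [1.0]
    N -> "cat"      [1.0]
    V -> "saw"      [1.0]
""")

parser = nltk.ViterbiParser(pcfg)              # finds the most probable parse
for tree in parser.parse("John saw a cat".split()):
    print(tree, tree.prob())                   # the tree carries its probability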
What are you waiting for?
NLTK has other important capabilities that this brief introduction could not get to. For example, NLTK has an entire framework for text classification, using statistical techniques such as "naive Bayes" and "maximum entropy". Even if I had the space, I could not yet explain their essence. But I think that even the lower layers of NLTK make it a practical framework for both teaching applications and real-world applications.
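As a small taste of that classification framework, here is a minimal naive Bayes sketch; the feature function, labels, and training sentences are my own invention, not an example from NLTK's documentation:

# Sketch: NLTK's naive Bayes classifier on toy feature dictionaries.
# The features, labels, and training data are invented for illustration.
import nltk

def features(sentence):
    words = sentence.lower().split()
    return {'has(python)': 'python' in words,
            'has(grammar)': 'grammar' in words}

train = [(features("python code and python modules"), 'programming'),
         (features("a grammar for noun phrases"), 'linguistics'),
         (features("python scripts parse text"), 'programming'),
         (features("grammar rules and parse trees"), 'linguistics')]

classifier = nltk.NaiveBayesClassifier.train(train)
print(classifier.classify(features("some python tricks")))     # 'programming'
print(classifier.classify(features("a grammar of English")))   # 'linguistics'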