This article is an introductory tutorial on natural language processing (NLP) in Python, using Python's NLTK library. NLTK is Python's natural language processing toolkit and one of the most commonly used Python libraries in the NLP world. I hope it serves as a useful reference for everyone.
What is NLP?
In simple terms, natural language processing (NLP) is the development of applications or services that understand human language.
Here we will discuss some practical examples of natural language processing (NLP), such as speech recognition, speech translation, understanding complete sentences, recognizing synonyms of matching words, and generating grammatically correct, complete sentences and paragraphs.
This is not everything that NLP can do.
Applications of NLP
Search engines, such as Google and Yahoo. Google's search engine knows you are a technical person, so it shows results related to technology;
Social feeds, such as the Facebook News Feed. If the news feed algorithm knows that your interests are in natural language processing, it will show related ads and posts;
Voice assistants, such as Apple's Siri;
Spam filters, such as Google's spam filter. Unlike ordinary spam filtering, it determines whether a message is spam by understanding the deeper meaning of the message content.
NLP libraries
Here are some open-source natural language processing (NLP) libraries:
Natural Language Toolkit (NLTK);
Apache OpenNLP;
Stanford NLP Suite;
GATE NLP Library.
The Natural Language Toolkit (NLTK) is the most popular natural language processing (NLP) library. It is written in Python and backed by a very strong community.
NLTK is also easy to get started with; in fact, it is the simplest natural language processing (NLP) library to use.
In this NLP tutorial, we will use the Python NLTK library.
Installing NLTK
If you are using Windows, Linux, or macOS, you can install NLTK with pip:

pip install nltk
Open a Python terminal and import NLTK to check that it has been installed correctly:

import nltk
If all goes well, it means you have successfully installed the NLTK library. The first time you install NLTK, you also need to install the NLTK data packages by running the following code:

import nltk
nltk.download()
This will pop up the NLTK download window to select which packages need to be installed:
You can install all of the packages; they are small, so it is not a problem.
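If you prefer not to download everything, here is a minimal sketch that fetches only the data packages used in this tutorial (the package IDs come from NLTK's data index):

import nltk
# download just what this tutorial needs
nltk.download('punkt')      # pre-trained sentence and word tokenizer models
nltk.download('stopwords')  # stop word lists for many languages
nltk.download('wordnet')    # the WordNet database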
Tokenizing text with Python
First, we will grab the content of a web page, and then analyze the text to understand what the page is about.
We will use the urllib module to crawl the web page:
import urllib.request

response = urllib.request.urlopen('http://php.net/')
html = response.read()
print(html)
As you can see from the printed result, the content contains a lot of HTML tags that need to be cleaned up.
Then use the BeautifulSoup module to clean up the text:
from bs4 import BeautifulSoup
import urllib.request

response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")  # requires the html5lib module
text = soup.get_text(strip=True)
print(text)
Now we have clean text from the crawled page.
Next, convert the text to tokens, like this:
from bs4 import BeautifulSoup
import urllib.request

response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
tokens = text.split()
print(tokens)
Counting word frequency
The text has been processed; now let's use Python NLTK to count the frequency distribution of the tokens.
This can be done by calling NLTK's FreqDist() function:
from bs4 import BeautifulSoup
import urllib.request
import nltk

response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
tokens = text.split()
freq = nltk.FreqDist(tokens)
for key, val in freq.items():
    print(str(key) + ':' + str(val))
If you search the output, you can find that the most common token is PHP.
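Since FreqDist extends Python's collections.Counter, you can also list the top tokens directly; a quick addition, assuming the freq object from the snippet above:

print(freq.most_common(10))  # the ten most frequent tokens with their counts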
You can call the plot function to make a frequency distribution graph:
freq.plot(cumulative=False)  # requires the matplotlib library
Looking at the chart, you will notice that many of the most frequent tokens are words such as "of", "a", and "an". Words like these are called stop words.
In general, stop words should be removed to prevent them from distorting the results of the analysis.
Handling stop words
NLTK comes with stop word lists for many languages. To get the English stop words:
from nltk.corpus import stopwords

stopwords.words('english')
Now, modify the code to remove the stop words before plotting:

clean_tokens = list()
sr = stopwords.words('english')
for token in tokens:
    if token not in sr:
        clean_tokens.append(token)
The final code should look like this:
from bs4 import BeautifulSoup
import urllib.request
import nltk
from nltk.corpus import stopwords

response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
tokens = text.split()
clean_tokens = list()
sr = stopwords.words('english')
for token in tokens:
    if token not in sr:
        clean_tokens.append(token)
freq = nltk.FreqDist(clean_tokens)
for key, val in freq.items():
    print(str(key) + ':' + str(val))

Now plot the frequency distribution again. The chart will look better than before, because the stop words have been removed:
freq.plot(20, cumulative=False)
Tokenizing text with NLTK
Earlier we split the text into tokens using the split method; now we will use NLTK to tokenize the text.
Text cannot be processed until it has been tokenized, so tokenization is an important step. Tokenizing means breaking a larger body of text into smaller parts.
You can tokenize paragraphs into sentences and tokenize sentences into individual words; NLTK provides a sentence tokenizer and a word tokenizer for this.
Suppose there is text like this:
Hello Adam, how are you? I hope everything is going well. Today is a good day, see you dude.
Use the sentence tokenizer to split the text into sentences:

from nltk.tokenize import sent_tokenize

mytext = "Hello Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(sent_tokenize(mytext))
The output is as follows:
['Hello Adam, how are you?', 'I hope everything is going well.', 'Today is a good day, see you dude.']
You might think this is too simple to need the NLTK tokenizer: you could just split the text into sentences with a regular expression, since every sentence ends with punctuation and a space.
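For comparison, here is a minimal sketch of that naive regex approach (plain Python, nothing from NLTK):

import re

mytext = "Hello Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
# naive rule: split after '.', '!' or '?' followed by whitespace
print(re.split(r'(?<=[.!?])\s+', mytext))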
Then look at the following text:
Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude.
If you split on punctuation, "Hello Mr." would be treated as a sentence of its own. NLTK, by contrast, handles it correctly:
from nltk.tokenize import sent_tokenize

mytext = "Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(sent_tokenize(mytext))
The output is as follows:
['Hello Mr. Adam, how are you?', 'I hope everything is going well.', 'Today is a good day, see you dude.']
This is the right split.
Next try the word tokenizer:
from nltk.tokenize import word_tokenize

mytext = "Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(word_tokenize(mytext))
The output is as follows:
['Hello', 'Mr.', 'Adam', ',', 'how', 'are', 'you', '?', 'I', 'hope', 'everything', 'is', 'going', 'well', '.', 'Today', 'is', 'a', 'good', 'day', ',', 'see', 'you', 'dude', '.']
Notice that the word "Mr." was not split apart. NLTK uses the PunktSentenceTokenizer from the punkt module, which is part of nltk.tokenize, and this tokenizer has been trained to work with many languages.
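If you want to work with the Punkt tokenizer directly, here is a small sketch that loads the pre-trained English model (the pickle path assumes the standard NLTK data layout, installed via nltk.download('punkt')):

import nltk.data

# load the pre-trained English Punkt sentence tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
print(tokenizer.tokenize("Hello Mr. Adam, how are you? I hope everything is going well."))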
Non-English tokenization
You can specify the language when tokenizing:

from nltk.tokenize import sent_tokenize

mytext = "Bonjour M. Adam, comment allez-vous? J'espère que tout va bien. Aujourd'hui est un bon jour."
print(sent_tokenize(mytext, "french"))
The output results are as follows:
['Bonjour M. Adam, comment allez-vous?', "J'espère que tout va bien.", "Aujourd'hui est un bon jour."]
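The word tokenizer accepts a language argument as well; a brief sketch:

from nltk.tokenize import word_tokenize

mytext = "Bonjour M. Adam, comment allez-vous?"
# the language parameter selects the matching Punkt model internally
print(word_tokenize(mytext, "french"))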
Synonym processing
One of the packages available in the nltk.download() installation interface is WordNet.
WordNet is a database built for natural language processing. It includes groups of synonyms and brief definitions.
You can get a definition and an example of a given word in this way:
from nltk.corpus import wordnet

syn = wordnet.synsets("pain")
print(syn[0].definition())
print(syn[0].examples())
The output is:
a symptom of some physical hurt or disorder
['the patient developed severe pain and distension']
WordNet contains definitions for many words:

from nltk.corpus import wordnet

syn = wordnet.synsets("NLP")
print(syn[0].definition())
syn = wordnet.synsets("Python")
print(syn[0].definition())
The results are as follows:
the branch of information science that deals with natural language information
large Old World boas
You can use WordNet to get synonyms like this:
from nltk.corpus import wordnet

synonyms = []
for syn in wordnet.synsets('computer'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
print(synonyms)
Output:
['computer', 'computing_machine', 'computing_device', 'data_processor', 'electronic_computer', 'information_processing_system', 'calculator', 'reckoner', 'figurer', 'estimator', 'computer']
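Note that 'computer' appears twice, because the same lemma occurs in more than one synset. If you only want unique synonyms, wrap the list in a set:

print(set(synonyms))  # duplicates removed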
Antonym processing
You can also get antonyms in the same way:
from nltk.corpus import wordnet

antonyms = []
for syn in wordnet.synsets("small"):
    for l in syn.lemmas():
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())
print(antonyms)
Output:
['large', 'big', 'big']
Stem extraction
In morphology and information retrieval, stemming is the process of stripping affixes from a word to obtain its root; for example, the stem of "working" is "work".
Search engines use this technique when indexing pages, because people may write different forms of the same word.
There are many stemming algorithms; the most common is the Porter stemming algorithm. NLTK has a class called PorterStemmer that implements it:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem('working'))
print(stemmer.stem('worked'))
The output is:
work
work
There are other stemming algorithms as well, such as the Lancaster stemming algorithm, sketched below.
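NLTK also provides a LancasterStemmer class; a brief sketch (Lancaster is generally more aggressive than Porter):

from nltk.stem import LancasterStemmer

lancaster = LancasterStemmer()
print(lancaster.stem('working'))
print(lancaster.stem('worked'))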
Non-English stem extraction
In addition to English, SnowballStemmer also supports 13 other languages.
The supported languages:

from nltk.stem import SnowballStemmer

print(SnowballStemmer.languages)

('danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
You can use the stem function of the SnowballStemmer class to stem non-English words like this:

from nltk.stem import SnowballStemmer

french_stemmer = SnowballStemmer('french')
print(french_stemmer.stem("French word"))  # substitute an actual French word here
Lemmatization
Lemmatization is similar to stemming, but the difference is that the result of lemmatization is always a real word. Stemming, by contrast, can produce output that is not a word at all:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem('increases'))
Results:
increas
Now, if you lemmatize the same word with NLTK's WordNet, the result is correct:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('increases'))
Results:
increase
The result may be a synonym, or a different word with the same meaning.
Sometimes, when you lemmatize a word, you keep getting the same word back.
This is because the default part of speech is the noun. To lemmatize as a verb, specify the pos parameter:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('playing', pos="v"))
Results:
play
In fact, lemmatization is also a good way to compress text: the result can be only 50% to 60% of the size of the original text.
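As a rough sketch of how you might measure that ratio on your own text (the sample sentence and the verb-only pos choice are just illustrative; the ratio will vary by corpus):

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
text = "He was running and eating at the same time."
tokens = word_tokenize(text)
# lemmatize every token as a verb, then rejoin
lemmas = [lemmatizer.lemmatize(t, pos="v") for t in tokens]
compressed = " ".join(lemmas)
print(len(compressed) / len(text))  # compressed-to-original length ratio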
The pos argument can also be a verb (v), a noun (n), an adjective (a), or an adverb (r):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('playing', pos="v"))
print(lemmatizer.lemmatize('playing', pos="n"))
print(lemmatizer.lemmatize('playing', pos="a"))
print(lemmatizer.lemmatize('playing', pos="r"))
Output:
play
playing
playing
playing
The difference between stemming and lemmatization
Observe the following example:
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem('stones'))
print(stemmer.stem('speaking'))
print(stemmer.stem('bedroom'))
print(stemmer.stem('jokes'))
print(stemmer.stem('lisa'))
print(stemmer.stem('purple'))
print('----------------------')
print(lemmatizer.lemmatize('stones'))
print(lemmatizer.lemmatize('speaking'))
print(lemmatizer.lemmatize('bedroom'))
print(lemmatizer.lemmatize('jokes'))
print(lemmatizer.lemmatize('lisa'))
print(lemmatizer.lemmatize('purple'))
Output:
stone
speak
bedroom
joke
lisa
purpl
----------------------
stone
speaking
bedroom
joke
lisa
purple
Stemming works without knowing the context, which is why it is faster but less accurate than lemmatization.
Personally, I think lemmatization is better than stemming. Lemmatization returns a real word; even if it is not the same word, at least it is a real word.
If you only care about speed and not accuracy, then stemming is the better choice.
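If you want to see the speed difference for yourself, here is a rough micro-benchmark sketch (absolute timings depend on your machine; WordNet is loaded lazily, so we warm the lemmatizer up before timing):

import timeit
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
words = ['stones', 'speaking', 'bedroom', 'jokes', 'increases', 'playing'] * 100

lemmatizer.lemmatize('warmup')  # trigger the lazy WordNet load before timing
print(timeit.timeit(lambda: [stemmer.stem(w) for w in words], number=10))
print(timeit.timeit(lambda: [lemmatizer.lemmatize(w) for w in words], number=10))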
Summary
All the steps discussed in this NLP tutorial were just text preprocessing. In future articles, we will use Python NLTK to implement text analysis.