Python + NLTK natural language learning, part two: texts

Last Update:2017-06-25 Source: Internet

Author: User

Tags nltk


While installing NLTK in the previous part, we downloaded a number of sample texts, nine in total:

text1: Moby Dick by Herman Melville 1851

text2: Sense and Sensibility by Jane Austen 1811

text3: The Book of Genesis

text4: Inaugural Address Corpus

text5: Chat Corpus

text6: Monty Python and the Holy Grail

text7: Wall Street Journal

text8: Personals Corpus

text9: The Man Who Was Thursday by G . K . Chesterton 1908

To look at any of them, just type its name.

from nltk.book import *
print(text1)
print(text2)

E:\python2.7.11\python.exe e:/py_prj/nltk_study/chapter1.py

<Text: Moby Dick by Herman Melville 1851>

<Text: Sense and Sensibility by Jane Austen 1811>

We can also search the text for a word.

text1.concordance('monstrous')

As a result, 11 matches were found:

Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
None

(The trailing None is the return value of concordance, which prints its matches itself; it shows up when the call is wrapped in print.)
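To make the idea behind a concordance concrete, here is a minimal stand-alone sketch in plain Python. It is not NLTK's implementation: the function name simple_concordance, the window parameter, the case-insensitive matching, and the sample sentence are all assumptions made for illustration.

```python
def simple_concordance(tokens, word, window=3):
    """Return each occurrence of `word` with `window` tokens of context per side."""
    lines = []
    target = word.lower()
    for i, token in enumerate(tokens):
        if token.lower() == target:  # case-insensitive match (an assumption of this sketch)
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append("%s [%s] %s" % (left, token, right))
    return lines


# Hypothetical sample text, split on whitespace for simplicity.
sample = "a most monstrous size and a monstrous fable of the whale".split()
matches = simple_concordance(sample, "monstrous", window=2)
for line in matches:
    print(line)
```

Each returned line shows the keyword in brackets with a little context on either side, which is the same kind of view that the concordance output above gives for text1.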

Suppose we want to know where a word appears in the text, for example whether it occurs more often near the beginning or near the end. The dispersion_plot function is used for this. text4 is the Inaugural Address Corpus, a collection of the inaugural addresses of US presidents, covering the years 1789 to 2009.

Let's see where citizens, democracy, freedom, duties, and American appear.

print(len(text4))
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "American"])

E:\python2.7.11\python.exe e:/py_prj/nltk_study/chapter1.py

145735

First, the length of text4 is 145735 tokens. The call to dispersion_plot then draws the scatter plot. Note that you must install NumPy and matplotlib before you can draw it; otherwise the call raises an error.
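A dispersion plot marks, for each word, the positions at which it occurs, counted in tokens from the start of the text. The sketch below computes those positions in plain Python without drawing anything. The helper name word_offsets and the miniature token list are hypothetical, and the exact, case-sensitive matching is an assumption of this sketch.

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of token positions where it occurs."""
    positions = {word: [] for word in targets}
    for offset, token in enumerate(tokens):
        if token in positions:  # exact, case-sensitive match
            positions[token].append(offset)
    return positions


# Hypothetical miniature "corpus"; the real text4 has 145735 tokens.
tokens = "citizens of America value freedom and citizens share duties".split()
offsets = word_offsets(tokens, ["citizens", "freedom", "duties"])
for word in ["citizens", "freedom", "duties"]:
    print(word, offsets[word])
```

Plotting each list of offsets as a row of marks along a shared axis would give a picture of the same kind as the scatter plot.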

From the scatter plot we can see that citizens is the word that appears in the most places. This fits the American political style: the president is speaking in person, so the natural first step is to get closer to the voters. Further along, words like American and freedom become more frequent. Once a rapport with the electorate has been established, the speaker turns to universal values and patriotic appeals: defending human freedom, making America strong, and so on.

Let's add a few more words: China, tax, security, and immigrant.

text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "American", "China", "tax", "security", "immigrant"])

In the new plot the added words are far less visible: apart from a few occurrences of security and tax, words such as China and immigrant hardly appear at all. The words we added (China, tax, security, immigrant) all refer to specific matters of state, yet the inaugural addresses barely mention them. We can therefore conclude that an inaugural address is not a policy agenda, which is covered during the campaign, but a display of eloquence.

To see which distinct words occur in the text, use set(text4), which gives the vocabulary of the inaugural addresses. It is too large to list here. Now that we know both the total number of words and the number of distinct words, we can compute how often each word appears on average. The result below shows that each word in text4 appears about 14 times on average.

print("The length of text4 is %d" % len(text4))
print(len(text4) // len(set(text4)))

E:\python2.7.11\python.exe e:/py_prj/nltk_study/chapter1.py

The length of text4 is 145735

14
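The average is the number of tokens divided by the number of distinct tokens. Under Python 2, dividing two integers with / discards the fractional part, which is why a whole number is reported. The sketch below keeps the fraction by converting to float first. The function name average_occurrences and the sample list are hypothetical.

```python
def average_occurrences(tokens):
    """Average number of times each distinct token occurs."""
    # float() avoids integer truncation on Python 2
    return float(len(tokens)) / len(set(tokens))


# Hypothetical sample: 6 tokens, 3 distinct ones.
tokens = ["we", "the", "people", "we", "the", "we"]
average = average_occurrences(tokens)
print("tokens: %d, distinct: %d, average: %.1f"
      % (len(tokens), len(set(tokens)), average))
```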

So how often does a given word appear in these speeches? Take citizens as an example. You can see that citizens appears 230 times.

print(text4.count('citizens'))

E:\python2.7.11\python.exe e:/py_prj/nltk_study/chapter1.py

230
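Counting tokens is sensitive to capitalization, so citizens and Citizens are counted separately. The sketch below shows the difference on a plain Python list, assuming exact token matching; the sample tokens are made up for illustration.

```python
# Hypothetical sample tokens.
tokens = ["Citizens", "and", "fellow", "citizens", ",", "citizens", "all"]

exact = tokens.count("citizens")  # exact match only
folded = sum(1 for t in tokens if t.lower() == "citizens")  # ignores case

print(exact, folded)
```

If the capitalized form should be included as well, lowercasing the tokens before counting is one simple option.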

What if we want to find the words that appear most often in the inaugural addresses? Should we count every word one by one? That would be far too time-consuming. NLTK provides a dedicated class for this.

fdist1 = FreqDist(text4)
vocabulary1 = [word for word, count in fdist1.most_common(10)]  # ten most frequent tokens
print(vocabulary1)
fdist1.plot(10, cumulative=True)

FreqDist computes a frequency distribution, and fdist1.plot(10, cumulative=True) draws the cumulative counts of the 10 most frequent words.
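In NLTK 3, FreqDist is built on Python's collections.Counter, so the same idea can be tried with the standard library alone. The sketch below is only an illustration on a made-up token list; it counts tokens, lists the most frequent ones, and builds the running totals that a cumulative plot would display.

```python
from collections import Counter

# Hypothetical sample tokens standing in for text4.
tokens = "the people and the union and the law".split()
counts = Counter(tokens)

top = counts.most_common(2)  # the two most frequent tokens with their counts
print(top)

# Running total over the ranked counts: the values a cumulative plot would show.
running = 0
cumulative = []
for word, n in counts.most_common():
    running += n
    cumulative.append((word, running))
print(cumulative)
```

The last running total always equals the number of tokens, so a cumulative curve shows what share of the text the top words account for.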

We can refine this a little: which words occur more than 500 times?

fdist1 = FreqDist(text4)
print([w for w in set(text4) if fdist1[w] > 500])

[u'.', u'have', u'People', u'for', u'I', u'in', u'as', u'to', u'is', u'by', u'this', u'we', u'the', u'no', u'that', u'a', u'the', u';', u',', u'is', u'it', u'"', u'have', u'we', u'and', u'it', u'of', u'or', u'all', u'"', u'from', u'their', u'which', u"'ll"

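Each of these words occurs more than 500 times, so they can be considered high-frequency words.

In the same way we can list the long words, those with more than 15 characters: [w for w in set(text4) if len(w) > 15]. The results are as follows: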

[u'internationality', u'misappropriation', u'irresponsibility', u'enthusiastically', u'disqualification', u'Misrepresentation', u'misunderstanding', u'antiphilosophists', u'responsibilities', u'contradistinction', u'Transcontinental', u'unconstitutional', u'discountenancing', u'sentimentalizing', u'uncharitableness', u'Constitutionally', u'instrumentalities', u'responsibilities'
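Both filters follow the same pattern: walk over the vocabulary and keep the words that pass a test. The sketch below applies the pattern to a small, made-up token list using collections.Counter; the thresholds are scaled down to fit the sample and are not the ones used on text4.

```python
from collections import Counter

# Hypothetical sample tokens; the thresholds are scaled down to fit it.
tokens = "we the people of the united states , we the people".split()
counts = Counter(tokens)

# Words that occur more than twice (the article uses 500 on text4).
frequent = sorted(w for w in set(tokens) if counts[w] > 2)

# Words longer than five characters (the article uses 15 on text4).
long_words = sorted(w for w in set(tokens) if len(w) > 5)

print(frequent)
print(long_words)
```

Sorting the results is optional; it is used here only because a set has no fixed order, which is also why the lists above come out unordered.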

You can also find the most frequent word with fdist1.max(). The result is 'the'.

The relative frequency of a word can be obtained with fdist1.freq('internationality').
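The relative frequency of a word is its count divided by the total number of tokens, and the most frequent word is simply the one with the highest count. The sketch below reproduces both with collections.Counter on a made-up token list; it illustrates the arithmetic and is not a copy of NLTK's code.

```python
from collections import Counter

# Hypothetical sample tokens.
tokens = "the law and the people and the union".split()
counts = Counter(tokens)
total = sum(counts.values())  # total number of tokens

most_frequent = counts.most_common(1)[0][0]  # token with the highest count
relative = counts["and"] / float(total)      # share of all tokens taken by "and"

print(most_frequent, relative)
```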


This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
