DIY Chatbot, Part 2: Getting to Know the NLTK Library



NLTK is an excellent natural language processing toolkit and an important tool for our chatbot. This section covers its installation and basic use.

Please respect the original: when reprinting, please credit the source website www.shareditor.com and link to the original article.

NLTK library installation

pip install nltk

Launch Python and download the book corpora:

[root@centos ~]# python
Python 2.7.11 (default, Jan, 08:29:18)
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download()

In the downloader window that opens, select "book" and click Download to start downloading.
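If you are working on a server without a graphical display, the interactive downloader may not open. As a minimal sketch, the same "book" collection can also be fetched non-interactively by passing its identifier to nltk.download():

>>> import nltk
>>> nltk.download('book')   # downloads the "book" collection without opening the GUI window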

After the download is complete, enter:

>>> from nltk.book import *

You will see the books load normally, as follows:

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G. K. Chesterton 1908

Each of these text* objects is a book; entering text1 directly prints the book's title:

>>> text1
<Text: Moby Dick by Herman Melville 1851>

Search Text

Run:

>>> text1.concordance("former")

The sentence contexts that contain "former" are displayed.


We can also search for related words, such as:

>>> text1.similar("ship")
whale boat sea captain the world to head time crew man and pequod line
deck body fishery air boats side voyage

Given "ship", it finds words such as "boat" that are used in similar contexts, i.e. rough synonyms.

We can also see where a word appears across a text (this draws a dispersion plot, which requires matplotlib):

>>> text4.dispersion_plot(["Citizens", "democracy", "freedom", "duties", "America"])

Word Statistics

len(text1): returns the total number of tokens in the text

set(text1): returns the set of distinct words in the text

len(set(text4)): returns the number of distinct words in the text

text4.count("is"): returns the number of occurrences of the word "is"

FreqDist(text1): computes the word frequencies of the text, kept sorted from most to least frequent

fdist1 = FreqDist(text1); fdist1.plot(cumulative=True): computes the word frequencies and plots the cumulative graph

In this plot, the vertical axis shows the running total of occurrences of the words along the horizontal axis, so together these words account for almost all the words in the text.

fdist1.hapaxes(): returns the words that appear only once

text4.collocations(): frequent two-word collocations
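Putting these together, here is a minimal sketch (it uses NLTK 3's FreqDist API and assumes the book collection downloaded earlier is installed):

>>> from nltk.book import text1, text4
>>> from nltk import FreqDist
>>> len(text1)                        # total number of tokens in Moby Dick
>>> len(set(text1))                   # number of distinct words
>>> text4.count("is")                 # occurrences of "is" in the inaugural corpus
>>> fdist1 = FreqDist(text1)          # frequency distribution over the words of text1
>>> fdist1.most_common(10)            # the ten most frequent words
>>> fdist1.hapaxes()[:10]             # a few of the words that occur only once
>>> fdist1.plot(50, cumulative=True)  # cumulative frequency plot (requires matplotlib)
>>> text4.collocations()              # frequent two-word collocations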

Key points of natural language processing

Automatic language understanding: consider "The Chinese team defeated the American team" and "The Chinese team beat the American team". In the original Chinese example these two sentences use the antonym pair "win" and "defeat", yet they express the same meaning: China won, the United States lost. The machine has to be able to work out automatically who won and who lost.

Automatic language generation: automatically generating language is based on automatically understanding it; without understanding, generation is impossible.

Machine translation: there are now many machine translation systems, but good results are still hard to achieve. For example, translate a Chinese sentence into English, back into Chinese, then into English again; after about ten rounds, the result differs greatly from the original.

Man-machine dialogue: this is the ultimate goal we want to achieve. One criterion is the "Turing test": if, in a five-minute conversation, the machine can fool at least 30% of human judges, it passes and can be considered intelligent.

Natural language processing has two schools. One is rule-based: language is analyzed and processed entirely according to grammar, syntax, and other hand-written rules. This approach went through many years of trial and failure in the last century, because there are simply too many rules and much real language does not play by them; it is like chasing your own shadow: however fast you run, you never catch it.

The other school is statistics-based: collect a large amount of corpus data and learn to understand language through statistical methods. This approach has attracted more and more attention and has become the mainstream, because with the development of hardware, storing and computing over big data is no longer a problem, and whatever the rules may be, language does follow statistical regularities. The statistical approach has its own shortcoming, of course: its implicit assumption that "small-probability events never happen" means some problems can never be fully solved.

In the next section we will tackle the corpus problem using the statistical approach.
