Natural Language Processing 3.1: Accessing Text from the Web and from Disk

Source: Internet
Author: User
Tags nltk

The most important source of text is undoubtedly the Web. Ready-made text collections are convenient to explore, but you probably have your own text sources in mind and need to learn how to access them.

We start by learning how to access text from the Web and from the hard disk.

1. E-books

The NLTK corpus collection includes a small sample of texts from Project Gutenberg. If you are interested in other Project Gutenberg texts, you can browse the catalog at http://www.gutenberg.org/catalog/

The following example uses text number 2554, "Crime and Punishment", to show briefly how to access such a text from Python.

    # -*- encoding: utf-8 -*-
    from urllib.request import urlopen
    import nltk

    url = 'http://www.gutenberg.org/files/2554/2554.txt'
    # urlopen(...).read() returns bytes; decode them into a str
    raw = str(urlopen(url).read(), encoding='utf-8')
    print(type(raw))

At this point the output is <class 'str'>

    >>> print(raw[:75])
    The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky
    >>> print(len(raw))
    1176831
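The download step above can be wrapped in a small function. The following is only a sketch: fetch_text is a hypothetical helper name, and the timeout and errors="replace" choices are assumptions rather than anything the original example prescribes.

```python
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8", timeout=30):
    """Download a resource and decode its bytes into a str.

    fetch_text is a hypothetical helper, not part of NLTK or urllib.
    """
    # The context manager closes the connection even if read() raises.
    with urlopen(url, timeout=timeout) as response:
        data = response.read()  # bytes, not str
    # errors="replace" substitutes a marker for bytes that are invalid
    # in the chosen encoding instead of raising UnicodeDecodeError.
    return data.decode(encoding, errors="replace")
```

Because urlopen also accepts file:// URLs, the same function can read a text file from the hard disk, for example fetch_text(pathlib.Path("book.txt").resolve().as_uri()), where book.txt is a placeholder file name.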

The variable raw contains 1,176,831 characters. This is the raw content of the book, including many details we are not interested in, such as spaces and line breaks. For language processing we want to break the string into words and punctuation. This step is called tokenization, and it produces a list of words and punctuation marks.

    >>> token = nltk.word_tokenize(raw)
    >>> print(type(token))
    <class 'list'>
    >>> print(len(token))
    254352
    >>> print(token[:10])
    ['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']
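To make the idea of tokenization concrete without installing NLTK or its data, here is a minimal regular-expression sketch. It is not how nltk.word_tokenize works internally; simple_tokenize and TOKEN_PATTERN are hypothetical names, and the pattern is a deliberate simplification.

```python
import re

# \w+ matches a run of letters, digits or underscores (a "word");
# [^\w\s] matches one character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word and punctuation tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text)


sample = "The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky"
tokens = simple_tokenize(sample)
print(tokens[:10])
print(len(tokens))
```

On this one-line sample the first ten tokens match the NLTK output shown above. The two approaches differ on harder cases: this pattern splits a contraction such as "don't" at the apostrophe, whereas NLTK produces tokens like "n't".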

Note that NLTK was needed for tokenization, but not for the earlier steps of opening the URL and reading its content into a string. If we go one step further and create an NLTK Text from this list, we can still carry out regular list operations such as slicing, and we gain methods such as collocations():

    >>> text = nltk.Text(token)
    >>> print(text[1020:1060])
    ['AND', 'PUNISHMENT', 'PART', 'I', 'CHAPTER', 'I', 'On', 'an', 'exceptionally', 'hot', 'evening',
     ...
     'in', 'S.', 'Place', 'and', 'walked', 'slowly', ',', 'as', 'though', 'in', 'hesitation', ',', 'towards', 'K']
    >>> text.collocations()
    Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya Romanovna; Rodion Romanovitch; Marfa Petrovna; Sofya Semyonovna; old woman; Project Gutenberg-tm; Porfiry Petrovitch; Amalia Ivanovna; great deal; Nikodim Fomitch; young man; Ilya Petrovitch; n't know; Project Gutenberg; Dmitri Prokofitch; Andrey Semyonovitch; Hay Market
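The collocations listed above are word pairs that occur together unusually often. As a rough illustration of the idea, the sketch below simply counts adjacent word pairs. This is an assumption-laden simplification and not NLTK's method: frequent_bigrams is a hypothetical helper, and the length filter is only a crude stand-in for the filtering a real collocation finder applies.

```python
from collections import Counter


def frequent_bigrams(tokens, top_n=5, min_len=3):
    """Return the top_n most common adjacent word pairs with their counts."""

    def keep(tok):
        # Ignore punctuation and very short words.
        return tok.isalpha() and len(tok) >= min_len

    # Pair each token with its right-hand neighbour, keeping only pairs
    # in which both members pass the filter.
    pairs = [(a, b) for a, b in zip(tokens, tokens[1:]) if keep(a) and keep(b)]
    return Counter(pairs).most_common(top_n)


demo = ["Katerina", "Ivanovna", "spoke", ",", "and",
        "Katerina", "Ivanovna", "wept", "."]
for (first, second), count in frequent_bigrams(demo, top_n=1):
    print(first, second, count)
```

Raw frequency favours pairs of common words, which is why practical collocation finders score pairs by how surprising their co-occurrence is rather than by count alone.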

Here we introduce the find() and rfind() string methods.

For example, a text downloaded from Project Gutenberg contains a header with the name of the text, the author, and so on. So before working with the content, you need to inspect the file manually to find the specific strings that mark the beginning and end of the actual content.

    >>> start = raw.find('PART I')
    >>> end = raw.rfind("End of Project Gutenberg's Crime")
    >>> raw = raw[start:end]
    >>> print(raw.find('PART I'))
    0

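The methods find() and rfind() ("reverse find") return the index at which a substring occurs: find() gives the first occurrence and rfind() the last. Both return -1 if the substring is not found. The two indexes are then used to slice the string.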
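The trimming step can be tried on a short made-up string, with no download required. The sketch below is one possible way to package it: trim_between is a hypothetical helper, the sample text is invented, and falling back to the full string when a marker is missing is an assumption of this example.

```python
def trim_between(raw, start_marker, end_marker):
    """Return raw from the first start_marker up to the last end_marker."""
    start = raw.find(start_marker)  # index of the first occurrence, or -1
    end = raw.rfind(end_marker)     # index of the last occurrence, or -1
    if start == -1:
        start = 0         # start marker missing: keep from the beginning
    if end == -1:
        end = len(raw)    # end marker missing: keep to the end
    return raw[start:end]


sample = ("Title: Demo\nAuthor: Nobody\n"
          "PART I\nSome body text.\n"
          "End of Project Demo\nLicense")
body = trim_between(sample, "PART I", "End of Project")
print(body.find("PART I"))
print(repr(body))
```

Without the -1 checks, a missing start marker would make the slice begin at the last character, silently discarding almost the whole text.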
