Python Simple crawler implementation

Source: Internet
Author: User

Purpose: Crawl nicknames

Target website: Qiushibaike (the "Embarrassing Encyclopedia")

Dependent libraries: requests, sys, beautifulsoup4, imp, io

Python version: 3.4

Description: References http://cn.python-requests.org/zh_CN/latest/user/quickstart.html

Steps:

First, get familiar with Requests

Requests description:

The Requests library is a Python HTTP library that relies on urllib3 internally.

Here are its features:
Internationalized domain names and URLs, keep-alive & connection pooling, sessions with persistent cookies, browser-style SSL verification, Basic/Digest authentication, elegant key/value cookies, automatic decompression, automatic content decoding, Unicode response bodies, multipart file uploads, connection timeouts, streaming downloads, .netrc support, chunked requests, and thread safety.
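For example, the sessions feature keeps cookies across requests and reuses the underlying connection. A minimal sketch against the httpbin.org test service (the cookie name and value are arbitrary):

import requests

# A Session persists cookies and pools connections across requests
s = requests.Session()
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get('http://httpbin.org/cookies')
print(r.text)  # the cookie set by the first request is sent back automatically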

Requests API operations:

The Requests API covers all HTTP request types:

GET, POST, PUT, DELETE, HEAD, and OPTIONS

The corresponding API calls are, for example:

r = requests.get('https://github.com/timeline.json')

r = requests.post("http://httpbin.org/post")

r = requests.put("http://httpbin.org/put")

r = requests.delete("http://httpbin.org/delete")

r = requests.head("http://httpbin.org/get")

r = requests.options("http://httpbin.org/get")
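Each of these calls returns a Response object. A minimal sketch of inspecting one, using only standard Response attributes (the query parameter is illustrative):

import requests

r = requests.get('http://httpbin.org/get', params={'key': 'value'})
print(r.url)          # http://httpbin.org/get?key=value
print(r.status_code)  # 200 on success
print(r.headers['content-type'])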

This article mainly describes the GET operation in Requests:

Take the GitHub timeline and the format of the server's response content as an example:

1. Response Content

import requests

r = requests.get('https://github.com/timeline.json')

r.text

Requests automatically decodes the content based on the server's response and supports most Unicode encodings. We can also decode the content with a specified encoding by setting, for example, r.encoding = 'utf-8' before accessing r.text.
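A minimal sketch of overriding the decoding, assuming the target page is actually UTF-8 encoded:

import requests

r = requests.get('https://github.com/timeline.json')
print(r.encoding)     # the encoding Requests guessed from the response headers
r.encoding = 'utf-8'  # set the encoding before reading r.text
print(r.text)         # now decoded as UTF-8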

2. Binary response content and JSON response content

r.content

r.json()

These two are used in place of r.text above: r.content accesses the response body as bytes rather than decoded text, and r.json() parses the body as JSON.
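A minimal sketch of both, again against httpbin.org (the 'url' field is part of httpbin's echo response):

import requests

r = requests.get('http://httpbin.org/get')
raw = r.content   # the body as bytes, useful for images and other binary data
data = r.json()   # the body parsed into a Python dict
print(data['url'])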

3. Raw response content

import requests

r = requests.get('https://github.com/timeline.json', stream=True)

r.raw

r.raw.read(10)

# write the fetched raw data to a test.txt file
with open('test.txt', 'wb') as fd:
    for chunk in r.iter_content(10):
        fd.write(chunk)
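The 10-byte chunk above is only for demonstration; for real downloads a larger chunk wastes fewer iterations. A variant of the same loop (the 8192-byte chunk size and file name are arbitrary choices):

import requests

r = requests.get('https://github.com/timeline.json', stream=True)
with open('download.json', 'wb') as fd:
    # stream the body in 8 KB chunks instead of loading it all into memory
    for chunk in r.iter_content(chunk_size=8192):
        fd.write(chunk)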

Second, BeautifulSoup introduction:

This is a Python library whose primary function is to extract data from crawled web content. Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that, by parsing the document, gives users the data they need to crawl; because it is simple, a complete application can be written without much code.
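A minimal sketch of what that looks like (the HTML snippet is made up for illustration):

from bs4 import BeautifulSoup

html = "<html><body><h2>Alice</h2><h2>Bob</h2></body></html>"
soup = BeautifulSoup(html, 'html.parser')  # build the parse tree

# find_all() searches the tree; get_text() extracts each tag's text
for tag in soup.find_all('h2'):
    print(tag.get_text())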

Third, crawl nicknames

Because this is my first time using Python, I will write the simplest crawler. The code is very short; it just gets the nicknames from the homepage of Qiushibaike (the Embarrassing Encyclopedia):

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
from imp import reload  # only needed for the commented-out Python 2 workaround below
import requests
import sys
import io

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')
# resolves an issue where Unicode output is incompatible with an ASCII console encoding
# Python 2 equivalent (not needed in Python 3):
# reload(sys)
# sys.setdefaultencoding("utf-8")
############################

class Crawler(object):
    def __init__(self):
        print("Start crawling data")

    # getSource gets the web page source code
    def getSource(self, url):
        html = requests.get(url)
        # print(str(html.text))  # print here to see whether the content was crawled
        return html.text


# main function
if __name__ == '__main__':
    url = 'http://www.qiushibaike.com'
    testCrawler = Crawler()
    content = testCrawler.getSource(url)
    soup = BeautifulSoup(content)
    fd = open("crawler.txt", 'w')
    for i in soup.find_all('h2'):
        print(i.get_text())
        fd.write(i.get_text() + '\n')
    fd.close()
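One note on the code above: BeautifulSoup(content) without an explicit parser triggers a warning in newer versions of bs4, and some sites reject requests that carry the default python-requests User-Agent. A hedged variant of the fetch-and-parse step (the User-Agent string is illustrative):

import requests
from bs4 import BeautifulSoup

# hypothetical tweak: explicit parser plus a browser-like User-Agent header
headers = {'User-Agent': 'Mozilla/5.0'}
html = requests.get('http://www.qiushibaike.com', headers=headers).text
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.find_all('h2'):
    print(tag.get_text().strip())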

  
