Purpose: Crawl nicknames
Target website: Qiushibaike (the "Embarrassing Encyclopedia")
Dependent libraries: requests, sys, beautifulsoup4, imp, io
Python version: 3.4
Description: Reference http://cn.python-requests.org/zh_CN/latest/user/quickstart.html
Steps:
First, get familiar with requests
Requests description:
The requests library is a Python HTTP library that internally relies on the urllib3 library.
Its main features are:
Internationalized domain names and URLs, keep-alive & connection pooling, sessions with persistent cookies, browser-style SSL verification, basic/digest authentication, elegant key/value cookies, automatic decompression, automatic content decoding, Unicode response bodies, multipart file uploads, connection timeouts, streaming downloads, .netrc support, chunked requests, and thread safety.
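As a quick illustration of two of these features, sessions with persistent cookies and connection timeouts, here is a minimal sketch (the httpbin.org endpoints are used only for demonstration and are not part of the original steps):

import requests

s = requests.Session()  # reuses connections (keep-alive) and keeps cookies across requests
s.get('http://httpbin.org/cookies/set/token/abc123', timeout=5)  # server sets a cookie named "token"
r = s.get('http://httpbin.org/cookies', timeout=5)               # the cookie is sent back automatically
print(r.json())  # {'cookies': {'token': 'abc123'}}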
Requests API operations:
The requests API covers all HTTP request types, such as:
GET, POST, PUT, DELETE, HEAD, and OPTIONS
The corresponding requests API calls are (examples):
r = requests.get('https://github.com/timeline.json')
r = requests.post('http://httpbin.org/post')
r = requests.put('http://httpbin.org/put')
r = requests.delete('http://httpbin.org/delete')
r = requests.head('http://httpbin.org/get')
r = requests.options('http://httpbin.org/get')
This article mainly describes the requests GET operation:
Take fetching the GitHub timeline and the format of the server response content as an example:
1. Response Content
import requests
r = requests.get('https://github.com/timeline.json')
r.text
Requests automatically decodes the content of the server response and supports most Unicode encodings. Of course, we can also decode the content with an explicitly specified encoding, for example by setting r.encoding = 'utf-8' before accessing r.text.
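A minimal sketch of specifying the encoding explicitly before reading the text (same GitHub timeline URL as above; printing only a slice of the body is just for brevity):

import requests

r = requests.get('https://github.com/timeline.json')
r.encoding = 'utf-8'   # override the detected encoding before r.text decodes the body
print(r.text[:200])    # first 200 characters of the decoded response body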
2. Binary response content and JSON response content
r.content
r.json()
These two calls are used in place of r.text above: the first accesses the response body as bytes instead of text, and the second decodes the body as JSON.
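A minimal sketch of both accessors against an endpoint that returns JSON (http://httpbin.org/get is used here only for illustration):

import requests

r = requests.get('http://httpbin.org/get')
raw_bytes = r.content   # the body as a bytes object
data = r.json()         # the body parsed into a Python dict (raises ValueError if it is not JSON)
print(type(raw_bytes), data.get('url'))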
3. Raw response content
import requests
r = requests.get('https://github.com/timeline.json', stream=True)
r.raw
r.raw.read(10)
# write the raw data that was fetched to a test.txt file
with open('test.txt', 'wb') as fd:
    for chunk in r.iter_content(10):
        fd.write(chunk)
Second, BeautifulSoup introduction:
BeautifulSoup is a Python library whose primary purpose is to extract data from crawled web content. It provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. Because the toolkit does the document parsing and hands the user the data they need, a complete application can be written with very little code.
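A minimal sketch of what navigating and searching a parse tree looks like (the HTML string and the h2 tag here are illustrative placeholders, not taken from the target site):

from bs4 import BeautifulSoup

html = "<html><body><h2>first nickname</h2><h2>second nickname</h2></body></html>"
soup = BeautifulSoup(html, 'html.parser')   # build a navigable parse tree from the markup
for tag in soup.find_all('h2'):             # search for every <h2> element
    print(tag.get_text())                   # extract just the text inside each tag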
Third, crawl nicknames
Since this is my first time using Python, I wrote the simplest possible crawler. The code is very short; it just fetches the nicknames from the home page of the Embarrassing Encyclopedia:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
from imp import reload
import requests
import sys
import io

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')
# resolves an issue where Unicode output is incompatible with the default ASCII encoding
# reload(sys)
# sys.setdefaultencoding("utf-8")
############################


class Crawler(object):
    def __init__(self):
        print("Start crawling data")

    # getSource fetches the web page source code
    def getSource(self, url):
        html = requests.get(url)
        # print(str(html.text))  # can be printed here to check whether the content was crawled
        return html.text


# Main function
if __name__ == '__main__':
    url = 'http://www.qiushibaike.com'
    testCrawler = Crawler()
    content = testCrawler.getSource(url)
    soup = BeautifulSoup(content, 'html.parser')
    fd = open("crawler.txt", 'w')
    for i in soup.find_all('h2'):
        print(i.get_text())
        fd.write(i.get_text() + '\n')
    fd.close()
A simple Python crawler implementation