Purpose: Crawl nicknames
Target website: Qiushibaike (the "Embarrassing Encyclopedia")
Dependent libraries: requests, sys, beautifulsoup4, imp, io
Python version: 3.4
Description: Reference http://cn.python-requests.org/zh_CN/latest/user/quickstart.html
Steps:
First, get familiar with requests
Requests description:
The requests library is a Python HTTP library that internally relies on the urllib3 library.
Its main features are:
Internationalized domain names and URLs, keep-alive & connection pooling, sessions with persistent cookies, browser-style SSL verification, basic/digest authentication, elegant key/value cookies, automatic decompression, automatic content decoding, Unicode response bodies, multipart file uploads, connection timeouts, streaming downloads, .netrc support, chunked requests, and thread safety.
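As a quick illustration of two of these features, sessions with persistent cookies and connection timeouts, here is a minimal sketch (the httpbin.org endpoints are used only for demonstration and are not part of the original steps):

import requests

s = requests.Session()  # reuses connections (keep-alive) and keeps cookies across requests
s.get('http://httpbin.org/cookies/set/token/abc123', timeout=5)  # server sets a cookie named "token"
r = s.get('http://httpbin.org/cookies', timeout=5)               # the cookie is sent back automatically
print(r.json())  # {'cookies': {'token': 'abc123'}}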
Requests API operations:
The requests API covers all HTTP request types, such as:
GET, POST, PUT, DELETE, HEAD, and OPTIONS
The corresponding requests API calls are (examples):
r = requests.get('https://github.com/timeline.json')
r = requests.post('http://httpbin.org/post')
r = requests.put('http://httpbin.org/put')
r = requests.delete('http://httpbin.org/delete')
r = requests.head('http://httpbin.org/get')
r = requests.options('http://httpbin.org/get')
This article mainly describes the requests GET operation:
Take fetching the GitHub timeline and the format of the server response content as an example:
1. Response Content
import requests
r = requests.get('https://github.com/timeline.json')
r.text
Requests automatically decodes the content of the server response and supports most Unicode encodings. Of course, we can also decode the content with an explicitly specified encoding, for example by setting r.encoding = 'utf-8' before accessing r.text.
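A minimal sketch of specifying the encoding explicitly before reading the text (same GitHub timeline URL as above; printing only a slice of the body is just for brevity):

import requests

r = requests.get('https://github.com/timeline.json')
r.encoding = 'utf-8'   # override the detected encoding before r.text decodes the body
print(r.text[:200])    # first 200 characters of the decoded response body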
2. Binary response content and JSON response content
r.content
r.json()
These two calls are used in place of r.text above: the first accesses the response body as bytes instead of text, and the second decodes the body as JSON.
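A minimal sketch of both accessors against an endpoint that returns JSON (http://httpbin.org/get is used here only for illustration):

import requests

r = requests.get('http://httpbin.org/get')
raw_bytes = r.content   # the body as a bytes object
data = r.json()         # the body parsed into a Python dict (raises ValueError if it is not JSON)
print(type(raw_bytes), data.get('url'))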
3. Raw response content
import requests
r = requests.get('https://github.com/timeline.json', stream=True)
r.raw
r.raw.read(10)
# write the raw data that was fetched to a test.txt file
with open('test.txt', 'wb') as fd:
    for chunk in r.iter_content(10):
        fd.write(chunk)
Second, BeautifulSoup introduction:
BeautifulSoup is a Python library whose primary purpose is to extract data from crawled web content. It provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. Because the toolkit does the document parsing and hands the user the data they need, a complete application can be written with very little code.
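A minimal sketch of what navigating and searching a parse tree looks like (the HTML string and the h2 tag here are illustrative placeholders, not taken from the target site):

from bs4 import BeautifulSoup

html = "<html><body><h2>first nickname</h2><h2>second nickname</h2></body></html>"
soup = BeautifulSoup(html, 'html.parser')   # build a navigable parse tree from the markup
for tag in soup.find_all('h2'):             # search for every <h2> element
    print(tag.get_text())                   # extract just the text inside each tag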
Third, crawl nicknames
Since this is my first time using Python, I wrote the simplest possible crawler. The code is very short; it just fetches the nicknames from the home page of the Embarrassing Encyclopedia:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
from imp import reload
import requests
import sys
import io

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')
# resolves an issue where Unicode output is incompatible with the default ASCII encoding
# reload(sys)
# sys.setdefaultencoding("utf-8")
############################


class Crawler(object):
    def __init__(self):
        print("Start crawling data")

    # getSource fetches the web page source code
    def getSource(self, url):
        html = requests.get(url)
        # print(str(html.text))  # can be printed here to check whether the content was crawled
        return html.text


# Main function
if __name__ == '__main__':
    url = 'http://www.qiushibaike.com'
    testCrawler = Crawler()
    content = testCrawler.getSource(url)
    soup = BeautifulSoup(content, 'html.parser')
    fd = open("crawler.txt", 'w')
    for i in soup.find_all('h2'):
        print(i.get_text())
        fd.write(i.get_text() + '\n')
    fd.close()
A simple Python crawler implementation