Preface
I have been crawling data with scrapy and urllib for a while; recently I tried requests and liked it, so this time I hope to use a data-crawling project to give crawler enthusiasts and beginners a better sense of the preparation work, of how requests-based scraping operates, and of the problems that come up along the way. This is, of course, a simple crawler project. I will focus on the preparation that comes before writing any code, because my goal is to help self-taught crawler enthusiasts and beginners like me understand that part of the work better.
I. Observing the target page: templates and strategy

Many people neglect this step, yet it is the most important one: it determines what strategy you will use to get the data, and it also lets you estimate how much you can realistically get.

(1) Open the browser's developer tools (F12)
I use Google Chrome here. Open the browser and press F12, and you will be able to see how the page is loaded, which network requests are made, and which parameters are exchanged. If nothing shows up, search for how to open the developer tools in your own browser. After opening Zhihu's home page, the Network tab of the developer tools is still empty, as shown in Figure 1.1 below:
Figure 1.1: Developer tools
Then type something into the search box, and you will see the traffic between the foreground and the background change: what data is loaded, how it is requested, and with which parameters, as shown in Figure 1.2:
Figure 1.2: Interaction between the server and the browser
You can see a lot of JS files and PNG files; all of them were returned by the server as a result of that one search action. Going through them helps you understand how the web server and the browser interact. With some experience you can quickly spot the interaction file you want based on its size and its name.
Because we just performed a search, it is easy to see that the first file whose name contains the search field is the one that talks to the server during a search. Clicking it shows Figure 1.3:
Figure 1.3: Request parameters of the communication with the server
This is the record of our communication with the server and the related data. At the top right you can see the Headers, Preview, Response, Cookies and other tabs.
- Headers shows the parameters of the request. Crawler requests are often rejected by the server because some parameter check fails, so learning to use what you see here is essential. Request Headers is the header sent with your request; the important fields are Cookie and User-Agent. The cookie lets you skip logging in: it expires eventually, but it can save you time, and it is also one of the ways a site checks whether you are a crawler. The User-Agent is inspected by almost every server, so it has to be sent with every request. Query String Parameters are the parameters carried by the request; since this is a GET request, they are visible directly in the request URL. For a POST request the parameters shown here are just as important and must be sent with every request, for example the account name and password of a login form (see the sketch after this list).
- Preview is what the site will actually present to you, i.e. a preview of the content.
- Response shows the returned text. Sometimes you can see JSON data directly here; when dealing with a dynamic page, if you can find the data you want in such a file, you can skip the JS entirely and request that file directly.
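As a minimal sketch of how those pieces map onto a requests call: the snippet below sends the copied User-Agent and cookie string as headers and passes the query string as a params dict. The cookie value is a placeholder, and the parameter names are the ones visible in the search request above.

import requests

headers = {
    # values copied from the Request Headers tab; the cookie below is a placeholder
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Mobile Safari/537.36',
    'Cookie': '_zap=...; d_c0=...; z_c0=...',
}
# Query String Parameters of the GET request, exactly as shown in the developer tools
params = {'q': 'python', 'correction': 1, 'type': 'content', 'offset': 0}
response = requests.get('https://www.zhihu.com/r/search', headers=headers, params=params)
print(response.status_code)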
After searching from the home page you will notice that the page itself does not change; new data only appears as you keep scrolling down. In the network panel we can see that scrolling produces a new file, Figure 1.4:
Figure 1.4: Network changes after scrolling down
There is now one more search request file. Opening it and comparing it with the first one, we find that the offset field changed from 0 to 10. Copy this URL, paste it into a newly opened tab, and you get Figure 1.5:
Figure 1.5: Page opened from the copied link
You can see that JSON-type data is returned. The Chinese here is in Unicode-escaped form, so we need to write a little code: copy these strings and decode them back into Chinese. After decoding you will find that this is exactly the data displayed on the page as you scroll down. We can extract what we need from it with regular expressions. This is only the first layer of the site; to enter the second layer (clicking a link and jumping to the next page), we will extract all the links found here.
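As a quick illustration of the decoding step, the escape sequences below are an example copied by hand; running them through the unicode-escape codec turns them back into readable Chinese:

# raw is a hand-copied example of the escaped text seen in the response
raw = '\\u77e5\\u4e4e'
print(raw.encode('latin-1').decode('unicode-escape'))   # prints 知乎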
Reviewing my earlier article introducing requests, it is easy to write:
# assume each keyword has at most 500 pages; if you don't find what you want, you can break early
for i in range(500):
    # key is the keyword being searched
    url = 'https://www.zhihu.com/r/search?q=' + key + '&correction=1&type=content&offset=' + str(i * 10)
    try:  # try/except is used here; the reason is explained in section III
        response = requests.get(url, headers=headers)
    except:
        continue
    response.encoding = 'unicode-escape'
    page = lxml.html.fromstring(response.text)
    rule = re.compile('<a target="_blank" href="(.*?)"')
    hrefs = re.findall(rule, response.text)
All right, the first layer is almost done. Now go to the second layer of the site: click one of the titles in our search results to jump to a new page, and use the same method to observe how it interacts with the server, Figure 1.6:
Figure 1.6: Server interaction on the second layer
A lot of files are loaded over the network, and we need to find the one holding the key information. If you have ever tried requesting the opened URL directly from a program, you will have found that what you get is not what you see in the browser: much of the content is produced by the browser running JS files to render the data. The code we write does not perform that step, which is why a program accessing a dynamic page does not see the data.
You could also pull the JS into your code and trigger it, but that gets very complicated: there are many JS files and many functions, and finding the one you need is not easy. We need a faster and more effective way, which is exactly why "observation" is the most important part of crawling.
So we go through the files one by one, first looking at each Response to see what it returns. The file names help as well: one of them is called answers, which is easy to guess holds the comments and answers. Clicking into its Response confirms that the data is indeed what we expected, Figure 1.7:
Figure 1.7: Request parameters of the answers request
With the earlier experience you can tell exactly how offset is used to turn pages, and that limit is the number of answers per page. Copy this link as we did on the first layer and open it in a new tab, and you can see the returned JSON data. All that is left is the parsing work of extracting the relevant fields.
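Just to make the paging rule concrete, the loop below builds the first few page URLs by hand; num is a placeholder question id and the long include parameter is left out, so treat this as a sketch of the offset arithmetic rather than the exact request used later:

# each page returns `limit` answers, and `offset` advances by the same amount
num = '12345678'   # placeholder question id
for i in range(3):
    url = ('https://www.zhihu.com/api/v4/questions/' + num +
           '/answers?offset=' + str(i * 20) + '&limit=20&sort_by=created')
    print(url)     # offset = 0, 20, 40, ...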
II. Building the crawler
Through observation we have found the server's request entry points: collect the links from the first layer, then fetch the data from the second layer. The second layer actually comes in several layouts: the question-and-answer type (/question/), the article type (zhuanlan), and video, each of which has to be handled separately. I will only cover parsing the question-and-answer and article pages; the methods are very similar.
(1) Search the keywords and enter the relevant pages
import requests
import re
import lxml.html

keys = ['...']   # define your own list of keywords

# same User-Agent as used in the functions below; the search request also needs a header
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Mobile Safari/537.36'
}

for key in keys:
    # assume at most 500 pages of results
    for i in range(500):
        url = 'https://www.zhihu.com/r/search?q=' + key + '&correction=1&type=content&offset=' + str(i * 10)
        try:
            response = requests.get(url, headers=headers)
        except:
            continue
        response.encoding = 'unicode-escape'
        page = lxml.html.fromstring(response.text)
        rule = re.compile('<a target="_blank" href="(.*?)"')
        hrefs = re.findall(rule, response.text)
        rule2 = re.compile('<em>(.*?)<')
        keyword = re.findall(rule2, response.text)
        response.close()
        hrefs = [h.replace('\\', '') for h in hrefs]
        if key in keyword:
            for href in hrefs:
                if 'question' in href:
                    num = href.replace('/question/', '')
                    getquestionmsg(key, num)
                elif 'zhuanlan' in href:
                    getpagemsg(key, href)
        else:
            break
(2) Extracting data from article (zhuanlan) pages
def getpagemsg(key, url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Mobile Safari/537.36'
    }
    try:
        response = requests.get(url, headers=headers)
    except:
        return
    # print(response.status_code)
    print(key)
    # here I use XPath to parse the page text
    page = lxml.html.fromstring(response.text)
    title = page.xpath('//title/text()')
    title = title[0]
    content = page.xpath('//div[@class="RichText Post-RichText"]//p/text()')
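The function stops at collecting the paragraph texts. Purely as an illustration, and not part of the original, the indented lines below could sit at the end of the function body to join the paragraphs and append them to a text file named after the keyword:

    # hypothetical continuation: join the <p> texts and keep them together with the keyword
    text = '\n'.join(content)
    with open(key + '.txt', 'a', encoding='utf-8') as f:
        f.write(title + '\n' + text + '\n\n')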
(3) Extracting data from question-and-answer pages
def getquestionmsg(key, num):
    # assume the question has at most 500 pages of answers
    for i in range(500):
        url = ('https://www.zhihu.com/api/v4/questions/' + num + '/answers?include='
               'data%5b*%5d.is_normal%2cadmin_closed_comment%2creward_info%2cis_collapsed%2c'
               'annotation_action%2cannotation_detail%2ccollapse_reason%2cis_sticky%2ccollapsed_by%2c'
               'suggest_edit%2ccomment_count%2ccan_comment%2ccontent%2ceditable_content%2cvoteup_count%2c'
               'reshipment_settings%2ccomment_permission%2ccreated_time%2cupdated_time%2creview_info%2c'
               'relevant_info%2cquestion%2cexcerpt%2crelationship.is_authorized%2cis_author%2cvoting%2c'
               'is_thanked%2cis_nothelp%2cupvoted_followees%3bdata%5b*%5d.mark_infos%5b*%5d.url%3b'
               'data%5b*%5d.author.follower_count%2cbadge%5b%3f(type%3dbest_answerer)%5d.topics'
               '&offset=' + str(i * 20) + '&limit=20&sort_by=created')
        headers = {
            'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Mobile Safari/537.36'
        }
        # fill in your own cookie values: copy them from the Request Headers of the page, as described in step one
        cookies = {
            '_zap': '', 'q_c1': '', 'd_c0': '', 'z_c0': '', '__dayu_pp': '',
            'aliyungf_tc': '', '_xsrf': '', '__utmc': '', '__utmv': '', '__utmz': '',
            '__utma': '', 's-q': '', 'sid': '', 's-i': '',
        }
        try:
            response = requests.get(url, headers=headers, cookies=cookies)
        except:
            return
        response.encoding = 'unicode-escape'
        print(key)
        rule = re.compile('"excerpt": "(.*?)"')
        content = re.findall(rule, response.text)
        rule2 = re.compile('"title": "(.*?)"')
        title = re.findall(rule2, response.text)
        content = ','.join(content)
        response.close()
        try:
            print(title[1])
            title = title[1]
            print(content)
        except:
            return
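Since the endpoint returns JSON, an alternative to regex-matching the raw text is to let requests parse it. The fragment below is only a sketch, not the author's code: it reuses the url, headers and cookies names from inside the loop above, and it assumes the response keeps the usual envelope with a top-level data list whose items carry the excerpt field that the regex captures.

# sketch: parse the same response as JSON instead of regex-matching the raw text
data = requests.get(url, headers=headers, cookies=cookies).json()
for answer in data.get('data', []):        # assumed top-level "data" list
    print(answer.get('excerpt', ''))       # same field the "excerpt" regex captures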
For the question-and-answer requests I added the cookies, because requesting the URL directly did not return the data; after simulating the request headers we observed, including the cookie, the request succeeded.
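A small helper sketch for filling in that cookies dict: paste the raw Cookie header copied from the developer tools (the string below is a placeholder) and split it into the key/value pairs that requests expects.

# placeholder cookie string copied from the Request Headers in the developer tools
raw_cookie = '_zap=xxx; d_c0=xxx; z_c0=xxx'
cookies = dict(item.split('=', 1) for item in raw_cookie.split('; '))
print(cookies)   # {'_zap': 'xxx', 'd_c0': 'xxx', 'z_c0': 'xxx'}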
III. Testing and problems
While crawling with requests you will often run into this exception:
requests.exceptions.ConnectionError: HTTPSConnectionPool: Max retries exceeded with url:
The explanations I found on Baidu say that the number of connection requests exceeded the limit and that you should close connections or raise the default pool size, but I tried that and the problem persisted. I think the error has some other cause, so for now I wrap every requests.get() call in a try block and catch the exception to make sure the program keeps running. If you have a better solution or a better understanding of this problem, please leave a reply or send me a private message; I would be very grateful.
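One common mitigation, sketched here under the assumption that the rest of the script stays the same (I have not verified it against this particular site): reuse a single Session and let urllib3 retry with a back-off, instead of relying only on try/except.

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
session.mount('https://', HTTPAdapter(max_retries=retries))

# use session.get() wherever requests.get() was used before; url/headers as defined earlier
response = session.get(url, headers=headers, timeout=10)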
Conclusion
That brings this crawler project to a close. I spent most of the space on the early work because I think it is what a crawler engineer should pay attention to; I did not pay attention to it, fell into many pits and took many detours, so I used this project to stress the early-stage work. What I described is of course only a small piece: real jobs can be more complex. Sometimes you search for a long time without finding an entry point to the data; sometimes you find a way in only to hit parameter verification; and sometimes you finally get to the data only to discover that the site could have provided far more than you asked for. All of this is what the preparation work is meant to sort out. This chapter is dedicated to self-taught crawler enthusiasts and beginners like me.