Learning Python 3 Web Crawler Development from Scratch (Part 3): Disguising as a Browser


Source: Jecvay Notes (@Jecvay)

Last time I wrote a primitive, barely-able-to-run crawler: the Alpha version. The Alpha version has many problems. For example, when a website cannot be reached, the crawler keeps waiting for the connection to return a response and does not know to time out and skip it; some websites deliberately block crawlers, and ours cannot disguise itself as a regular browser to get through; and the crawled content is not saved locally, so it is of little use. This time we solve these small problems.

Also, when I wrote the second installment of this series, I was someone who knew nothing about terms like HTTP GET, POST and response, and I felt it was not right to keep writing a crawler that way. So I read most of the second chapter of the book "Computer Networking: A Top-Down Approach". If you know nothing about the HTTP mechanism, I also recommend looking up some material. While you are at it, install a piece of software called Fiddler and practice with it: observe how a browser accesses a website, how it makes a request, how it handles the response, how it redirects, and even how it handles login authentication. As the old saying goes, the more you use Fiddler, the deeper your understanding of the theory; and the deeper your understanding of the theory, the better you use Fiddler. In the end, whatever we do with crawlers, Fiddler is always one of the best assistants.

Add Timeout Skip Feature

First, I simply change

urlop = urllib.request.urlopen(url)

to

urlop = urllib.request.urlopen(url, timeout = 2)

After running it, I found that when a timeout occurs the program is interrupted by an exception. So I also put this statement inside a try..except structure, and the problem is solved.
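A minimal sketch of what this might look like (the URL and the except clause are illustrative, not the article's exact code):

import urllib.request

url = 'http://www.baidu.com/'
try:
    # Give up on the connection after 2 seconds instead of waiting forever
    urlop = urllib.request.urlopen(url, timeout=2)
    data = urlop.read()
except Exception as e:
    # A timeout (or any other network error) lands here, so the crawler simply skips this URL
    print('skipped', url, 'because of', e)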

Support Auto Jump

When crawling http://baidu.com, the crawler fetches back a page with no content, and that page tells us we should jump to http://www.baidu.com. However, our crawler does not support automatic redirects yet, so let's add this feature now, so that when the crawler crawls baidu.com it can also fetch the content of www.baidu.com.

First of all, we need to know what page is returned when we crawl http://baidu.com. We can either look at it with Fiddler or write a small crawler to fetch it. What I captured is shown below; you should also try writing a few lines of Python to fetch it yourself.

<meta http-equiv="refresh" content="0;url=http://www.baidu.com/">

Looking at this code, we can see it is HTML using a meta tag to refresh and redirect: the 0 means wait 0 seconds before jumping, i.e. jump immediately. So, as we discussed last time, we can extract the URL with a regular expression to get to the right place. In fact, the crawler we wrote last time can already do this; it is singled out here only to illustrate the HTTP meta redirect.
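As a rough illustration (the regular expression and variable names below are my own assumptions, not the original article's code, and the pattern assumes double-quoted attributes), detecting the meta refresh and following it once might look like this:

import re
import urllib.request

url = 'http://baidu.com/'
data = urllib.request.urlopen(url, timeout=2).read().decode('utf-8')

# Look for <meta http-equiv="refresh" content="0;url=..."> and pull out the target URL
match = re.search(r'content\s*=\s*"0;\s*url=([^"]+)"', data, re.IGNORECASE)
if match:
    new_url = match.group(1)
    data = urllib.request.urlopen(new_url, timeout=2).read().decode('utf-8')
    print('redirected to', new_url)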

Disguising as a Regular Browser

What we have written so far are small things. Now let's examine in detail how to make our Python crawler look like a regular browser to the websites it visits. Because if it does not disguise itself, some sites simply cannot be crawled. If you have read the theory, you know that what we need to do is add a User-Agent to the header of the GET request.

If you have not read up on the theory, search for and study the following keywords :D

    • There are two types of HTTP messages: Request messages and response messages
    • The request line and header lines of a request message
    • GET, POST, HEAD, PUT, DELETE method

When I visit the Baidu homepage with the IE browser, the request message sent by the browser is as follows:

GET http://www.baidu.com/ HTTP/1.1
Accept: text/html, application/xhtml+xml, */*
Accept-Language: en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko
Accept-Encoding: gzip, deflate
Host: www.baidu.com
DNT: 1
Connection: Keep-Alive
Cookie: baiduid=57f4d171573a6b88a68789ef5ddfe87:fg=1; UC_LOGIN_UNIQUE=CCBA6E8D978872D57C7654130E714ABD; bd_upn=11263145; BD

After Baidu received this message, the response message returned to me is as follows (abridged):

HTTP/1.1 200 OK
Date: Mon, Sep 13:07:01 GMT
Content-Type: text/html; charset=utf-8
Connection: Keep-Alive
Vary: Accept-Encoding
Cache-Control: private
Cxy_all: baidu+8b13ba5a7289a37fb380e0324ad688e7
Expires: Mon, Sep 13:06:21 GMT
X-Powered-By: HPHP
Server: BWS/1.1
BDPAGETYPE: 1
BDQID: 0x8d15bb610001fe79
BDUSERID: 0
Set-Cookie: BDSVRTM=0; path=/
Set-Cookie: BD_HOME=0; path=/
Content-Length: 80137

<!DOCTYPE html><!--STATUS OK-->

If you can understand the first line of this exchange, that is enough for now; the rest can be studied slowly later with the help of Fiddler. So what we have to do is: when our Python crawler sends its GET request to Baidu, write a User-Agent into the request along the way, to indicate that we are a browser.

There are a number of ways to add headers when sending a GET request; here are two of them.

The first method is simple and straightforward, but it does not make the functionality extensible. The code is as follows:

import urllib.request

url = 'http://www.baidu.com/'
req = urllib.request.Request(url, headers = {
    'Connection': 'Keep-Alive',
    'Accept': 'text/html, application/xhtml+xml, */*',
    'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko'
})
oper = urllib.request.urlopen(req)
data = oper.read()
print(data.decode())

The second method uses build_opener, which has the advantage that you can customize the opener's functionality. For example, the following code extends it to handle cookies automatically.

import urllib.request
import http.cookiejar

# head: dict of headers
def makeMyOpener(head = {
    'Connection': 'Keep-Alive',
    'Accept': 'text/html, application/xhtml+xml, */*',
    'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko'
}):
    cj = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    header = []
    for key, value in head.items():
        elem = (key, value)
        header.append(elem)
    opener.addheaders = header
    return opener

oper = makeMyOpener()
uop = oper.open('http://www.baidu.com/', timeout = 1000)
data = uop.read()
print(data.decode())

The GET message captured by Fiddler after the above code is run is as follows:

GET http://www.baidu.com/ HTTP/1.1
Accept-Encoding: identity
Connection: close
Host: www.baidu.com
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko
Accept: text/html, application/xhtml+xml, */*
Accept-Language: en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3

It can be seen that everything we wrote in the code has been added to the request message.

Save the captured messages

By the way, a word about file operations. Python's file handling is quite convenient and worth a quick mention. The data comes back in binary form; after decode() turns it into a string, it can be saved as text. By changing the mode in which you open the file, you can save it in different ways. Here is some reference code:

def saveFile(data):
    save_path = 'D:\\temp.out'
    f_obj = open(save_path, 'wb')  # 'wb' means open the file for writing in binary mode
    f_obj.write(data)
    f_obj.close()

# The crawler code is skipped here
# ...
# Suppose the crawled data is in the variable dat
# Save dat to the D drive
saveFile(dat)
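Similarly, if you want to save the decoded text instead of the raw bytes, just open the file in text mode. A small sketch (the path and encoding here are only examples, not from the original article):

def saveText(text):
    save_path = 'D:\\temp.txt'
    # 'w' opens the file in text mode; give an encoding so the string is written out correctly
    f_obj = open(save_path, 'w', encoding='utf-8')
    f_obj.write(text)
    f_obj.close()

# e.g. saveText(dat.decode())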

Next time we will use Python to crawl information that requires logging in to view. Before that, I will get a bit more familiar with Fiddler myself; I hope those learning along will also install Fiddler in advance and play with it.

