A summary of crawler camouflage with urllib and urllib2

Source: Internet
Author: User

From a site administrator's point of view: if everyone hit the site with crawlers at the same time, could the server withstand the load? Certainly not. A severe overload can bring the server down, and for a commercial site even one second of downtime is a serious loss that no administrator can accept. So what does the administrator do to protect the web server? One approach is to write a script: when an IP is detected accessing too fast, or the request headers do not look like a browser's, deny service or block that IP. That reduces the load and keeps the server running normally.
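As a rough illustration (not from the original post), a hypothetical server-side check of this kind might look like the sketch below; the thresholds, the time window, and the function name are all invented for the example.

# -*- coding: utf-8 -*-
# Hypothetical sketch of the kind of filter described above: block clients
# that request too fast or whose User-Agent does not look like a browser.
# All names and thresholds here are made up for illustration.
import time
from collections import defaultdict

REQUESTS_PER_MINUTE_LIMIT = 120          # made-up rate threshold
BROWSER_KEYWORDS = ('Mozilla', 'Chrome', 'Safari', 'Firefox')

request_log = defaultdict(list)          # ip -> list of recent request timestamps

def allow_request(ip, user_agent):
    now = time.time()
    # keep only the timestamps from the last 60 seconds
    request_log[ip] = [t for t in request_log[ip] if now - t < 60]
    request_log[ip].append(now)

    if len(request_log[ip]) > REQUESTS_PER_MINUTE_LIMIT:
        return False                     # too fast: deny service
    if not user_agent or not any(k in user_agent for k in BROWSER_KEYWORDS):
        return False                     # header does not look like a browser
    return True

print allow_request('1.2.3.4', 'Python-urllib/2.7')                         # blocked: not a browser UA
print allow_request('1.2.3.4', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)')  # allowed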

So the server has been hardened, but note that this hardening targets crawlers. If you access the site through a browser as an ordinary user, the server will not intercept or block you. Why would it dare to? You are a customer, and a site that blocks its customers does not want to stay in business. So browser users can always access the site normally.

So can you think of a way around this? Yes: make the program pretend to be a browser. As said above, the server inspects the header information; if the traffic does not look like normal browser traffic, the program is denied service.

So what is this "message", and what is the header information? I will not explain it in detail here, since that involves the HTTP protocol, the TCP/IP three-way handshake, and other networking basics; if you are interested, search Baidu or Google yourself.

This post will not stray too far; it only covers the relevant point: how to view the header information.

User-Agent

In fact, some of you may be wondering: how does the server know whether we are using a program or a browser? What does it use to judge?

I am using Firefox. Right-click the page and choose Inspect Element (some browsers call it Inspect or Check):

Switch to the Network tab:

The captured requests appear:

Double-click one of them and the detailed information appears on the right; select the Headers tab:

Locate the Request Headers section; the User-Agent entry there is our header information:

It is displayed as:

User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0

This is our header information. Notice what it reveals: the operating system kernel is NT 6.1, which corresponds to 64-bit Windows 7, and the browser is Firefox 56. So a browser visit can be identified, and naturally a visit from a program can be identified as well, however the server chooses to do it. In fact, if I were an administrator (and if I worked in my own field, I really would be one), my approach would be to define the legitimate header as "Mozilla/5.0 ..." and so on, and simply reject everything else.

So far we have only used the urlopen method of the urllib module to open a web page. The object it returns has functions for viewing the header information, although what they show differs somewhat from the browser view. Note: the following code is Python 2. Python 3 no longer has separate urllib and urllib2 modules; in Python 3 the related modules were reorganized into the urllib package.
# -*- coding: utf-8 -*-
import urllib

url = 'http://www.baidu.com'       # Baidu's URL
html = urllib.urlopen(url)         # open the page with urllib's urlopen method
print(dir(html))                   # view the methods of the html object
print(urllib.urlopen)              # view urllib.urlopen itself
print(urllib.urlopen())            # view urllib.urlopen called without arguments

Results:
Traceback (most recent call last):
  File "D:\programme\PyCharm 5.0.3\helpers\pycharm\utrunner.py", line 121, in <module>
    modules = [loadSource(a[0])]
  File "D:\programme\PyCharm 5.0.3\helpers\pycharm\utrunner.py", line ..., in loadSource
    module = imp.load_source(moduleName, fileName)
  File "G:\programme\Python\python project\test.py", line 8, in <module>
    print(urllib.urlopen())
TypeError: urlopen() takes at least 1 argument (0 given)
['__doc__', '__init__', '__iter__', '__module__', '__repr__', 'close', 'code', 'fileno', 'fp', 'getcode', 'geturl', 'headers', 'info', 'next', 'read', 'readline', 'readlines', 'url']
<function urlopen at 0x0297fd30>
Note the difference: urllib.urlopen is just a function object, and only after calling urllib.urlopen(...) do you get a file-like object whose methods you can use. urlopen() must be passed at least one argument, otherwise it raises an error, as shown above. In the previous chapter, the read() method of such an instance was already used to read the content of the page.

The object returned by urlopen() provides the following methods and attributes:

    • read(), readline(), readlines(), fileno(), close(): used exactly like the corresponding file-object methods
    • info(): returns an httplib.HTTPMessage object representing the headers returned by the remote server
    • getcode(): returns the HTTP status code. For an HTTP request, 200 means the request completed successfully; 404 means the URL was not found
    • geturl(): returns the requested URL
    • headers: the response headers returned by the server (the same object info() returns)
    • code: the status code
    • url: the requested URL
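As a minimal sketch (Python 2, assuming www.baidu.com is reachable, and not part of the original post), here is what a few of these look like in use:

# -*- coding: utf-8 -*-
# Minimal sketch (Python 2): try a few of the methods listed above.
import urllib

url = 'http://www.baidu.com'
html = urllib.urlopen(url)

print html.getcode()      # HTTP status code, e.g. 200 on success
print html.geturl()       # the URL that was actually fetched
print html.info()         # httplib.HTTPMessage holding the response headers
print html.read(200)      # first 200 bytes of the page body, read like a file
html.close()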

OK, you can study the details yourself. The real focus of this post has finally arrived: forging the header information.

Forge Header Information

Since urllib does not provide a way to forge the header information, a new module is used here: urllib2.

# -*- coding: utf-8 -*-
import urllib2

url = 'http://www.baidu.com'
# the header information must be a dictionary
head = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
html = urllib2.Request(url, headers=head)
result = urllib2.urlopen(html)
print result.read()

Or you can do this:

# -*- coding: utf-8 -*-
import urllib2

url = 'http://www.baidu.com'
html = urllib2.Request(url)
# note the different format here compared with the dictionary version above
html.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0')
result = urllib2.urlopen(html)
print result.read()

The result is the same, so I will not show it.

The request header information can be forged with either of the methods above.
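For readers on Python 3, recall the note above that the urllib and urllib2 modules were reorganized into the urllib package; a rough equivalent of the examples above (a sketch of mine, not part of the original post) would be:

# -*- coding: utf-8 -*-
# Rough Python 3 equivalent (sketch): Request/urlopen live in urllib.request.
import urllib.request

url = 'http://www.baidu.com'
head = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
req = urllib.request.Request(url, headers=head)
result = urllib.request.urlopen(req)
print(result.read())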

Then you may ask: how do I know the forgery succeeded? And what exactly is sent when you do not forge the header? Here I recommend a packet-capture tool, Fiddler; with it you can see exactly what the request headers look like. I will not demonstrate it here, so look into it yourself. And I can assure you the forgery was successful.
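If you do not want to install a packet-capture tool, one alternative (a sketch of mine, assuming the public echo service httpbin.org is reachable) is to send the request to an endpoint that simply reports back the User-Agent it received:

# -*- coding: utf-8 -*-
# Sketch: verify the forged User-Agent without a packet-capture tool by asking
# an echo service (httpbin.org) to report the User-Agent it received.
import urllib2

url = 'http://httpbin.org/user-agent'   # echoes the caller's User-Agent as JSON
ua = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'

plain = urllib2.urlopen(urllib2.Request(url))
print plain.read()                      # without forging: something like Python-urllib/2.7

forged = urllib2.urlopen(urllib2.Request(url, headers={'User-Agent': ua}))
print forged.read()                     # with forging: the Firefox string above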

With this, we have upgraded the crawler code so it can handle ordinary anti-crawler restrictions.
