Python web crawler (iv)

Tags: python, web crawler

About the Robots protocol

The Robots protocol, also known as the crawler protocol or Robots Exclusion Protocol, is a standard used to tell crawlers and search engines which pages may be crawled and which may not. Technically, nothing stops a crawler from ignoring it: if we crawl without any restrictions, especially with a distributed, multi-threaded crawler, we could bring the target site down (the probability is small, but it exists). The point is that we should keep our crawlers well-behaved and fetch only the data we are allowed to, so that we do not break any laws.
We can append /robots.txt to any site's root URL to see what restrictions it places on crawlers. Here is an example: https://www.zhihu.com/robots.txt

The file consists of User-agent and Disallow directives. Disallow specifies a directory that may not be crawled; in Zhihu's case, the file forbids all crawlers from accessing the directories it lists.
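For illustration, a robots.txt entry might look like the following (these rules are made up for the example; they are not Zhihu's actual file):

User-agent: *
Disallow: /login
Disallow: /admin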

Urllib's Robotparser

We can use the robotparser module to parse robots.txt. The module provides a class called RobotFileParser, which determines, based on a site's robots.txt file, whether a crawler has permission to fetch a given page.

urllib.robotparser.RobotFileParser(url='')
# Just pass the link to robots.txt into the constructor,
# or leave it empty (the default) and set it later with set_url().
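Putting this together, a minimal sketch might look like the following (the question URL is just a hypothetical page used for illustration):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.zhihu.com/robots.txt')  # point the parser at the site's robots.txt
rp.read()                                       # download and parse the file
# ask whether a crawler with this User-agent may fetch a given page
print(rp.can_fetch('*', 'https://www.zhihu.com/question/123456'))
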
About requests

With urllib, handling web authentication or cookies requires building Opener and Handler objects; the requests library gives us much more convenient ways to do the same things.
urllib's urlopen() actually requests a page with a GET. In requests we simply call the get() method, and other request types such as POST or HEAD are just as direct: requests.post() and requests.head(), as sketched below.
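For example (httpbin.org is used here only as an illustrative test endpoint; it is not part of the original article):

import requests

# POST form data and inspect the response status
r = requests.post('http://httpbin.org/post', data={'key': 'value'})
print(r.status_code)

# HEAD request: fetch only the response headers
r = requests.head('http://httpbin.org/get')
print(r.headers)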

GET request

r = requests.get(url, params=data, headers=headers)
The requested URL is therefore url plus the query string built from data. In addition, if the page returns JSON (the body is actually of type str, but its content is in JSON format), we can call the json() method to parse it directly into a Python dictionary. A short sketch of both points follows.
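A minimal sketch, again using httpbin.org purely as an example endpoint:

import requests

data = {'name': 'alice', 'age': 22}
r = requests.get('http://httpbin.org/get', params=data)
print(r.url)     # the params dict is appended as a query string: ...?name=alice&age=22
print(r.json())  # parses the JSON body into a Python dict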

File Upload and Download

Requests can be used to simulate submitting files:

import requests

files = {'file': open('favicon.ico', 'rb')}  # the file must be in the same directory as the script
r = requests.post(url, files=files)
print(r.text)

Similarly, you can download files using requests:

import requestsr = requests.get("https://github.com/favicon.ico")with open('favicon.ico', 'wb') as f:    f.write(r.content)
Cookies

Handling cookies is much simpler than with urllib: just access the cookies attribute of the response, which is a RequestsCookieJar:

import requests

r = requests.get('https://www.baidu.com')
print(r.cookies)
for key, value in r.cookies.items():
    print(key + '=' + value)

We can also keep ourselves logged in: copy the cookies from the site while logged in and send them in the request headers:

import requests

headers = {
    'Cookie': 'q_c1=31653b264a074fc9a57816d1ea93ed8b|1474273938000|1474273938000; d_c0="AGDAs254kAqPTr6NW1U3XTLFzKhMPQ6H_nc=|1474273938"; __utmv=51854390.100-1|2=registration_date=20130902=1^3=entry_date=20130902=1;a_t="2.0AACAfbwdAAAXAAAAso0QWAAAgH28HQAAAGDAs254kAoXAAAAYQJVTQ4FCVgA360us8BAklzLYNEHUd6kmHtRQX5a6hiZxKCynnycerLQ3gIkoJLOCQ==";z_c0=Mi4wQUFDQWZid2RBQUFBWU1DemJuaVFDaGNBQUFCaEFsVk5EZ1VKV0FEZnJTNnp3RUNTWE10ZzBRZFIzcVNZZTFGQmZn|1474887858|64b4d4234a21de774c42c837fe0b672fdb5763b0',
    'Host': 'www.zhihu.com',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36',
}
r = requests.get('https://www.zhihu.com', headers=headers)
print(r.text)
Session Maintenance

Earlier, when I crawled my school's information system, every run of the program failed to fetch the data; the error message said that a duplicate login had been detected. The code clearly logged in successfully and then simply continued with get() requests, so how could it go wrong?
In fact, each call to requests.get() (or any other top-level method) is equivalent to opening a brand-new browser; the calls are completely unrelated to one another. So it is not the case that the first post() successfully simulates the login and the second get() continues on top of that logged-in state; instead, the second call opens a fresh browser and starts a new operation, which is why the error occurs.
One workaround is to set the same cookies on both requests. That works, but it is cumbersome and spoils the simplicity of the code. What we really need is to stay in the same session window, and the Session object does exactly that.

import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
r = s.get('http://httpbin.org/cookies')
print(r.text)

The returned result is:

{  "cookies": {    "number": "123456789"  }}

This shows that the cookie we set, number=123456789, was successfully carried over to the second request.

