The Code of Ethics for Python Crawlers: the robots Protocol

Source: Internet
Author: User

Before writing a crawler, check the target website's robots.txt file. It tells you which pages to avoid crawling, which helps you steer clear of the legal problems that scraping copyrighted data can cause later.


The robots protocol tells crawlers, including search engine crawlers, which pages may be crawled and which may not. It is only a conventional code of conduct: nothing enforces it, and compliance is entirely voluntary. As an ethical technician, abide by the robots protocol and help build a better internet environment.
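
To make the allow/disallow idea concrete, the sketch below feeds a made-up robots.txt to Python 3's standard urllib.robotparser module and asks which URLs each user agent may fetch. The rules, the example.com URLs, and the BadCrawler agent name are illustrative assumptions, not taken from any real site.

```python
import urllib.robotparser

# Hypothetical robots.txt content, used only for illustration.
SAMPLE_RULES = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
# parse() takes the file's lines directly, so no network access is needed.
rp.parse(SAMPLE_RULES.splitlines())

# BadCrawler is banned from the whole site.
banned = rp.can_fetch("BadCrawler", "http://example.com/index.html")
# Any other agent may fetch public pages but not /private/.
public_ok = rp.can_fetch("wswp", "http://example.com/index.html")
private_ok = rp.can_fetch("wswp", "http://example.com/private/data.html")

print(banned, public_ok, private_ok)  # False True False
```

The parser only reports what the file asks for. Whether a crawler honors the answer is up to its author.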


A website's robots file is usually found at the homepage address followed by /robots.txt, for example www.taobao.com/robots.txt
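
As a minimal sketch of that convention, the helper below derives the robots.txt address from any page URL on the same site. The function name robots_url and the sample page path are hypothetical, and the URL is assumed to include a scheme such as http:// or https://.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the conventional robots.txt address for the site hosting page_url."""
    parts = urlsplit(page_url)
    if not parts.scheme or not parts.netloc:
        # Without a scheme, urlsplit cannot tell the host from the path.
        raise ValueError("expected an absolute URL such as http://host/path")
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://www.taobao.com/some/page.html?id=1"))
```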


Below is a small program that checks whether the user agent is permitted by the site's robots file and downloads the webpage only if it is. It is written for Python 2, which provides the robotparser and urllib2 modules:


    import robotparser
    import urllib2


    def download(url, user_agent='wswp', num_retries=2):
        print 'Downloading:', url
        headers = {'User-agent': user_agent}
        request = urllib2.Request(url, headers=headers)
        try:
            html = urllib2.urlopen(request).read()
        except urllib2.URLError as e:
            print 'Download error:', e.reason
            html = None
            if num_retries > 0:
                # Retry only on 5xx server errors
                if hasattr(e, 'code') and 500 <= e.code < 600:
                    return download(url, user_agent, num_retries - 1)
        return html


    def can_be_download(url, user_agent='wswp'):  # set a default user agent
        rp = robotparser.RobotFileParser()
        # Get the homepage address; assumes url starts with a scheme such as http://
        host = url.split('/')[2]
        rp.set_url('http://' + host + '/robots.txt')  # robots.txt address
        rp.read()
        # Download only if the robots file allows this user agent to fetch url
        if rp.can_fetch(user_agent, url):
            return download(url, user_agent)
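
The listing above relies on the Python 2 modules robotparser and urllib2. In Python 3 the same functionality lives in urllib.robotparser, urllib.request, and urllib.error, so a rough port might look like the sketch below. The optional parser argument is an addition of this sketch, not part of the original program; it lets a caller pass in rules that are already loaded, which is convenient for trying the check without network access.

```python
import urllib.error
import urllib.request
import urllib.robotparser
from urllib.parse import urlsplit


def download(url, user_agent="wswp", num_retries=2):
    """Fetch url and return the body as bytes, or None on failure."""
    print("Downloading:", url)
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        return urllib.request.urlopen(request).read()
    except urllib.error.URLError as e:
        print("Download error:", e.reason)
        # Only HTTPError carries a status code; retry on 5xx server errors.
        code = getattr(e, "code", None)
        if num_retries > 0 and code is not None and 500 <= code < 600:
            return download(url, user_agent, num_retries - 1)
        return None


def can_be_download(url, user_agent="wswp", parser=None):
    """Download url only if the site's robots.txt allows user_agent to fetch it."""
    if parser is None:
        # Assumes url is absolute, e.g. http://host/path
        parts = urlsplit(url)
        parser = urllib.robotparser.RobotFileParser()
        parser.set_url(parts.scheme + "://" + parts.netloc + "/robots.txt")
        parser.read()  # fetches robots.txt over the network
    if parser.can_fetch(user_agent, url):
        return download(url, user_agent)
    return None
```

Calling can_be_download('http://example.com/page.html') would fetch that site's robots.txt first and download the page only when the rules permit it. A disallowed URL returns None without any request for the page itself.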

