Python3 Crawler 5 -- Parsing the Robots Protocol

Source: Internet
Author: User

1. The Robots protocol

The Robots protocol tells search engines and crawlers which pages may be crawled and which may not. It is usually stored in a robots.txt file located in the root directory of the site.

A sample robots.txt:

User-agent: *
Disallow: /
Allow: /public/

User-agent names the crawler the rules apply to; * means they apply to every crawler. Disallow lists paths that must not be crawled; / means no page may be crawled, while leaving the value empty means everything may be crawled. Allow lists paths that may be crawled as exceptions to Disallow; here only /public/ is crawlable.

2. robotparser

The urllib.robotparser module is used specifically to parse robots.txt files. Import its RobotFileParser class:

from urllib.robotparser import RobotFileParser

The main methods of RobotFileParser are:

set_url() sets the URL of the robots.txt file. If you already passed the URL in when creating the RobotFileParser object, you do not need to call this method.

read() fetches the robots.txt file and analyzes it. Note that this method is what actually performs the fetch and parse: if you do not call it, every subsequent can_fetch() check will return False, so remember to call it. It returns nothing; it only performs the read.
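The "forgot to call read()" pitfall can be shown without any network access, since set_url() only stores the URL (the jianshu.com addresses are the ones used later in this article):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')
# read() has not been called, so no rules have been loaded yet;
# can_fetch() therefore conservatively answers False for everything.
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))  # False
```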

parse() parses the contents of a robots.txt file directly: it takes a list of robots.txt lines and analyzes them according to robots.txt syntax rules.

can_fetch() takes two arguments, a User-agent string and the URL to crawl, and returns True or False depending on whether that crawler is allowed to fetch the URL.
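Together, parse() and can_fetch() can be exercised offline by feeding rules in as a list of lines. This sketch uses a made-up example.com site and rules, not any real robots.txt:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block /search, but allow /public/ as an exception.
rules = """\
User-agent: *
Disallow: /search
Allow: /public/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch('*', 'http://example.com/public/page.html'))  # True
print(rp.can_fetch('*', 'http://example.com/search?q=python'))   # False
```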

mtime() returns the time robots.txt was last fetched and analyzed. This matters for crawlers that run for a long time, which may need to check periodically and fetch the latest robots.txt.

modified(), also useful for long-running crawlers, sets the time robots.txt was last fetched and analyzed to the current time.
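A minimal sketch of how mtime() and modified() fit a long-running crawler's refresh check (the 24-hour threshold is an arbitrary choice for illustration):

```python
import time
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
print(rp.mtime())   # 0 -- robots.txt has never been fetched and analyzed

rp.modified()       # record "now" as the last fetch-and-analysis time
print(time.time() - rp.mtime() < 5)  # True -- the timestamp was just set

# A long-running crawler can re-fetch robots.txt once the copy is stale:
if time.time() - rp.mtime() > 24 * 3600:
    rp.read()       # refresh the cached rules (performs a network request)
```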

A complete example:

from urllib.robotparser import RobotFileParser
from urllib.request import urlopen

rp = RobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')
# Equivalently: rp = RobotFileParser('http://www.jianshu.com/robots.txt')
rp.read()
# You can also feed the file to parse() yourself:
# rp.parse(urlopen('http://www.jianshu.com/robots.txt').read().decode('utf-8').split('\n'))
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
print(rp.can_fetch('*', 'http://www.jianshu.com/search?q=python&page=1&type=collections'))
