1. Robots protocol
The robots protocol tells search engines and crawlers which pages may be crawled and which may not. The rules are usually stored in a file named robots.txt, located in the root directory of the site.
A sample robots.txt:

User-agent: *    # the name of the crawler the rules apply to; * means they apply to any crawler
Disallow: /      # a directory that must not be crawled; / forbids all directories, while an empty value allows everything
Allow: /public/  # a directory that may be crawled even though it falls under a Disallow rule
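A minimal sketch of how these rules behave, using Python's urllib.robotparser (introduced in the next section) on an in-memory copy of the sample rules; the example.com URLs are made up for illustration. Note that Python's parser applies rules in file order, first match wins, so the Allow line is placed before the blanket Disallow:

```python
from urllib.robotparser import RobotFileParser

# Feed the sample rules directly to the parser (no network access).
rules = [
    "User-agent: *",
    "Allow: /public/",   # listed first: Python's parser uses first-match semantics
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "http://example.com/private/page.html"))  # False, matches Disallow: /
print(rp.can_fetch("*", "http://example.com/public/page.html"))   # True, matches Allow: /public/
```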
2. robotparser
The robotparser module is dedicated to parsing robots.txt files. Its parser class is imported with:

from urllib.robotparser import RobotFileParser

The main methods of RobotFileParser are:
set_url() sets the link to the robots.txt file. If you already passed the link in when creating the RobotFileParser object, you do not need to call this method.
read() fetches the robots.txt file and parses it. Note that this method performs the actual fetch-and-parse step: if you never call it, every subsequent can_fetch() judgment will be False. It returns nothing; it only performs the read.
parse() parses the contents of a robots.txt file. It takes the lines of the file as its argument and parses them according to the robots.txt syntax rules.
can_fetch() takes two parameters: a user-agent and the URL to crawl. It returns True or False, indicating whether that crawler is allowed to fetch the URL.
mtime() returns the time robots.txt was last fetched and parsed. This matters for crawlers that analyze and crawl over long periods, since they may need to re-check regularly to pick up the latest robots.txt.
modified(), which is likewise useful for long-running crawlers, records the current time as the time robots.txt was last fetched and parsed.
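A small sketch of how mtime() and modified() can support the re-checking described above. No network access is needed here; the one-day threshold is an arbitrary choice for this example:

```python
import time
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()

print(rp.mtime())  # prints 0: nothing has been fetched or recorded yet

rp.modified()      # record the current time as the last fetch/parse time

# A long-running crawler can compare the recorded time against "now"
# and re-fetch robots.txt when the cached copy is too old.
age = time.time() - rp.mtime()
if age > 24 * 3600:  # older than one day (arbitrary threshold)
    rp.read()        # would re-fetch and re-parse robots.txt
```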
from urllib.robotparser import RobotFileParser
from urllib.request import urlopen

rp = RobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')
# You can also set the URL directly: rp = RobotFileParser('http://www.jianshu.com/robots.txt')
rp.read()
# Or parse the file manually:
# rp.parse(urlopen('http://www.jianshu.com/robots.txt').read().decode('utf-8').split('\n'))
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
print(rp.can_fetch('*', 'http://www.jianshu.com/search?q=python&page=1&type=collections'))