HTML Download Module
This module downloads the content of a web page from a given URL. It uses the requests-html library, adds retry logic with a maximum number of retries, and sets a timeout on each request so that a server that never responds cannot leave the program hanging.
The returned status code determines whether the request succeeded: on success the module returns the page source, otherwise it retries. An exception during the request is also handled by retrying.
from requests_html import HTMLSession
from fake_useragent import UserAgent
import requests
import time
import random


class Gethtml():
    def __init__(self, url="http://wwww.baidu.com"):
        self.ua = UserAgent()
        self.url = url
        # mock_browser=True makes the session send a browser-like User-Agent.
        # Regarding headers, there is a default: self.headers = default_headers()
        self.session = HTMLSession(mock_browser=True)

    def get_source(self, url, retry=1):
        # Try at most three times, then give up.
        while retry <= 3:
            try:
                req = self.session.get(url, timeout=10)
                if req.status_code == requests.codes.ok:
                    return req.text
            except Exception:
                print('Unfortunately, an unknown error happened, please wait 0-6 seconds')
            # Wait 0-6 seconds before the next attempt.
            time.sleep(random.randint(0, 6))
            retry += 1
        print("retry more than three times, jump out of the loop")
        return None
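The retry rule described above (retry on a non-OK status code or on an exception, give up after a maximum number of attempts) can also be separated from the HTTP library so it can be tried without network access. The following is only a minimal sketch under assumptions: `fetch_with_retry`, `FakeResponse` and `make_flaky_get` are hypothetical names used for illustration and are not part of requests-html or of the original module, and the fixed `delay` stands in for the random 0-6 second wait.

```python
import time


def fetch_with_retry(get, url, max_retries=3, timeout=10, wait=time.sleep, delay=1):
    """Return the page text, or None if every attempt fails.

    `get` is any callable shaped like session.get(url, timeout=...) whose
    result has `status_code` and `text` attributes (hypothetical helper).
    """
    for attempt in range(1, max_retries + 1):
        try:
            resp = get(url, timeout=timeout)
            if resp.status_code == 200:
                return resp.text
        except Exception:
            # Any exception counts as a failed attempt and triggers a retry.
            pass
        if attempt < max_retries:
            wait(delay)
    return None


class FakeResponse:
    """Stand-in for a response object, used only for this offline demo."""

    def __init__(self, status_code, text=""):
        self.status_code = status_code
        self.text = text


def make_flaky_get(outcomes):
    """Build a stand-in for session.get that replays scripted outcomes."""
    calls = iter(outcomes)

    def get(url, timeout=None):
        outcome = next(calls)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome

    return get


# One exception, one HTTP 500, then success on the third attempt.
flaky = make_flaky_get(
    [TimeoutError("slow"), FakeResponse(500), FakeResponse(200, "<html>ok</html>")]
)
result = fetch_with_retry(flaky, "http://example.com", wait=lambda s: None)
```

With a real session, the same function could presumably be called as fetch_with_retry(session.get, url), since session.get accepts a timeout keyword argument.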
HTML download module of Python crawler module