You need Python with the requests module installed. If requests is not installed, run the following command:
pip3 install requests
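Before running the crawler, it may help to confirm that the module is importable. The following is a minimal sketch; the helper name `is_installed` is hypothetical and only uses the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be found by this interpreter."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("requests"):
        print("requests is available")
    else:
        print("requests is missing; run: pip3 install requests")
```

Running it with the same interpreter that will run the crawler avoids the common case where pip3 installed the module for a different Python.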
Take the recently popular novel "Demon Road founder" (Mo Dao Zu Shi) as an example.
Here is the entire script:
import re

import requests


def get_content(url, timeout=10):
    # Download the page and return its decoded text.
    req = requests.get(url=url, timeout=timeout)
    return req.text


def get_title(html, re_title):
    # The <title> text is split on '_'; the first part is kept as the chapter title.
    ret = re_title.search(html)
    if ret:
        ret = ret.group()
        tmp = ret.split('_')[0]
        tmp = tmp.replace('<title>', '')
        tmp = tmp.strip()
        return tmp


def get_body(html, re_body):
    ret_body = re_body.search(html)
    if ret_body:
        ret = ret_body.group()
        # Drop everything up to the last </script> inside the matched block.
        tmp = re_clear_header.sub(r'\2', ret)
        # '&nbsp;' is presumed here; the source listing shows a plain space.
        tmp = tmp.replace(r'&nbsp;', ' ').replace(r'<br /><br />', '\n').replace(r'<br />', '\n')
        # Site footer string, shown as machine-translated in the source;
        # the literal on the live page is presumably the site's Chinese name.
        tmp = tmp.replace(r'2k novel reading Net</p>', '\n')
        return tmp


if __name__ == '__main__':
    re_title = re.compile(r'<title>(.*?)</title>')
    re_body = re.compile(r'<p class="Text">(.*?)</p>', re.S)
    re_clear_header = re.compile(r'(.*</script>)(.*)', re.S)
    first_page = 19613532
    # The page count is missing from the source listing (placeholder value);
    # set it to the number of chapters to fetch.
    page_count = 1
    with open('mdzs.txt', 'w', encoding='utf-8') as mdzs:
        for i in range(page_count):
            page = first_page + i
            url = r'https://www.2kxs.com/xiaoshuo/96/96717/{}.html'.format(page)
            try:
                html = get_content(url)
                title = get_title(html, re_title)
                mdzs.write(title + '\n')
                body = get_body(html, re_body)
                mdzs.write(body)
                print('{} is success'.format(url))
            except Exception as e:
                print('url: {}, error: {}'.format(url, e))
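The extraction step of the script can be tried offline, without touching the site. The sketch below applies the same three regular expressions to a made-up page. `SAMPLE_HTML`, `extract_title`, and `extract_body` are hypothetical names, and the sample markup (including the `&nbsp;` entities) is an assumption about what a chapter page looks like, not a copy of the real one.

```python
import re

# Hypothetical sample page; the real site's markup may differ.
SAMPLE_HTML = (
    "<html><head><title>Chapter 1_Demo Novel_Demo Site</title></head>"
    '<body><p class="Text"><script>ads();</script>'
    "&nbsp;&nbsp;First line.<br /><br />&nbsp;&nbsp;Second line.<br />Third line.</p>"
    "</body></html>"
)

# Same patterns as in the script above.
RE_TITLE = re.compile(r"<title>(.*?)</title>")
RE_BODY = re.compile(r'<p class="Text">(.*?)</p>', re.S)
RE_CLEAR_HEADER = re.compile(r"(.*</script>)(.*)", re.S)


def extract_title(html):
    """Return the part of <title> before the first '_', or None."""
    match = RE_TITLE.search(html)
    if match is None:
        return None
    return match.group(1).split("_")[0].strip()


def extract_body(html):
    """Return the chapter text with tags and entities replaced, or None."""
    match = RE_BODY.search(html)
    if match is None:
        return None
    text = match.group(1)
    # Keep only what follows the last </script>; unchanged if there is none.
    text = RE_CLEAR_HEADER.sub(r"\2", text)
    text = text.replace("&nbsp;", " ")
    text = text.replace("<br /><br />", "\n").replace("<br />", "\n")
    return text


if __name__ == "__main__":
    print(extract_title(SAMPLE_HTML))
    print(extract_body(SAMPLE_HTML))
```

Unlike the script, this version reads the capture group with `group(1)`, so the surrounding `<p>` tags never enter the text, and it returns `None` explicitly when a pattern does not match.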
The target is a novel-hosting site whose page layout and page URLs are fairly regular, so the crawler is relatively simple to implement.
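The regular URL scheme is what makes the loop in the script work: each chapter URL is built from a page number that increases by one. A minimal sketch of that idea follows; the helper name `chapter_urls` is hypothetical, the count of 3 is arbitrary, and it assumes chapter page numbers really are consecutive, which the site does not guarantee.

```python
# URL template and first page number taken from the script above.
URL_TEMPLATE = "https://www.2kxs.com/xiaoshuo/96/96717/{}.html"
FIRST_PAGE = 19613532


def chapter_urls(first_page, count, template=URL_TEMPLATE):
    """Yield `count` chapter URLs, assuming consecutive page numbers."""
    for offset in range(count):
        yield template.format(first_page + offset)


if __name__ == "__main__":
    for url in chapter_urls(FIRST_PAGE, 3):
        print(url)
```

If the page numbers turn out to have gaps, a more robust approach would be to read the chapter links from the novel's index page instead of counting upward.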
"Crawl" a novel in Python