1. The crawler code for the ancient poetry website is as follows:
# encoding: utf-8
import requests
import re
import json


def parse_page(url):
    # 1. Request the page
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.62 Safari/537.36"
    }
    response = requests.get(url, headers=headers)
    text = response.text

    # 2. Parse the page
    # re.DOTALL lets "." match newlines, so a pattern can span several lines of HTML
    titles = re.findall(r'<div\sclass="cont">.*?<b>(.*?)</b>', text, re.DOTALL)
    # print(json.dumps(titles, ensure_ascii=False))
    times = re.findall(r'<p\sclass="source">.*?<a\s.*?>(.*?)</a>', text, re.DOTALL)
    # print(json.dumps(times, ensure_ascii=False))
    authors = re.findall(r'<p class="source">.*?<a.*?<a.*?>(.*?)</a>', text, re.DOTALL)
    poems_ret = re.findall(r'<div class="contson" id=.*?>(.*?)</div>', text, re.DOTALL)
    poems = []
    for poem in poems_ret:
        # Strip the remaining HTML tags from the poem body
        temp = re.sub("<.*?>", "", poem)
        poems.append(temp.strip())

    # for index, value in enumerate(titles):
    #     print(titles[index])
    #     print(times[index])
    #     print(authors[index])
    #     print(poems[index])
    #     print("*" * 50)

    # zip() performs the combination above automatically
    results = []
    for value in zip(titles, times, authors, poems):
        title, time, author, poem = value
        result = {
            "title": title,
            "dynasty": time,
            "author": author,
            "original": poem
        }
        print(result["title"])
        results.append(result)
    # print(results)


def main():
    # Note: url_base contains no "{}" placeholder, so format(i) returns it
    # unchanged and the same page is requested on every iteration.
    url_base = "https://www.xzslx.net/gushi/"
    for i in range(1, 11):
        url = url_base.format(i)
        print(" " * 20 + "Beautiful Ancient Poetry" + " " * 20)
        print("*" * 50)
        parse_page(url)
        print("*" * 50)


if __name__ == '__main__':
    main()
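
The parsing step in parse_page can be tried without any network access. The following is only a minimal sketch: the HTML fragment is a hypothetical sample written for this example (it is not copied from the real site), and the function name extract_poems is an assumption. It applies the same four regular expressions with re.DOTALL and combines the results with zip.

```python
import re

# Hypothetical HTML fragment, written only for this example.
# The real page layout may differ.
SAMPLE_HTML = """
<div class="cont">
  <p><a href="/p1"><b>Poem One</b></a></p>
  <p class="source"><a href="/d1">Tang</a><span>:</span><a href="/a1">Li Bai</a></p>
  <div class="contson" id="c1">first line,<br />second line.</div>
</div>
<div class="cont">
  <p><a href="/p2"><b>Poem Two</b></a></p>
  <p class="source"><a href="/d2">Song</a><span>:</span><a href="/a2">Su Shi</a></p>
  <div class="contson" id="c2"><p>only line</p></div>
</div>
"""


def extract_poems(text):
    """Extract title, dynasty, author and body for each poem in the HTML text."""
    # re.DOTALL makes "." match newlines, so each pattern can cross line breaks.
    titles = re.findall(r'<div\sclass="cont">.*?<b>(.*?)</b>', text, re.DOTALL)
    dynasties = re.findall(r'<p\sclass="source">.*?<a\s.*?>(.*?)</a>', text, re.DOTALL)
    authors = re.findall(r'<p class="source">.*?<a.*?<a.*?>(.*?)</a>', text, re.DOTALL)
    bodies = re.findall(r'<div class="contson" id=.*?>(.*?)</div>', text, re.DOTALL)
    # Remove leftover tags such as <br /> or <p> from each poem body.
    poems = [re.sub(r"<.*?>", "", body).strip() for body in bodies]
    # zip() pairs the i-th title, dynasty, author and poem together.
    return [
        {"title": t, "dynasty": d, "author": a, "original": p}
        for t, d, a, p in zip(titles, dynasties, authors, poems)
    ]


if __name__ == "__main__":
    for item in extract_poems(SAMPLE_HTML):
        print(item["title"])
```

Note that zip stops at the shortest list, so if one pattern fails to match for a poem, the remaining fields can silently shift out of alignment. Comparing the lengths of the four lists before zipping is a cheap check.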
2. The output is:
c:\ddd\python22\python.exe c:/pycharm/dytt_spider/poems.py
Ancient Poetry
**************
Guanshan Moon
out of the Tianshan Mountains, boundless Sea of clouds. Long wind a few xxx, blow degrees Yumen guan. Han Xia Baideng Road, Hu Peep at the Blue Bay. [2] The origin of the war, no one yet. Shu-yi, homesickness more bitter Yan. Tall buildings When this night, sighs should not be idle.
Ancient Poetry
********************************
Longxi Line four · Second
oath sweep hun disregard body, 5,000 mink brocade Hu Yu lost. Poor and uncertain river side bone, still is the Spring maiden dream person!
Ancient Poetry
**************************************************
Chang'e (Chang'e should regret stealing elixir)
Mica screen candle Shadow deep, River gradually fall Xiao Star sink. Chang'e should regret stealing elixir, Bihaiqingtian heart.
**************************************************
Process finished with exit code 0
This is one way to write a web crawler for an ancient poetry website and use it to grab the site's content.
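
The loop in main() calls url_base.format(i), but str.format only changes a string that contains a "{}" placeholder, so with the URL as written every iteration requests the same page. The sketch below shows how page URLs could be built. The template on example.com and the helper name build_page_urls are hypothetical, since the real site's pagination pattern is not given here.

```python
def build_page_urls(template, first_page, last_page):
    """Return one URL per page number by filling the "{}" placeholder."""
    if "{}" not in template:
        # Without a placeholder, format() would return the template unchanged.
        raise ValueError("template needs a '{}' placeholder for the page number")
    return [template.format(page) for page in range(first_page, last_page + 1)]


# Hypothetical pagination pattern, used only for illustration.
PAGE_TEMPLATE = "https://example.com/poems/page_{}.html"

if __name__ == "__main__":
    for url in build_page_urls(PAGE_TEMPLATE, 1, 3):
        print(url)
```

Each URL returned this way can then be passed to parse_page in place of the fixed url_base.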