Reading novels on some sites, spam ads kept popping up, so in a fit of annoyance I wrote a crawler that follows the novel's links, crawls the chapters and saves them to a TXT file. Everything is done with requests_html; the code is simple and easy to get started with.
The biggest problems along the way were all encoding problems: first, the crawled novel content was garbled when saved to TXT; second, the URL encoding of the search request; third, a UnicodeEncodeError when writing the file.
The full source code is pasted first; the approach and the problems encountered are described in detail afterwards.
from requests_html import HTMLSession as hs

def get_story(url):
    global f
    session = hs()
    r = session.get(url, headers=headers)
    r.html.encoding = 'GBK'
    title = list(r.html.find('title'))[0].text                    # get the chapter title
    nr = list(r.html.find('.nr_nr'))[0].text                      # get the chapter content
    nextpage = list(r.html.find('#pb_next'))[0].absolute_links    # absolute link to the next page
    nextpage = list(nextpage)[0]
    if nr[0:10] == "_middle();":
        nr = nr[11:]
    if nr[-14:] == 'This chapter is not complete, click on the next page to continue reading':   # the site's trailing notice
        nr = nr[:-15]
    print(title, r.url)
    f.write(title)
    f.write('\n')
    f.write(nr)
    f.write('\n')
    return nextpage

def search_story():
    global bookurl
    global bookname
    haveno = []
    booklist = []
    bookname = input("Please enter the name of the novel you are looking for:\n")
    session = hs()
    payload = {'searchtype': 'articlename', 'searchkey': bookname.encode('GBK'), 't_btnsearch': ''}
    r = session.get(url, headers=headers, params=payload)
    haveno = list(r.html.find('.havno'))        # non-empty means the search found nothing
    booklist = list(r.html.find('.list-item'))  # non-empty means there are multiple results
    while True:
        if haveno != [] and booklist == []:
            print('Sorry! Nothing matched your search, please try again')
            search_story()
            break
        elif haveno == [] and booklist != []:
            print("Found {} novels".format(len(booklist)))
            for book in booklist:
                print(book.text, book.absolute_links)
            search_story()
            break
        else:
            print("Found it, novel link:", r.url)
            bookurl = r.url
            break

url = 'http://m.50zw.net/modules/article/waps.php'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.170 Safari/537.36 OPR/53.0.2907.99'
}

search_story()
chapterurl = bookurl.replace("book", "chapters")
session = hs()
r = session.get(chapterurl, headers=headers)
ch1url = list(r.html.find('.even'))[0].absolute_links   # absolute link to the first chapter
ch1url = list(ch1url)[0]

f = open(bookname + '.txt', 'a', encoding='gb18030', errors='ignore')
print("Starting the download. Chapters are fetched one by one, so it is not fast, please wait....\n")
nextpage = get_story(ch1url)
while nextpage != bookurl:
    nextpage = get_story(nextpage)
f.close()
The crawler's approach and the problems encountered are analysed below.
First we need to find the novel and crawl its link. Take the site http://m.50zw.net/modules/article/waps.php as an example: open the link in a browser, right-click and choose Inspect, then select the Network tab. I use Chrome; pressing F1 opens the DevTools settings, where ticking Preserve log under Network makes it easier to keep the request log. Searching for 帝后世无双, for example, the site jumps straight to the novel's URL: http://m.50zw.net/book_86004/.
Looking at the captured request, the method is GET and the request URL is http://m.50zw.net/modules/article/waps.php?searchtype=articlename&searchkey=%B5%DB%BA%F3%CA%C0%CE%DE%CB%AB&t_btnsearch=
The request has three parameters. searchtype is fixed to articlename, i.e. search by title; searchkey is the title we typed, 帝后世无双, URL-encoded to %B5%DB%BA%F3%CA%C0%CE%DE%CB%AB; and t_btnsearch is empty. In a Python shell we can encode the title both ways and compare:
"Enemy Behind the World". Encode (' GBK '): B ' \xb5\xd0\xba\xf3\xca\xc0\xce\xde\xcb\xab '
"Enemy Behind the World". Encode (' Utf-8 '): B ' \xe6\x95\x8c\xe5\x90\x8e\xe4\xb8\x96\xe6\x97\xa0\xe5\x8f\x8c '
Comparing this output with the captured URL, we can see that the URL parameter is encoded with GBK.
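As a quick cross-check, urllib.parse.quote from the standard library percent-encodes raw bytes, so a minimal sketch (not part of the original script) comparing both encodings against the captured URL could look like this:

from urllib.parse import quote

title = '帝后世无双'                    # the title typed into the search box
print(quote(title.encode('gbk')))      # %B5%DB%BA%F3%CA%C0%CE%DE%CB%AB  -> matches the captured URL
print(quote(title.encode('utf-8')))    # %E5%B8%9D%E5%90%8E%E4%B8%96%E6%97%A0%E5%8F%8C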
Next we use code to verify the results of our analysis.
from requests_html import HTMLSession as hs

url = 'http://m.50zw.net/modules/article/waps.php'
payload = {'searchtype': 'articlename', 'searchkey': '帝后世无双', 't_btnsearch': ''}
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.170 Safari/537.36 OPR/53.0.2907.99'
}
session = hs()
r = session.get(url, headers=headers, params=payload)
print(r.url)
Running result:
http://m.50zw.net/modules/article/waps.php?searchtype=articlename&searchkey=%E5%B8%9D%E5%90%8E%E4%B8%96%E6%97%A0%E5%8F%8C&t_btnsearch=
Compare this URL with the one we captured from the manual search: when no encoding is specified in the code, the parameter is URL-encoded as UTF-8, and because of this encoding mismatch we do not get the result we want. So let's modify the code to specify the encoding and try again:
payload = {'searchtype': 'articlename', 'searchkey': '帝后世无双'.encode('GBK'), 't_btnsearch': ''}
Running it this time gives the URL we want:
http://m.50zw.net/book_86004/
Good, success!!!
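As a side note, an equivalent way to get the GBK percent-encoding is to build the query string yourself, since urllib.parse.urlencode accepts an encoding argument. This is only a sketch that reuses the url, headers and session defined above, not what the script above does:

from urllib.parse import urlencode

# Build the query string with GBK percent-encoding and append it to the search URL.
payload = {'searchtype': 'articlename', 'searchkey': '帝后世无双', 't_btnsearch': ''}
query = urlencode(payload, encoding='gbk')   # searchtype=articlename&searchkey=%B5%DB%BA%F3%CA%C0%CE%DE%CB%AB&t_btnsearch=
r = session.get(url + '?' + query, headers=headers)
print(r.url)                                 # expected to print http://m.50zw.net/book_86004/ again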
Next we need the link to the first chapter; requests_html can grab absolute links for us.
bookurl = 'http://m.50zw.net/book_86004/'
chapterurl = bookurl.replace("book", "chapters")
session = hs()
r = session.get(chapterurl, headers=headers)
ch1url = list(r.html.find('.even'))[0].absolute_links
ch1url = list(ch1url)[0]
print(ch1url)
Running result:
http://m.50zw.net/book_86004/26127777.html
We now have the link to the first chapter. Note that absolute_links returns a set, which is why the code wraps it in list() before indexing.
Next we fetch the chapter content and the next-chapter link, and repeat until the whole novel has been downloaded.
In this part I ran into UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in position 46: illegal multibyte sequence. The problem was finally solved with two extra arguments to the open() call that opens the TXT file: encoding='gb18030', errors='ignore'.
Before that I tried another scheme: replacing u'\xa0' with its equivalent u' '. That fixed the '\xa0' error, but then a '\xb0' error appeared, and I could not keep changing the code every time a similar error showed up, so that scheme was dropped.
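To see the write-time problem in isolation, here is a minimal sketch of the error and the fix, assuming the chapter text contains a non-breaking space (U+00A0), a character the GBK table cannot represent:

text = 'chapter text\xa0with a non-breaking space'

# Plain GBK fails on characters outside its table.
try:
    with open('demo.txt', 'w', encoding='gbk') as f:
        f.write(text)
except UnicodeEncodeError as e:
    print(e)   # 'gbk' codec can't encode character '\xa0' ...

# GB18030 covers all of Unicode, and errors='ignore' silently drops anything that
# still cannot be encoded, so this write succeeds.
with open('demo.txt', 'w', encoding='gb18030', errors='ignore') as f:
    f.write(text)

The code for fetching a single chapter and appending it to the TXT file then looks like this: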
session = hs()
r = session.get(ch1url, headers=headers)
title = list(r.html.find('title'))[0].text
nr = list(r.html.find('.nr_nr'))[0].text
# nr = nr.replace(u'\xa0', u' ')   # the abandoned character-replacement scheme
nextpage = list(r.html.find('#pb_next'))[0].absolute_links
nextpage = list(nextpage)[0]
if nr[0:10] == "_middle();":
    nr = nr[11:]
if nr[-14:] == 'This chapter is not complete, click on the next page to continue reading':   # the site's trailing notice
    nr = nr[:-15]
print(title, r.url)
print(nextpage)
f = open('帝后世无双.txt', 'a', encoding='gb18030', errors='ignore')
f.write(title)
f.write('\n')
f.write(nr)
f.write('\n')