Reading novels on some sites, spam ads kept popping up, so in a fit of annoyance I wrote a crawler that follows the novel's links, crawls the chapters and saves them to a TXT file. Everything is done with requests_html; the code is simple and easy to get started with.
The biggest problems along the way were all encoding problems: first, the crawled novel content was garbled when saved to TXT; second, the URL encoding of the search request; third, a UnicodeEncodeError when writing the file.
The full source code is pasted first; the approach and the problems encountered are described in detail afterwards.
from requests_html import HTMLSession as hs

def get_story(url):
    global f
    session = hs()
    r = session.get(url, headers=headers)
    r.html.encoding = 'GBK'
    title = list(r.html.find('title'))[0].text                    # get the chapter title
    nr = list(r.html.find('.nr_nr'))[0].text                      # get the chapter content
    nextpage = list(r.html.find('#pb_next'))[0].absolute_links    # absolute link to the next page
    nextpage = list(nextpage)[0]
    if nr[0:10] == "_middle();":
        nr = nr[11:]
    if nr[-14:] == 'This chapter is not complete, click on the next page to continue reading':   # the site's trailing notice
        nr = nr[:-15]
    print(title, r.url)
    f.write(title)
    f.write('\n')
    f.write(nr)
    f.write('\n')
    return nextpage

def search_story():
    global bookurl
    global bookname
    haveno = []
    booklist = []
    bookname = input("Please enter the name of the novel you are looking for:\n")
    session = hs()
    payload = {'searchtype': 'articlename', 'searchkey': bookname.encode('GBK'), 't_btnsearch': ''}
    r = session.get(url, headers=headers, params=payload)
    haveno = list(r.html.find('.havno'))        # non-empty means the search found nothing
    booklist = list(r.html.find('.list-item'))  # non-empty means there are multiple results
    while True:
        if haveno != [] and booklist == []:
            print('Sorry! Nothing matched your search, please try again')
            search_story()
            break
        elif haveno == [] and booklist != []:
            print("Found {} novels".format(len(booklist)))
            for book in booklist:
                print(book.text, book.absolute_links)
            search_story()
            break
        else:
            print("Found it, novel link:", r.url)
            bookurl = r.url
            break

url = 'http://m.50zw.net/modules/article/waps.php'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.170 Safari/537.36 OPR/53.0.2907.99'
}

search_story()
chapterurl = bookurl.replace("book", "chapters")
session = hs()
r = session.get(chapterurl, headers=headers)
ch1url = list(r.html.find('.even'))[0].absolute_links   # absolute link to the first chapter
ch1url = list(ch1url)[0]

f = open(bookname + '.txt', 'a', encoding='gb18030', errors='ignore')
print("Starting the download. Chapters are fetched one by one, so it is not fast, please wait....\n")
nextpage = get_story(ch1url)
while nextpage != bookurl:
    nextpage = get_story(nextpage)
f.close()
The crawler's approach and the problems encountered are analysed below.
First we need to find the novel and crawl its link. Take the site http://m.50zw.net/modules/article/waps.php as an example: open the link in a browser, right-click and choose Inspect, then select the Network tab. I use Chrome; pressing F1 opens the DevTools settings, where ticking Preserve log under Network makes it easier to keep the request log. Searching for 帝后世无双, for example, the site jumps straight to the novel's URL: http://m.50zw.net/book_86004/.
Looking at the captured request, the method is GET and the request URL is http://m.50zw.net/modules/article/waps.php?searchtype=articlename&searchkey=%B5%DB%BA%F3%CA%C0%CE%DE%CB%AB&t_btnsearch=
The request has three parameters. searchtype is fixed to articlename, i.e. search by title; searchkey is the title we typed, 帝后世无双, URL-encoded to %B5%DB%BA%F3%CA%C0%CE%DE%CB%AB; and t_btnsearch is empty. In a Python shell we can encode the title both ways and compare:
"Enemy Behind the World". Encode (' GBK '): B ' \xb5\xd0\xba\xf3\xca\xc0\xce\xde\xcb\xab '
"Enemy Behind the World". Encode (' Utf-8 '): B ' \xe6\x95\x8c\xe5\x90\x8e\xe4\xb8\x96\xe6\x97\xa0\xe5\x8f\x8c '
Comparing this output with the captured URL, we can see that the URL parameter is encoded with GBK.
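As a quick cross-check, urllib.parse.quote from the standard library percent-encodes raw bytes, so a minimal sketch (not part of the original script) comparing both encodings against the captured URL could look like this:

from urllib.parse import quote

title = '帝后世无双'                    # the title typed into the search box
print(quote(title.encode('gbk')))      # %B5%DB%BA%F3%CA%C0%CE%DE%CB%AB  -> matches the captured URL
print(quote(title.encode('utf-8')))    # %E5%B8%9D%E5%90%8E%E4%B8%96%E6%97%A0%E5%8F%8C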
Next we use code to verify the results of our analysis.
from requests_html import HTMLSession as hs

url = 'http://m.50zw.net/modules/article/waps.php'
payload = {'searchtype': 'articlename', 'searchkey': '帝后世无双', 't_btnsearch': ''}
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.170 Safari/537.36 OPR/53.0.2907.99'
}
session = hs()
r = session.get(url, headers=headers, params=payload)
print(r.url)
Running result:
http://m.50zw.net/modules/article/waps.php?searchtype=articlename&searchkey=%E5%B8%9D%E5%90%8E%E4%B8%96%E6%97%A0%E5%8F%8C&t_btnsearch=
Compare this URL with the one we captured from the manual search: when no encoding is specified in the code, the parameter is URL-encoded as UTF-8, and because of this encoding mismatch we do not get the result we want. So let's modify the code to specify the encoding and try again:
payload = {'searchtype': 'articlename', 'searchkey': '帝后世无双'.encode('GBK'), 't_btnsearch': ''}
Running it this time gives the URL we want:
http://m.50zw.net/book_86004/
Good, success!!!
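As a side note, an equivalent way to get the GBK percent-encoding is to build the query string yourself, since urllib.parse.urlencode accepts an encoding argument. This is only a sketch that reuses the url, headers and session defined above, not what the script above does:

from urllib.parse import urlencode

# Build the query string with GBK percent-encoding and append it to the search URL.
payload = {'searchtype': 'articlename', 'searchkey': '帝后世无双', 't_btnsearch': ''}
query = urlencode(payload, encoding='gbk')   # searchtype=articlename&searchkey=%B5%DB%BA%F3%CA%C0%CE%DE%CB%AB&t_btnsearch=
r = session.get(url + '?' + query, headers=headers)
print(r.url)                                 # expected to print http://m.50zw.net/book_86004/ again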
Next we need the link to the first chapter; requests_html can grab absolute links for us.
bookurl = 'http://m.50zw.net/book_86004/'
chapterurl = bookurl.replace("book", "chapters")
session = hs()
r = session.get(chapterurl, headers=headers)
ch1url = list(r.html.find('.even'))[0].absolute_links
ch1url = list(ch1url)[0]
print(ch1url)
Running result:
http://m.50zw.net/book_86004/26127777.html
We now have the link to the first chapter. Note that absolute_links returns a set, which is why the code wraps it in list() before indexing.
Next we fetch the chapter content and the next-chapter link, and repeat until the whole novel has been downloaded.
In this part I ran into UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in position 46: illegal multibyte sequence. The problem was finally solved with two extra arguments to the open() call that opens the TXT file: encoding='gb18030', errors='ignore'.
Before that I tried another scheme: replacing u'\xa0' with its equivalent u' '. That fixed the '\xa0' error, but then a '\xb0' error appeared, and I could not keep changing the code every time a similar error showed up, so that scheme was dropped.
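To see the write-time problem in isolation, here is a minimal sketch of the error and the fix, assuming the chapter text contains a non-breaking space (U+00A0), a character the GBK table cannot represent:

text = 'chapter text\xa0with a non-breaking space'

# Plain GBK fails on characters outside its table.
try:
    with open('demo.txt', 'w', encoding='gbk') as f:
        f.write(text)
except UnicodeEncodeError as e:
    print(e)   # 'gbk' codec can't encode character '\xa0' ...

# GB18030 covers all of Unicode, and errors='ignore' silently drops anything that
# still cannot be encoded, so this write succeeds.
with open('demo.txt', 'w', encoding='gb18030', errors='ignore') as f:
    f.write(text)

The code for fetching a single chapter and appending it to the TXT file then looks like this: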
session = hs()
r = session.get(ch1url, headers=headers)
title = list(r.html.find('title'))[0].text
nr = list(r.html.find('.nr_nr'))[0].text
# nr = nr.replace(u'\xa0', u' ')   # the abandoned character-replacement scheme
nextpage = list(r.html.find('#pb_next'))[0].absolute_links
nextpage = list(nextpage)[0]
if nr[0:10] == "_middle();":
    nr = nr[11:]
if nr[-14:] == 'This chapter is not complete, click on the next page to continue reading':   # the site's trailing notice
    nr = nr[:-15]
print(title, r.url)
print(nextpage)
f = open('帝后世无双.txt', 'a', encoding='gb18030', errors='ignore')
f.write(title)
f.write('\n')
f.write(nr)
f.write('\n')