Crawling a novel website with Python to download a novel

Last Update:2018-03-31 Source: Internet

Author: User


1. Preface

This small program crawls a novel website and downloads a novel. Pirate novel sites are generally very easy to crawl.

Sites of this kind have basically no anti-crawling mechanism, so they can be crawled directly.

This program uses the site http://www.126shu.com/15/ as an example and downloads the novel Full-Time Mage.

2. requests library

Document: http://www.python-requests.org/en/master/community/sponsors/

The requests library is very easy to use: it sends a request to a website and receives the response.

One of the simplest examples is:

    r = requests.get(url, timeout=30)
    r.raise_for_status()
    r.encoding = r.apparent_encoding

raise_for_status() raises an exception if the server returns an error status code such as 404 or 502.
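The behaviour of raise_for_status() can be tried without touching the network. The sketch below builds a requests.Response by hand and sets its status code. That is only a testing convenience: normally requests.get() returns the object. The helper name check is hypothetical and not part of the original program.

```python
import requests


def check(status_code):
    """Return "ok" or "error" depending on what raise_for_status() does."""
    # Hand-built Response so the example needs no network access.
    resp = requests.Response()
    resp.status_code = status_code
    try:
        resp.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
        return "ok"
    except requests.HTTPError:
        return "error"


for code in (200, 404, 502):
    print(code, check(code))
```

A 200 response passes through silently, while 404 and 502 both end up in the except branch.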

The line r.encoding = r.apparent_encoding adjusts the encoding. apparent_encoding is guessed from the content of the response, so it may not be accurate.
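To see why the encoding matters, here is a small offline sketch using only the standard library. It assumes the page bytes are GBK, which is common for Chinese sites, but the article does not say which encoding 126shu.com actually uses. It decodes the bytes first with ISO-8859-1, the fallback requests uses for text responses that declare no charset, and then with the correct codec.

```python
# Sample text and the GBK codec are assumptions for illustration only.
original = "全职法师 第一章"
raw = original.encode("gbk")            # bytes as a server might send them

garbled = raw.decode("iso-8859-1")      # wrong codec: unreadable characters
restored = raw.decode("gbk")            # right codec: text is intact

print(repr(garbled))
print(restored)
```

Setting r.encoding before reading r.text has the same effect as choosing the right codec here: the bytes stay the same and only the decoding changes.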

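Wrapped in a function with error handling:

    import requests

    def get_html(url):
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()              # raise on 4xx/5xx responses
            r.encoding = r.apparent_encoding  # use the guessed encoding
            return r.text
        except requests.RequestException:
            return "HTMLERROR"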

3. BeautifulSoup library

BeautifulSoup is a Python library that extracts data from HTML or XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the document.

Document: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Basic usage:

    url_list = []
    soup = BeautifulSoup(html, 'lxml')
    lista = soup.find_all('dd')

The dd elements are searched because, as the F12 developer tools show, the link to each chapter is stored in a dd element.

The link can then be extracted from each element and appended to the list that is returned:

    for url in lista:
        try:
            url_list.append('http://www.126shu.com/' + url.a['href'])
        except (TypeError, KeyError):
            # the dd has no <a> tag, or the <a> has no href
            print('Failed to get link')
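The link extraction can be tried on a small inline page. In this sketch the sample HTML is invented (the real markup of the site may differ) and the function name extract_chapter_links is hypothetical. Two things differ from the article on purpose: it uses the built-in html.parser so that lxml is not required, and it uses urllib.parse.urljoin instead of string concatenation so that a leading slash in href does not produce a double slash.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Invented sample page; the real chapter list may be laid out differently.
SAMPLE_HTML = """
<dl>
  <dd><a href="/15/1001.html">Chapter 1</a></dd>
  <dd><a href="/15/1002.html">Chapter 2</a></dd>
  <dd>no link here</dd>
</dl>
"""


def extract_chapter_links(html, base_url):
    """Collect absolute URLs from the <a> tag inside each <dd> element."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for dd in soup.find_all("dd"):
        a = dd.find("a", href=True)
        if a is None:
            continue  # skip <dd> elements without a usable link
        links.append(urljoin(base_url, a["href"]))
    return links


links = extract_chapter_links(SAMPLE_HTML, "http://www.126shu.com/15/")
print(links)
```

Skipping elements with no link replaces the try/except of the original loop. Which style is preferable depends on whether a missing link should be reported or silently ignored.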

4. Get content

Once the first two steps have been processed, a link to each chapter is obtained and the next step is to process the contents of each chapter.

The F12 developer tools also show where the chapter text and the title are located, so we can write the code that extracts them and appends them to a file:

    def get_content(url):
        # turn <br/> tags into line breaks before parsing
        html = get_html(url).replace('<br/>', '\n')
        soup = BeautifulSoup(html, 'lxml')
        try:
            txt = soup.find('div', id='content').text
            title = soup.find('div', 'hh').text
            with open("full-time mage.txt", 'a+', encoding='utf-8') as f:
                f.write(title + '\n')
                f.write(txt)
        except (AttributeError, OSError):
            # an element was not found, or the file could not be written
            print('ERROR')
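The same parsing and writing steps can be checked offline. In the sketch below the sample page is invented, although the class hh and the id content are taken from the code above. The names parse_chapter and append_chapter are hypothetical, html.parser stands in for lxml, and the output goes to a temporary directory rather than the working directory.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Invented sample page that mimics the structure the article relies on.
SAMPLE_PAGE = (
    '<div class="hh">Chapter 1</div>'
    '<div id="content">First line.<br/>Second line.</div>'
)


def parse_chapter(html):
    """Return (title, text), or None if either element is missing."""
    html = html.replace("<br/>", "\n")  # keep the line breaks of the text
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("div", "hh")
    body_tag = soup.find("div", id="content")
    if title_tag is None or body_tag is None:
        return None
    return title_tag.text, body_tag.text


def append_chapter(path, title, text):
    """Append one chapter to the output file, title first."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(title + "\n")
        f.write(text + "\n")


title, text = parse_chapter(SAMPLE_PAGE)
out_path = os.path.join(tempfile.mkdtemp(), "novel.txt")
append_chapter(out_path, title, text)
print(open(out_path, encoding="utf-8").read())
```

Separating parsing from writing makes it possible to test the selectors on saved pages before crawling the whole book.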

5. Complete code

Because I only wanted to download the later chapters, main() uses a counter n to skip the earlier ones.

    import requests
    import time
    from bs4 import BeautifulSoup

    def get_html(url):
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except requests.RequestException:
            return "HTMLERROR"

    def get_url(url):
        # collect the link of every chapter from the index page
        url_list = []
        html = get_html(url)
        soup = BeautifulSoup(html, 'lxml')
        lista = soup.find_all('dd')
        for url in lista:
            try:
                url_list.append('http://www.126shu.com/' + url.a['href'])
            except (TypeError, KeyError):
                print('Failed to get link')
        return url_list

    def get_content(url):
        # download one chapter and append it to the output file
        html = get_html(url).replace('<br/>', '\n')
        soup = BeautifulSoup(html, 'lxml')
        try:
            txt = soup.find('div', id='content').text
            title = soup.find('div', 'hh').text
            with open("full-time mage.txt", 'a+', encoding='utf-8') as f:
                f.write(title + '\n')
                f.write(txt)
        except (AttributeError, OSError):
            print('ERROR')

    def main(url):
        url_list = get_url(url)
        n = 1
        for url in url_list:
            n = n + 1
            if n > 1525:  # only download the later chapters
                get_content(url)

    url = 'http://www.126shu.com/15/'
    if __name__ == '__main__':
        main(url)
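The counter in main() starts at 1 and is incremented before the comparison, so with the threshold 1525 the first chapter that gets downloaded is the one at list index 1524. The sketch below, with hypothetical function names and a made-up short list, compares the counter loop with a plain slice that selects the same items.

```python
def later_chapters_counter(url_list, threshold):
    """Counter-based selection, mirroring the loop in main()."""
    kept = []
    n = 1
    for url in url_list:
        n = n + 1
        if n > threshold:
            kept.append(url)
    return kept


def later_chapters_slice(url_list, skip):
    """Equivalent selection: drop the first `skip` entries."""
    return url_list[skip:]


# Made-up stand-ins for chapter URLs.
urls = [f"chapter-{i}" for i in range(10)]
print(later_chapters_counter(urls, 4))
print(later_chapters_slice(urls, 3))
```

A threshold of T in the counter version corresponds to skipping the first T - 1 entries, so the slice form url_list[1524:] would be one way to express the same selection more directly.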


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
