Crawling a Novel Website and Downloading Novels with Python

Source: Internet
Author: User

1. Preface

This small program crawls novels from a novel website. Pirate novel sites in general are very easy to crawl: they basically have no anti-crawling mechanisms, so they can be scraped directly.

This program takes http://www.126shu.com/15/, the download page for the novel Full-Time Mage, as an example.

2. requests Library

Document: http://www.python-requests.org/en/master/community/sponsors/

The requests library is very easy to use: it sends a request to a website and receives the response.

One of the simplest examples is:

    r = requests.get(url, timeout=30)
    r.raise_for_status()
    r.encoding = r.apparent_encoding

raise_for_status() throws an exception if the server returns an error status code such as 404 or 502.

The line r.encoding = r.apparent_encoding adjusts the response encoding; apparent_encoding is guessed from the content of the response, so it may not always be accurate.

    def get_html(url):
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except:
            return "HTMLERROR"
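As a quick sanity check, here is a minimal sketch of get_html() in use (the first URL is the example page from this article; the second path is made up to force an error):

    html = get_html('http://www.126shu.com/15/')
    print(html[:200])  # first 200 characters of the chapter list page

    # A nonexistent page makes raise_for_status() throw inside get_html(),
    # so the except branch returns the fallback string
    print(get_html('http://www.126shu.com/no-such-page'))  # HTMLERROR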
3. BeautifulSoup Library

BeautifulSoup is a Python library for extracting data from HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.

Document: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Basic usage:

    url_list = []
    soup = BeautifulSoup(html, 'lxml')
    lista = soup.find_all('dd')

The reason for finding the dd elements is that, as the F12 developer tools show, the link to each chapter is stored inside one of them.

The link can then be extracted from each element and appended to a list, which is returned:

    for url in lista:
        try:
            url_list.append('http://www.126shu.com/' + url.a['href'])
        except:
            print('Failed to get link')
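To see how this fits together, here is a self-contained sketch that runs the same loop over a made-up chapter-list fragment (the real markup on 126shu.com may differ):

    from bs4 import BeautifulSoup

    # Hypothetical fragment shaped like the structure described above
    html = '''
    <dl>
      <dd><a href="15/1.html">Chapter 1</a></dd>
      <dd><a href="15/2.html">Chapter 2</a></dd>
      <dd>no link in this one</dd>
    </dl>
    '''

    url_list = []
    soup = BeautifulSoup(html, 'lxml')
    for url in soup.find_all('dd'):
        try:
            url_list.append('http://www.126shu.com/' + url.a['href'])
        except:
            print('Failed to get link')  # the dd without an <a> lands here

    print(url_list)
    # ['http://www.126shu.com/15/1.html', 'http://www.126shu.com/15/2.html']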
4. Get content

Once the first two steps are done, we have a link to every chapter; the next step is to process the contents of each chapter.

With the F12 developer tools we can likewise find where the body text and the title live in the page, so we can write the code and open a file for writing:

    def get_content(url):
        # Turn <br/> tags into newlines before parsing so line breaks survive
        html = get_html(url).replace('<br/>', '\n')
        soup = BeautifulSoup(html, 'lxml')
        try:
            txt = soup.find('div', id='content').text
            title = soup.find('div', 'hh').text  # second argument matches the class
            with open('full-time mage.txt', 'a+', encoding='utf-8') as f:
                f.write(title + '\n')
                f.write(txt)
        except:
            print('ERROR')
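The replace('<br/>', '\n') on the first line matters: .text drops tags without inserting anything in their place, so without the replacement every line of a chapter would run together. A quick illustration on a made-up fragment:

    from bs4 import BeautifulSoup

    raw = '<div id="content">Line one<br/>Line two</div>'
    print(BeautifulSoup(raw, 'lxml').text)
    # 'Line oneLine two' -- the <br/> is simply dropped
    print(BeautifulSoup(raw.replace('<br/>', '\n'), 'lxml').text)
    # 'Line one\nLine two' -- the line break survives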
5. Complete code

Since I only want to download the later chapters, I use the counter n in main() to skip everything up to the 1525th link.

    import requests
    import time
    from bs4 import BeautifulSoup


    def get_html(url):
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except:
            return "HTMLERROR"


    def get_url(url):
        url_list = []
        html = get_html(url)
        soup = BeautifulSoup(html, 'lxml')
        lista = soup.find_all('dd')
        for url in lista:
            try:
                url_list.append('http://www.126shu.com/' + url.a['href'])
            except:
                print('Failed to get link')
        return url_list


    def get_content(url):
        html = get_html(url).replace('<br/>', '\n')
        soup = BeautifulSoup(html, 'lxml')
        try:
            txt = soup.find('div', id='content').text
            title = soup.find('div', 'hh').text
            with open('full-time mage.txt', 'a+', encoding='utf-8') as f:
                f.write(title + '\n')
                f.write(txt)
        except:
            print('ERROR')


    def main(url):
        url_list = get_url(url)
        n = 1
        for url in url_list:
            n = n + 1
            if n > 1525:
                get_content(url)


    url = 'http://www.126shu.com/15/'
    if __name__ == '__main__':
        main(url)
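One last note: the complete code imports time but never calls it, presumably with rate limiting in mind. A minimal sketch of how main() could pause between chapter downloads (the one-second delay is my assumption, not something the original specifies):

    import time

    def main(url):
        url_list = get_url(url)
        n = 1
        for url in url_list:
            n = n + 1
            if n > 1525:
                get_content(url)
                time.sleep(1)  # assumed pause to go easy on the server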
