1. Preface
This small program crawls novels from a novel website; pirate novel sites in general are easy to crawl.
Such sites usually have no anti-scraping mechanism, so they can be crawled directly.
This program uses http://www.126shu.com/15/ to download the novel "Full-Time Magister" as an example.
2. requests Library
Documentation: http://www.python-requests.org/en/master/community/sponsors/
The requests library is very easy to use: it sends a request to a website and receives the response.
One of the simplest examples is
r = requests.get(url, timeout=30)
r.encoding = r.apparent_encoding
raise_for_status() raises an exception if the server returns an error status code such as 404 or 502.
r.encoding = r.apparent_encoding adjusts the encoding: apparent_encoding is the encoding guessed from the response body, which may not always be accurate.
def get_html(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "HTMLERROR"
3. BeautifulSoup Library
BeautifulSoup is a Python library for extracting data from HTML and XML files. Together with your parser of choice, it provides idiomatic ways to navigate and search the document tree.
Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Basic usage:
url_list = []
soup = BeautifulSoup(html, 'lxml')
lista = soup.find_all('dd')
The dd elements are selected because, as the F12 developer tools show, the link to each chapter is stored inside such an element.
The link can then be extracted from each element and appended to the list that is returned.
for url in lista:
    try:
        url_list.append('http://www.126shu.com/' + url.a['href'])
    except:
        print('Failed to get link')
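To make the link-extraction step concrete without depending on the live site, here is a self-contained sketch. The HTML fragment is made up for illustration (the real site's markup may differ), and html.parser is used so no lxml install is needed:

```python
from bs4 import BeautifulSoup

# Made-up chapter list mimicking the site's structure; not real markup.
html = """
<dl>
  <dd><a href="15/1001.html">Chapter 1</a></dd>
  <dd><a href="15/1002.html">Chapter 2</a></dd>
  <dd>no link in this one</dd>
</dl>
"""

soup = BeautifulSoup(html, 'html.parser')
links = []
for dd in soup.find_all('dd'):
    try:
        # dd.a is None for a <dd> without a link, so subscripting raises TypeError
        links.append('http://www.126shu.com/' + dd.a['href'])
    except TypeError:
        pass

print(links)
# ['http://www.126shu.com/15/1001.html', 'http://www.126shu.com/15/1002.html']
```

Catching the specific TypeError (rather than a bare except) keeps unrelated errors visible while still skipping dd elements that have no anchor tag.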
4. Get content
Once the first two steps are done, we have a link to every chapter; the next step is to process each chapter's contents.
With the F12 developer tools we can also find where the text and the title are located, so we can write the code and open a file for writing.
def get_content(url):
    html = get_html(url).replace('<br/>', '\n')
    soup = BeautifulSoup(html, 'lxml')
    try:
        txt = soup.find('div', id='content').text
        title = soup.find('div', 'hh').text
        with open("full-time mage.txt", 'a+', encoding='utf-8') as f:
            f.write(title + '\n')
            f.write(txt)
    except:
        print('ERROR')
5. Complete code
Since I only want to download the later chapters, I use a counter n in the main() function to skip the earlier ones.
import requests
import time
from bs4 import BeautifulSoup

def get_html(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "HTMLERROR"

def get_url(url):
    url_list = []
    html = get_html(url)
    soup = BeautifulSoup(html, 'lxml')
    lista = soup.find_all('dd')
    for url in lista:
        try:
            url_list.append('http://www.126shu.com/' + url.a['href'])
        except:
            print('Failed to get link')
    return url_list

def get_content(url):
    html = get_html(url).replace('<br/>', '\n')
    soup = BeautifulSoup(html, 'lxml')
    try:
        txt = soup.find('div', id='content').text
        title = soup.find('div', 'hh').text
        with open("full-time mage.txt", 'a+', encoding='utf-8') as f:
            f.write(title + '\n')
            f.write(txt)
    except:
        print('ERROR')

def main(url):
    url_list = get_url(url)
    n = 1
    for url in url_list:
        n = n + 1
        if n > 1525:
            get_content(url)

url = 'http://www.126shu.com/15/'
if __name__ == '__main__':
    main(url)
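The n counter in main() can also be expressed as a list slice. This is a sketch, not part of the original program: since n starts at 1 and is incremented before the n > 1525 test, the first chapter actually processed is url_list[1524], so slicing from that index is equivalent:

```python
def select_chapters(url_list, start_index=1524):
    """Return only the chapter links from start_index onward.

    Equivalent to the n counter in main(): n starts at 1 and is
    incremented before the n > 1525 check, so the first processed
    element is url_list[1524].
    """
    return url_list[start_index:]

# Small demo with a fabricated list (not real chapter URLs):
demo = ['http://www.126shu.com/15/%d.html' % i for i in range(5)]
print(select_chapters(demo, start_index=3))
# ['http://www.126shu.com/15/3.html', 'http://www.126shu.com/15/4.html']
```

Slicing also handles the edge case where the list is shorter than the threshold: the result is simply an empty list instead of a loop that silently does nothing.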