Import Urllib.request
Import re
Def getnvvel ():
html = Urllib.request.urlopen ("http://www.quanshuwang.com/book/44/44683"). Read (). Decode (' GBK ') # download sould code
URLs = Re.findall (R ' <li><a href= "(.
?)" Title= ".?" > (.
?) </a></li> ', HTML) # Regular expression
title = "Douluo" # normoally,you should use Request.urlopen
f = open ('.. /novel/%s.txt '% title, ' W ') # Create a Douluo.txt
For URL in URLs:
Chapter_url = url[0]
Chapter_title = url[1]
Chapter_content_list = Urllib.request.urlopen (Chapter_url). Read (). Decode ("GBK")
Chapter_content_list = Re.findall (R ' </script>.? <br/> (. *?) <script type= "Text/javascript" > ", Chapter_content_list, re. S
For chapter_content in Chapter_content_list:
Chapter_content = Chapter_content.replace ("", "" ")
Chapter_content = Chapter_content.replace ("<br/>", "" ")
F.write (chapter_title) # type Chapter_title in Douluo.txt
F.write (chapter_content) # type chapter_content in Douluo.txt
F.write (' \ n ') #为了分行更清楚
Getnvvel ()
If you think your code is not easy to find, you can add a header like
headers = {' user-agent ': ' mozilla/5.0 (Windows NT 6.1; Win64; x64) applewebkit/537.36 (khtml, like Gecko) chrome/62.0.3202.89 safari/537.36 '}
html = Request.urlopen (URL, headers=headers)
Of course, you can do it for harmony.
Import time
At a later location plus a download location plus a
Time.sleep (1)
Of course, if you want to add some other anti-reptile stuff, you'll have to study harder.
Python ultra-simplified 18 lines of code crawl a novel