A Python script for downloading and reading the Daomu Biji (盗墓笔记) novels online

I've been busy reading novels lately and found a site hosting Nanpai Sanshu's complete works, so I decided to download them to read offline. I rolled up my sleeves and, with a lot of help from the masters in a QQ group (my skills are very weak; the more complex parts were written under their guidance), spent three or four days putting together a script.
The script requires two libraries: BeautifulSoup (bs4) and requests.
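In case anyone wants to run it, both can be installed with pip (assuming pip is set up; these are the package names as published on PyPI):

pip install beautifulsoup4 requests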
(I've commented the code in as much detail as I could.)
The script runs very slowly, though. Masters, please tell me how to optimize it!!
#-*-coding:utf8-*-
from bs4 import BeautifulSoup
import requests
import re
import os

# Open the index page and collect the URLs we need into a list
r = requests.get('http://www.nanpaisanshu.org/').content  # fetch the page to read
content = BeautifulSoup(r).findAll('a', href=re.compile(r'\Ahttp://www.nanpaisanshu.org/[a-z]+\Z'))  # find the links we need
sc = str(content)  # convert to a string
lists = sc.split(',')
lists = list(set(lists))  # remove duplicate entries from the list

lisy = []
for line in lists:
    p = line.split('"')[1]  # split on " and take out the URL we need
    lisy.append(p)  # lisy now holds the URLs of the book pages
# print lisy

# Open every collected URL and save each page into an HTML file
s = os.getcwd()  # current path
d = os.sep       # system path separator
namef = 'aaa'    # folder name
f = os.path.exists(s + d + namef)  # check whether the folder exists
if f == False:
    os.mkdir(s + d + namef)  # create the folder if it does not exist yet
else:
    print u'already exists: ' + namef
filenm = s + d + namef + d  # path to write into

i = 1
for line in lisy:
    r = requests.get(line)  # open each URL in turn
    print r.content
    print '\n'
    tfile = open(filenm + 'neirong' + str(i) + '.html', 'w')
    i = i + 1
    tfile.write(r.content)  # write the page contents into the file
    tfile.close()  # close each file right away (originally it was closed only once, at the very end)

# From each saved HTML file, pull out the chapter URLs that match and write them into a txt file
for i in range(1, len(lisy) + 1):
    fp = open(filenm + 'neirong' + str(i) + '.html', 'r')
    of = open(filenm + 'neirong' + str(i) + '.txt', 'w')
    content = fp.read()  # read in the file contents
    p = re.compile(r'http://www\.nanpaisanshu\.org/.*?\.html')  # regex matching chapter URLs
    for line in p.findall(content):
        of.write(line + '\n')  # write every matched URL into the other file
    of.close()  # close the files
    fp.close()

# Fetch each chapter from the txt lists and append title and body into one big txt per book
for i in range(1, len(lisy) + 1):
    ot = open(filenm + 'neirong' + str(i) + '.txt', 'r')
    outfile = open(filenm + 'quanbu' + str(i) + '.txt', 'a+')
    li = []
    for line in ot:
        line = line.replace('\n', '')
        li.append(line)  # put the URLs from the txt file into a list
    li = sorted(li)  # sort the list
    for line in li:
        print line
        r = requests.get(line).content  # open every chapter URL in turn
        title = BeautifulSoup(r).find('div', {'class': 'post_title'}).h2  # take out the title
        content = BeautifulSoup(r).findAll('div', {'class': 'post_entry'})  # take out the body
        # the original post is cut off at this point; stripping the tags and
        # writing the result out, as below, is my reconstruction of the missing tail
        sti = str(title).replace('<h2>', '').replace('</h2>', '')
        outfile.write(sti + '\n' + str(content) + '\n')
    ot.close()
    outfile.close()
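One small tidy-up someone might try: bs4 can hand back each link's href attribute directly, so the str()/split('"') juggling at the top is not needed. A minimal sketch (same page and regex as above; not how the script currently does it):

links = BeautifulSoup(r).findAll('a', href=re.compile(r'\Ahttp://www.nanpaisanshu.org/[a-z]+\Z'))
lisy = sorted(set(a['href'] for a in links))  # dedupe, no string splitting needed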
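On the speed question: almost all of the time goes to waiting on the network, one page at a time, so the usual fix is to download several pages at once. A minimal sketch using a thread pool (the fetch/fetch_all names, the worker count, and the timeout are my own choices, not from the original script):

from multiprocessing.dummy import Pool  # thread pool; suits I/O-bound work like downloads
import requests

def fetch(url):
    # download one page body; the timeout stops a bad connection from hanging forever
    return url, requests.get(url, timeout=30).content

def fetch_all(urls, workers=8):
    pool = Pool(workers)  # 8 downloads in flight at once
    try:
        return pool.map(fetch, urls)  # results come back in the same order as urls
    finally:
        pool.close()
        pool.join()

# usage: for url, body in fetch_all(lisy): write body to its file as before

On a connection of only a few KB/s this mostly helps with the many small chapter pages; it cannot make any single download faster.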
Sometimes the connection fails and then the program errors out, so the status should really be checked with requests.get(url).status_code != 200. But after I added that, the script ran even slower, since every page was being judged. Sweat. It's probably my few-KB/s connection that causes the failures in the first place. (One likely reason the check cost so much: reading status_code off a response you already have is free, but calling requests.get(url) once for the check and again for the content downloads every page twice.)
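A way to get the error check without paying twice (a sketch; safe_get, the retry count, and the pause are my own choices): fetch each page once, keep the response, and retry the few that fail:

import time
import requests

def safe_get(url, retries=3, timeout=30):
    # fetch once; the same response object serves both the status
    # check and the content, so nothing is downloaded twice
    for attempt in range(retries):
        try:
            r = requests.get(url, timeout=timeout)
            if r.status_code == 200:
                return r.content
        except requests.RequestException:
            pass  # connection failed; fall through and retry
        time.sleep(2)  # short pause before retrying on a flaky line
    return None  # caller decides what to do with a page that never loaded

The loops above could then call safe_get(line) instead of requests.get(line).content and simply skip a None result.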