Cause
Late one night I suddenly wanted to download a few ebooks to fill up my Kindle, and it struck me that my Python was still too shallow: I had never learned things like decorators or multithreading.
That made me think of Liao Xuefeng's Python tutorial, a well-known classic. I went looking for a downloadable PDF version and found nothing!! CSDN had an incomplete copy, and it cheated me out of a point!! Damn!!
Angry, I decided to write a program to crawl Liao Xuefeng's tutorial directly and then turn the HTML into an ebook.
Process
The process was fun: using my shallow Python knowledge to write a Python program that crawls a Python tutorial so that I can learn Python. Just thinking about it is a little exciting...
Sure enough, Python is very convenient: about 50 lines did the job. Here is the code:
# coding:utf-8
# Python 2 script (urllib.urlopen, print statement, str.decode)
import urllib

domain = 'http://www.liaoxuefeng.com'  # Liao Xuefeng's domain name
path = r'C:\Users\cyhhao2013\Desktop\temp\\'  # directory where the HTML is saved

# an HTML header file, prepended to every saved page
input = open(r'C:\Users\cyhhao2013\Desktop\0.html', 'r')
head = input.read()
input.close()

# open the main page of the Python tutorial
f = urllib.urlopen("http://www.liaoxuefeng.com/wiki/001374738125095c955c1e6d8bb493182103fac9270762a000")
home = f.read()
f.close()

# strip all spaces and newlines (makes the URLs easy to grab)
geturl = home.replace("\n", "")
geturl = geturl.replace(" ", "")

# get the strings containing the URLs
list = geturl.split(r'em;"><ahref="')[1:]

# obsessive-compulsive touch: add the first page too, for completeness
list.insert(0, '/wiki/001374738125095c955c1e6d8bb493182103fac9270762a000">')

# walk through the URL list
for li in list:
    url = li.split(r'">')[0]
    url = domain + url  # build the full URL
    print url
    f = urllib.urlopen(url)
    html = f.read()
    f.close()

    # get the title, to use as the file name
    # (this suffix was translated along with the prose in this copy;
    # use the literal suffix from the page's <title>)
    title = html.split("<title>")[1]
    title = title.split("-Liaoche's official website </title>")[0]

    # decode it, otherwise adding it to the path ends badly
    title = title.decode('utf-8').replace("/", " ")

    # cut out the main text
    # (start marker as best recoverable from this copy; check the page source)
    html = html.split(r'<!-- block main -->')[1]
    # the end marker was lost in this copy; an empty separator raises
    # ValueError, so put the string that follows the article body here
    html = html.split(r'')[0]
    html = html.replace(r'src="', 'src="' + domain)

    # add the head and the tail to make a complete HTML page
    html = head + html + "</body>"

    # write the output file
    output = open(path + "%d" % list.index(li) + title + '.html', 'w')
    output.write(html)
    output.close()
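For anyone on Python 3, the two fiddly steps of the script (pulling the chapter links out of the index page and building a safe file name) can be tried without touching the network. This is only a rough sketch, not the script above: `extract_links`, `safe_filename`, and the sample HTML are names and data made up for illustration, and it assumes the index page still has the `em;"><a href="` pattern the script splits on.

```python
import re
from urllib.parse import urljoin

DOMAIN = "http://www.liaoxuefeng.com"  # same domain as in the script


def extract_links(home_html, first_page):
    """Return absolute chapter URLs in page order, first page included.

    Hypothetical helper: it reuses the script's trick of deleting
    newlines and spaces so the split marker becomes one contiguous string.
    """
    compact = home_html.replace("\n", "").replace(" ", "")
    chunks = compact.split('em;"><ahref="')[1:]
    paths = [first_page] + [chunk.split('">')[0] for chunk in chunks]
    return [urljoin(DOMAIN, p) for p in paths]


def safe_filename(index, title):
    """Build '<index><title>.html', replacing characters that Windows
    does not allow in file names with spaces (hypothetical helper)."""
    cleaned = re.sub(r'[\\/:*?"<>|]', " ", title).strip()
    return "%d%s.html" % (index, cleaned)


if __name__ == "__main__":
    # Invented sample markup, shaped like the pattern the script expects.
    sample = (
        '<li style="margin-left:1em;"><a href="/wiki/abc/001">Intro</a></li>\n'
        '<li style="margin-left:2em;"><a href="/wiki/abc/002">Install</a></li>'
    )
    for i, url in enumerate(extract_links(sample, "/wiki/abc")):
        print(i, url)
    print(safe_filename(0, "a/b"))
```

Numbering the files with `enumerate` rather than `list.index(li)` avoids rescanning the list on every pass and keeps the numbers right even if two entries happen to be identical.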
Life is short, I use Python!
Finally
Here is an online tool for converting HTML to the EPUB ebook format: html.toepub.com
And Liao Xuefeng's tutorial: link. His Git tutorial is also very, very good~
Plus my own →_→ GitHub (the crawled HTML is on GitHub, too)
and my personal blog: http://blog.zhusun.in/cyhhao
Original source: A Python crawler applet
By: cyhhao http://blog.zhusun.in/cyhhao/