A small Python crawler
Motivation
Late at night, I suddenly wanted to download some e-books to fill up my Kindle, and I realized that my Python knowledge was too shallow: I had never even learned "decorators" or "multithreading".
Then I thought of Liao Xuefeng's Python tutorial, which is classic and famous. I just wanted to find a PDF download of it, but found nothing!! An incomplete copy on CSDN even cheated me out of a credit!! Damn!!
Angry, I decided to write a program to crawl Liao Xuefeng's tutorial directly and then convert the HTML into an e-book.
Process
The process was very interesting: I used my superficial Python knowledge to write a Python program that crawls a Python tutorial so that I could learn Python. A little exciting ......
Sure enough, Python is very convenient, and about 50 lines are enough. Here is the code:
    # coding: utf-8
    import urllib

    domain = 'http://www.liaoxuefeng.com'  # Liao Xuefeng's domain name
    path = r'c:\Users\cyhhao2013\Desktop\temp\\'  # directory where the HTML files are saved

    # read an HTML header file
    input = open(r'C:\Users\cyhhao2013\Desktop\0.html', 'r')
    head = input.read()

    # open the main page of the Python tutorial
    f = urllib.urlopen("http://www.liaoxuefeng.com/wiki/001374738125095c955c1e6d8bb493182103fac9270762a000")
    home = f.read()
    f.close()

    # remove all spaces and newlines (this makes the URLs easier to extract)
    geturl = home.replace("\n", "")
    geturl = geturl.replace(" ", "")

    # get the list of strings that contain the URLs
    list = geturl.split(r'em;"><ahref="')[1:]

    # the first page must be added too, so that the list is complete
    list.insert(0, '/wiki/001374738125095c955c1e6d8bb493182103fac9270762a000">')

    # start traversing the URL list
    for li in list:
        url = li.split(r'">')[0]
        url = domain + url  # assemble the full URL
        print url
        f = urllib.urlopen(url)
        html = f.read()

        # get the title to use as the file name
        title = html.split("<title>")[1]
        title = title.split(" - Liao Xuefeng's official website</title>")[0]

        # decode it first, otherwise adding it to the path ends in tragedy
        title = title.decode('utf-8').replace("/", "")

        # cut out the body
        html = html.split(r'<!-- block main -->')[1]
        html = html.split(r'
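The listing pulls links, titles, and page bodies out of raw HTML by splitting on fixed marker strings. Below is a minimal Python 3 sketch of that technique (the listing itself is Python 2), run on an in-memory sample so that no network access is needed. The function names (`extract_links`, `extract_title`, `extract_body`, `safe_filename`), the "Example Site" title suffix, and the sample HTML are all hypothetical and made up for illustration; the real pages may be laid out differently.

```python
# Python 3 sketch of the marker-splitting technique.
# All function names and the sample HTML below are hypothetical.

DOMAIN = "http://www.liaoxuefeng.com"


def extract_links(index_html, marker='em;"><ahref="'):
    """Return absolute URLs for every link that follows `marker`."""
    # Strip newlines and spaces first so the marker matches regardless of layout.
    compact = index_html.replace("\n", "").replace(" ", "")
    chunks = compact.split(marker)[1:]  # text before the first marker holds no link
    return [DOMAIN + chunk.split('">')[0] for chunk in chunks]


def extract_title(page_html, suffix=" - Example Site</title>"):
    """Return the text between <title> and the site-name suffix."""
    after_open = page_html.split("<title>")[1]
    return after_open.split(suffix)[0]


def extract_body(page_html, start="<!-- block main -->"):
    """Return everything after the `start` marker."""
    return page_html.split(start)[1]


def safe_filename(title):
    """Drop '/' so the title is not read as a directory separator."""
    return title.replace("/", "")


# Made-up sample input that mimics the markers used above.
SAMPLE_INDEX = """
<li style="margin-left:1em;"><a href="/wiki/aaa">Intro</a></li>
<li style="margin-left:2em;"><a href="/wiki/aaa/bbb">Install</a></li>
"""
SAMPLE_PAGE = (
    "<html><head><title>Input/Output - Example Site</title></head>"
    "<body><nav>menu</nav><!-- block main --><p>Hello</p></body></html>"
)

links = extract_links(SAMPLE_INDEX)
title = extract_title(SAMPLE_PAGE)
body = extract_body(SAMPLE_PAGE)

print(links)
print(safe_filename(title))
print(body)
```

Splitting on literal markers is short but brittle: if the site changes its markup, the `[1]` index raises an `IndexError`. An HTML parser would be a sturdier choice for anything longer-lived.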
Life is short, I use Python!
Finally
A link for converting HTML to the EPUB e-book format: html.toepub.com
And a link to Liao Xuefeng's tutorials; his Git tutorial is also very good ~
By the way, a plug for my own →_→ GitHub (the crawled HTML is on GitHub too ~)
And personal blog: http://blog.zhusun.in/cyhhao
Original article: A small Python crawler
By: cyhhao http://blog.zhusun.in/cyhhao/