This morning I had nothing to do when a page of Qiushibaike jokes inexplicably popped up, so I figured that since the site had come knocking on my door, I might as well write a crawler to scrape it: partly as practice, partly just for some fun.
As it happens, I have also been learning about databases these past two days, so the crawled data could be stored in a database for later use. Okay, enough chatter; let's take a look at the data the program crawled.
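As a side note, here is a minimal sketch of what storing the scraped records in a database might look like, using Python's built-in sqlite3 module. The database file name, table name, and columns are my own illustration, not part of the program below:

import sqlite3

# Open (or create) a local SQLite database file (name is illustrative).
conn = sqlite3.connect('qiushibaike.db')
conn.execute('CREATE TABLE IF NOT EXISTS jokes (author TEXT, content TEXT, vote TEXT)')

def save_joke(author, content, vote):
    # Parameterized INSERT avoids manual string escaping.
    conn.execute('INSERT INTO jokes VALUES (?, ?, ?)', (author, content, vote))
    conn.commit()

save_joke(u'some author', u'some joke text', u'123')
conn.close()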
It's worth mentioning that while trying to crawl 30 pages of the site I hit a connection error, but when I lowered the page count to 20, the program ran normally. I don't know the reason; if any expert knows, please tell me, I'd be grateful.
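My unverified guess is that the site throttles or drops rapid repeated connections. A sketch of two common mitigations, sending a browser-like User-Agent header and retrying with a short pause; the header value, retry count, and delay here are illustrative assumptions, not a confirmed fix:

import time
import requests

HEADERS = {'User-Agent': 'Mozilla/5.0'}  # illustrative browser-like header

def fetch(url, retries=3, delay=2):
    # Retry a few times with a pause, in case the failure is transient throttling.
    last_err = None
    for attempt in range(retries):
        try:
            return requests.get(url, headers=HEADERS, timeout=10)
        except requests.exceptions.ConnectionError as err:
            last_err = err
            time.sleep(delay)
    raise last_err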
The program is very simple; the full source code follows.
# coding=utf8
import re
import sys
import requests
from lxml import etree
from multiprocessing.dummy import Pool as ThreadPool

# Python 2: make sure non-ASCII text can be written out as UTF-8.
reload(sys)
sys.setdefaultencoding('utf-8')


def getnewpage(url, total):
    # Read the page number out of the URL, then build the URL for every
    # page from the current one up to `total`.
    nowpage = int(re.search(r'(\d+)', url, re.S).group(1))
    urls = []
    for i in range(nowpage, total + 1):
        link = re.sub(r'(\d+)', str(i), url)
        urls.append(link)
    return urls


def spider(url):
    # Fetch one page and pull out author, joke text and vote count via XPath.
    html = requests.get(url)
    selector = etree.HTML(html.text)
    author = selector.xpath('//*[@id="content-left"]/div/div[1]/a[2]/@title')
    content = selector.xpath('//*[@id="content-left"]/div/div[2]/text()')
    vote = selector.xpath('//*[@id="content-left"]/div/div[3]/span/i/text()')
    length = len(author)
    for i in range(0, length):
        f.writelines('author: ' + author[i] + '\n')
        f.writelines('content: ' + str(content[i]).replace('\n', '') + '\n')
        f.writelines('support: ' + vote[i] + '\n')


if __name__ == '__main__':
    f = open('info.txt', 'a')
    url = 'http://www.qiushibaike.com/text/page/1/'
    urls = getnewpage(url, 20)   # crawl pages 1 through 20
    pool = ThreadPool(4)         # 4 worker threads
    pool.map(spider, urls)
    f.close()
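One caveat worth flagging: the four pool threads all write to the shared file f, so lines from different pages can interleave in info.txt. A minimal sketch of guarding the writes with a lock, assuming f is the file opened in the main block above; the lock and helper function are my additions, not part of the original program:

import threading

write_lock = threading.Lock()  # my addition: serializes writes to the shared file

def write_record(author, content, vote):
    # Hold the lock so one record's three lines stay together in info.txt.
    with write_lock:
        f.write('author: ' + author + '\n')
        f.write('content: ' + content + '\n')
        f.write('support: ' + vote + '\n')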
If any part is unclear, you can refer to my previous three articles in order.