A Python crawler example: scraping Qiushibaike (the "Embarrassing Encyclopedia")

Source: Internet
Author: User
Tags: xpath

This morning I had nothing to do when a Qiushibaike joke inexplicably popped up, so I figured that since the site had come knocking, I would write a crawler to scrape it: partly as practice, partly just for fun.

In fact, I have also been working with databases these past two days, so the crawled data could be stored in a database for later use. Okay, no more chatter; let's look at the data the program crawled.
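As a sketch of the database idea, the crawled records could go into SQLite. Note that the `jokes` table and its `author`/`content`/`vote` columns are my own illustrative naming, not something from the original script.

```python
# Minimal sketch: store crawled (author, content, vote) records in SQLite.
# The table name "jokes" and its columns are illustrative choices.
import sqlite3


def save_items(conn, items):
    """Insert (author, content, vote) tuples into the jokes table."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS jokes (author TEXT, content TEXT, vote TEXT)'
    )
    conn.executemany('INSERT INTO jokes VALUES (?, ?, ?)', items)
    conn.commit()


def count_items(conn):
    """Return how many records have been stored so far."""
    return conn.execute('SELECT COUNT(*) FROM jokes').fetchone()[0]
```

A crawler run would open one connection, e.g. `sqlite3.connect('jokes.db')`, and pass each page's scraped tuples to `save_items` instead of writing to a text file.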

It's worth mentioning that when I tried to crawl 30 pages of the site, I hit a connection error; when I reduced the count to 20 pages, the program ran normally. I don't know the reason. If any expert knows, please tell me; I'd be grateful.
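My guess (not confirmed) is that the connection error comes from the site throttling rapid requests. A common workaround is to retry with a growing pause between attempts. Here is a sketch, written so that any fetch callable (for example `requests.get`) can be wrapped; the `retries` and `delay` defaults are illustrative, not tuned values:

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url), retrying on failure with an increasing pause.

    retries and delay are illustrative defaults, not tuned values.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            time.sleep(delay * (attempt + 1))  # back off before retrying
```

In the crawler below, `spider` could then call `fetch_with_retry(requests.get, url)` instead of `requests.get(url)`.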

The program is very simple, so here is the source code:

```python
# coding=utf-8
import re
import sys
import requests
from lxml import etree
from multiprocessing.dummy import Pool as ThreadPool

# Python 2 only: force the default encoding to UTF-8 for Chinese text
reload(sys)
sys.setdefaultencoding('utf-8')


def getnewpage(url, total):
    """Build the list of page URLs from the current page number up to total."""
    nowpage = int(re.search(r'(\d+)', url, re.S).group(1))
    urls = []
    for i in range(nowpage, total + 1):
        link = re.sub(r'(\d+)', str(i), url, flags=re.S)
        urls.append(link)
    return urls


def spider(url):
    """Fetch one page and append each joke's author, content and votes to f."""
    html = requests.get(url)
    selector = etree.HTML(html.text)
    author = selector.xpath('//*[@id="content-left"]/div/div[1]/a[2]/@title')
    content = selector.xpath('//*[@id="content-left"]/div/div[2]/text()')
    vote = selector.xpath('//*[@id="content-left"]/div/div[3]/span/i/text()')
    for i in range(len(author)):
        f.writelines('Author: ' + author[i] + '\n')
        f.writelines('Content: ' + str(content[i]).replace('\n', '') + '\n')
        f.writelines('Support: ' + vote[i] + '\n')


if __name__ == '__main__':
    f = open('info.txt', 'a')
    url = 'http://www.qiushibaike.com/text/page/1/'
    urls = getnewpage(url, 20)  # 20 pages; 30 triggered the connection error
    pool = ThreadPool(4)
    pool.map(spider, urls)
    f.close()
```
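Since the script above targets Python 2 (`reload`/`setdefaultencoding`), here is a stripped-down Python 3 sketch of the same page-expansion step, kept to the standard library so it can be checked quickly; `expand_pages` is my own name for the helper:

```python
import re


def expand_pages(url, total):
    """Replace the page number in url with every value from the current one to total."""
    start = int(re.search(r'(\d+)', url).group(1))
    return [re.sub(r'(\d+)', str(i), url) for i in range(start, total + 1)]
```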


If any part is unclear, you can refer to my previous three articles in turn.
