How Python crawlers crawl V2EX website posts

Background:

PySpider: a powerful web crawler system written in Python by Chinese developers, with a powerful WebUI. It has a distributed architecture and supports multiple database backends. The WebUI includes a script editor, task monitor, project manager, and result viewer. Online example: http://demo.pyspider.org/

Official documentation: http://docs.pyspider.org/en/l...

Github: https://github.com/binux/pysp...

The GitHub address of this crawler's code: https://github.com/zhisheng17...

More articles can be found on my public account, yuanblog; please follow it.

Now, let's get to the main content!

Prerequisites:

You have installed PySpider and MySQL-python (used to save the data).

If you haven't installed them yet, read my previous articles first to avoid detours; they cover some of the errors I encountered:

Some pitfalls of PySpider framework learning

HTTP 599: SSL certificate problem: unable to get local issuer certificate error

Replace the URL in the self.crawl call of the on_start function:

@every(minutes=24 * 60)
def on_start(self):
    self.crawl('https://www.v2ex.com/', callback=self.index_page, validate_cert=False)

self.crawl tells PySpider to fetch the specified page and then parse the result with the callback function.

The @every decorator indicates that on_start will be executed once a day, so the latest posts can always be captured.

validate_cert=False is required; otherwise the "HTTP 599: SSL certificate problem: unable to get local issuer certificate" error will be reported.
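As a side note, if you do not want to repeat validate_cert=False in every self.crawl call, PySpider's crawl_config can hold default parameters for the whole project. A minimal sketch, assuming the rest of the handler stays as written below:

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    # default self.crawl parameters for the whole project, so each call
    # no longer needs validate_cert=False individually
    crawl_config = {
        'validate_cert': False,
    }

    @every(minutes=24 * 60)
    def on_start(self):
        # same as above, minus the per-call validate_cert argument;
        # index_page and the other callbacks are unchanged
        self.crawl('https://www.v2ex.com/', callback=self.index_page)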

Home page:

Click the green "run" button and you will see a red 1 appear on follows. Switch to the follows panel and click the green play button:

On the tab list page, we need to extract the URLs of all the topic list pages. As you may have noticed, the sample handler has already extracted a large number of URLs.

Code:

@config(age=10 * 24 * 60 * 60)
def index_page(self, response):
    for each in response.doc('a[href^="https://www.v2ex.com/?tab="]').items():
        self.crawl(each.attr.href, callback=self.tab_page, validate_cert=False)

Because the post list pages do not look the same as the tab list page, a new callback, self.tab_page, is created here.

@config(age=10 * 24 * 60 * 60) indicates that the page is considered valid for 10 days and will not be re-crawled within that period.
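response.doc is a PyQuery object, so if you want to check what a CSS selector will match before putting it into the handler, you can experiment with pyquery directly. A rough sketch follows; the HTML snippet is made up purely for illustration:

from pyquery import PyQuery as pq

# made-up HTML purely for illustration
html = '''
<a href="https://www.v2ex.com/?tab=tech">Tech</a>
<a href="https://www.v2ex.com/?tab=creative">Creative</a>
<a href="https://www.v2ex.com/about">About</a>
'''
doc = pq(html)
# same selector as index_page: only links whose href starts with the tab URL
for each in doc('a[href^="https://www.v2ex.com/?tab="]').items():
    print each.attr.href  # prints the two ?tab= links, skips /about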

Go List page:

Code:

@config(age=10 * 24 * 60 * 60)
def tab_page(self, response):
    for each in response.doc('a[href^="https://www.v2ex.com/go/"]').items():
        self.crawl(each.attr.href, callback=self.board_page, validate_cert=False)

Post detail page (/t/):

As you can see, some reply links appear in the results; we can remove them since we do not need them.

At the same time, we also need it to turn pages automatically.

Code:

@config(age=10 * 24 * 60 * 60)
def board_page(self, response):
    for each in response.doc('a[href^="https://www.v2ex.com/t/"]').items():
        url = each.attr.href
        if url.find('#reply') > 0:
            url = url[0:url.find('#')]
        self.crawl(url, callback=self.detail_page, validate_cert=False)
    for each in response.doc('a.page_normal').items():
        self.crawl(each.attr.href, callback=self.board_page, validate_cert=False)  # automatic page turning
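The url.find('#reply') logic above simply drops the URL fragment so the same topic is not crawled twice. For reference, the standard library can do the same thing; this small standalone sketch is not part of the original code:

from urlparse import urldefrag  # urllib.parse on Python 3

# example URL with a reply anchor
url, fragment = urldefrag('https://www.v2ex.com/t/123456#reply12')
print url       # https://www.v2ex.com/t/123456
print fragment  # reply12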

Run it again after removing them:

Code:

@config(priority=2)
def detail_page(self, response):
    title = response.doc('h1').text()
    content = response.doc('p.topic_content').html().replace('"', '\\"')
    self.add_question(title, content)  # insert into the database
    return {
        "url": response.url,
        "title": title,
        "content": content,
    }

Before inserting into the database, we need to define an add_question function.
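add_question writes into a question table in a local MySQL database named wenda. The article does not show the table schema, so here is a guessed minimal one, inferred from the columns used in the INSERT statement below (column names and types are my assumption; adjust them to your own setup):

import MySQLdb

# guessed schema, not from the original article; adjust names and types as needed
db = MySQLdb.connect('localhost', 'root', 'root', 'wenda', charset='utf8')
cursor = db.cursor()
cursor.execute('''
    CREATE TABLE IF NOT EXISTS question (
        id INT AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(255),
        content TEXT,
        user_id INT,
        created_date DATETIME,
        comment_count INT DEFAULT 0
    ) DEFAULT CHARSET=utf8
''')
db.commit()
db.close()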

# Connect to the database
def __init__(self):
    self.db = MySQLdb.connect('localhost', 'root', 'root', 'wenda', charset='utf8')

def add_question(self, title, content):
    try:
        cursor = self.db.cursor()
        sql = 'insert into question(title, content, user_id, created_date, comment_count) values ("%s", "%s", %d, %s, 0)' % (title, content, random.randint(1, 10), 'now()')  # SQL statement for inserting into the database
        print sql
        cursor.execute(sql)
        print cursor.lastrowid
        self.db.commit()
    except Exception, e:
        print e
        self.db.rollback()
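Building the SQL with string formatting is why detail_page has to escape double quotes by hand, and it breaks easily on unusual content. If you prefer, MySQLdb can bind the parameters itself; this is my rewritten variant of the same function, not the author's original code:

def add_question(self, title, content):
    try:
        cursor = self.db.cursor()
        # let MySQLdb quote the values itself; no manual '"' escaping needed
        cursor.execute(
            'insert into question (title, content, user_id, created_date, comment_count) '
            'values (%s, %s, %s, now(), 0)',
            (title, content, random.randint(1, 10))
        )
        self.db.commit()
    except Exception, e:
        print e
        self.db.rollback()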

View the crawler's running results:

Then query the local database with a GUI client to confirm that the data has been saved locally.

You can import it as needed.
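If you would rather check from Python than from a GUI client, a quick sanity query looks like this (same connection parameters as the crawler; adjust them to your setup):

import MySQLdb

db = MySQLdb.connect('localhost', 'root', 'root', 'wenda', charset='utf8')
cursor = db.cursor()
cursor.execute('SELECT id, title, created_date FROM question ORDER BY id DESC LIMIT 10')
for row in cursor.fetchall():
    print row  # (id, title, created_date)
db.close()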

The crawler code was linked at the beginning of this article. If you look through that project carefully, you will also find the crawled data I uploaded. (For learning use only; please do not use it commercially!)

Of course, you will also see other crawler code there. If you think it is good, you can give it a star, or fork the project and study along with me if you are interested; it will keep being updated for a long time.

Finally:

Code:

# Created by 10412
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 20:43:00
# Project: V2EX

from pyspider.libs.base_handler import *
import re
import random
import MySQLdb


class Handler(BaseHandler):
    crawl_config = {}

    # Connect to the database
    def __init__(self):
        self.db = MySQLdb.connect('localhost', 'root', 'root', 'wenda', charset='utf8')

    def add_question(self, title, content):
        try:
            cursor = self.db.cursor()
            sql = 'insert into question(title, content, user_id, created_date, comment_count) values ("%s", "%s", %d, %s, 0)' % (title, content, random.randint(1, 10), 'now()')  # SQL statement for inserting into the database
            print sql
            cursor.execute(sql)
            print cursor.lastrowid
            self.db.commit()
        except Exception, e:
            print e
            self.db.rollback()

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('https://www.v2ex.com/', callback=self.index_page, validate_cert=False)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="https://www.v2ex.com/?tab="]').items():
            self.crawl(each.attr.href, callback=self.tab_page, validate_cert=False)

    @config(age=10 * 24 * 60 * 60)
    def tab_page(self, response):
        for each in response.doc('a[href^="https://www.v2ex.com/go/"]').items():
            self.crawl(each.attr.href, callback=self.board_page, validate_cert=False)

    @config(age=10 * 24 * 60 * 60)
    def board_page(self, response):
        for each in response.doc('a[href^="https://www.v2ex.com/t/"]').items():
            url = each.attr.href
            if url.find('#reply') > 0:
                url = url[0:url.find('#')]
            self.crawl(url, callback=self.detail_page, validate_cert=False)
        for each in response.doc('a.page_normal').items():
            self.crawl(each.attr.href, callback=self.board_page, validate_cert=False)  # automatic page turning

    @config(priority=2)
    def detail_page(self, response):
        title = response.doc('h1').text()
        content = response.doc('p.topic_content').html().replace('"', '\\"')
        self.add_question(title, content)  # insert into the database
        return {
            "url": response.url,
            "title": title,
            "content": content,
        }

That covers how a Python (PySpider) crawler crawls V2EX website posts.
