How Python crawlers crawl V2EX website posts

Background:

PySpider: a powerful web crawler system written in Python by Chinese developers, with a powerful WebUI. It has a distributed architecture and supports multiple database backends. The WebUI includes a script editor, task monitor, project manager, and result viewer. Online example: http://demo.pyspider.org/

Official documentation: http://docs.pyspider.org/en/l...

Github: https://github.com/binux/pysp...

The GitHub address of this crawler's code: https://github.com/zhisheng17...

More articles can be found on my public account, yuanblog; please follow it.

Now, let's get to the main content!

Prerequisites:

You have installed PySpider and MySQL-python (used to save the data).

If you haven't installed them yet, read my previous articles first to avoid detours; they cover some of the errors I encountered:

Some pitfalls of PySpider framework learning

HTTP 599: SSL certificate problem: unable to get local issuer certificate error

Replace the URL in the self.crawl call of the on_start function:

@every(minutes=24 * 60)
def on_start(self):
    self.crawl('https://www.v2ex.com/', callback=self.index_page, validate_cert=False)

self.crawl tells PySpider to fetch the specified page and then parse the result with the callback function.

The @every decorator indicates that on_start will be executed once a day, so the latest posts can always be captured.

validate_cert=False is required; otherwise the "HTTP 599: SSL certificate problem: unable to get local issuer certificate" error will be reported.
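As a side note, if you do not want to repeat validate_cert=False in every self.crawl call, PySpider's crawl_config can hold default parameters for the whole project. A minimal sketch, assuming the rest of the handler stays as written below:

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    # default self.crawl parameters for the whole project, so each call
    # no longer needs validate_cert=False individually
    crawl_config = {
        'validate_cert': False,
    }

    @every(minutes=24 * 60)
    def on_start(self):
        # same as above, minus the per-call validate_cert argument;
        # index_page and the other callbacks are unchanged
        self.crawl('https://www.v2ex.com/', callback=self.index_page)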

Home page:

Click the green "run" button and you will see a red 1 appear on follows. Switch to the follows panel and click the green play button:

On the tab list page, we need to extract the URLs of all the topic list pages. As you may have noticed, the sample handler has already extracted a large number of URLs.

Code:

@config(age=10 * 24 * 60 * 60)
def index_page(self, response):
    for each in response.doc('a[href^="https://www.v2ex.com/?tab="]').items():
        self.crawl(each.attr.href, callback=self.tab_page, validate_cert=False)

Because the post list pages do not look the same as the tab list page, a new callback, self.tab_page, is created here.

@config(age=10 * 24 * 60 * 60) indicates that the page is considered valid for 10 days and will not be re-crawled within that period.
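response.doc is a PyQuery object, so if you want to check what a CSS selector will match before putting it into the handler, you can experiment with pyquery directly. A rough sketch follows; the HTML snippet is made up purely for illustration:

from pyquery import PyQuery as pq

# made-up HTML purely for illustration
html = '''
<a href="https://www.v2ex.com/?tab=tech">Tech</a>
<a href="https://www.v2ex.com/?tab=creative">Creative</a>
<a href="https://www.v2ex.com/about">About</a>
'''
doc = pq(html)
# same selector as index_page: only links whose href starts with the tab URL
for each in doc('a[href^="https://www.v2ex.com/?tab="]').items():
    print each.attr.href  # prints the two ?tab= links, skips /about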

Go List page:

Code:

@config(age=10 * 24 * 60 * 60)
def tab_page(self, response):
    for each in response.doc('a[href^="https://www.v2ex.com/go/"]').items():
        self.crawl(each.attr.href, callback=self.board_page, validate_cert=False)

Post detail page (/t/):

As you can see, some reply links appear in the results; we can remove them since we do not need them.

At the same time, we also need it to turn pages automatically.

Code:

@config(age=10 * 24 * 60 * 60)
def board_page(self, response):
    for each in response.doc('a[href^="https://www.v2ex.com/t/"]').items():
        url = each.attr.href
        if url.find('#reply') > 0:
            url = url[0:url.find('#')]
        self.crawl(url, callback=self.detail_page, validate_cert=False)
    for each in response.doc('a.page_normal').items():
        self.crawl(each.attr.href, callback=self.board_page, validate_cert=False)  # automatic page turning
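The url.find('#reply') logic above simply drops the URL fragment so the same topic is not crawled twice. For reference, the standard library can do the same thing; this small standalone sketch is not part of the original code:

from urlparse import urldefrag  # urllib.parse on Python 3

# example URL with a reply anchor
url, fragment = urldefrag('https://www.v2ex.com/t/123456#reply12')
print url       # https://www.v2ex.com/t/123456
print fragment  # reply12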

Run it again after removing them:

Code:

@config(priority=2)
def detail_page(self, response):
    title = response.doc('h1').text()
    content = response.doc('p.topic_content').html().replace('"', '\\"')
    self.add_question(title, content)  # insert into the database
    return {
        "url": response.url,
        "title": title,
        "content": content,
    }

Before inserting into the database, we need to define an add_question function.
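add_question writes into a question table in a local MySQL database named wenda. The article does not show the table schema, so here is a guessed minimal one, inferred from the columns used in the INSERT statement below (column names and types are my assumption; adjust them to your own setup):

import MySQLdb

# guessed schema, not from the original article; adjust names and types as needed
db = MySQLdb.connect('localhost', 'root', 'root', 'wenda', charset='utf8')
cursor = db.cursor()
cursor.execute('''
    CREATE TABLE IF NOT EXISTS question (
        id INT AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(255),
        content TEXT,
        user_id INT,
        created_date DATETIME,
        comment_count INT DEFAULT 0
    ) DEFAULT CHARSET=utf8
''')
db.commit()
db.close()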

# Connect to the database
def __init__(self):
    self.db = MySQLdb.connect('localhost', 'root', 'root', 'wenda', charset='utf8')

def add_question(self, title, content):
    try:
        cursor = self.db.cursor()
        sql = 'insert into question(title, content, user_id, created_date, comment_count) values ("%s", "%s", %d, %s, 0)' % (title, content, random.randint(1, 10), 'now()')  # SQL statement for inserting into the database
        print sql
        cursor.execute(sql)
        print cursor.lastrowid
        self.db.commit()
    except Exception, e:
        print e
        self.db.rollback()
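Building the SQL with string formatting is why detail_page has to escape double quotes by hand, and it breaks easily on unusual content. If you prefer, MySQLdb can bind the parameters itself; this is my rewritten variant of the same function, not the author's original code:

def add_question(self, title, content):
    try:
        cursor = self.db.cursor()
        # let MySQLdb quote the values itself; no manual '"' escaping needed
        cursor.execute(
            'insert into question (title, content, user_id, created_date, comment_count) '
            'values (%s, %s, %s, now(), 0)',
            (title, content, random.randint(1, 10))
        )
        self.db.commit()
    except Exception, e:
        print e
        self.db.rollback()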

View the crawler's running results:

Then query the local database with a GUI client to confirm that the data has been saved locally.

You can import it as needed.
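If you would rather check from Python than from a GUI client, a quick sanity query looks like this (same connection parameters as the crawler; adjust them to your setup):

import MySQLdb

db = MySQLdb.connect('localhost', 'root', 'root', 'wenda', charset='utf8')
cursor = db.cursor()
cursor.execute('SELECT id, title, created_date FROM question ORDER BY id DESC LIMIT 10')
for row in cursor.fetchall():
    print row  # (id, title, created_date)
db.close()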

The crawler code was linked at the beginning of this article. If you look through that project carefully, you will also find the crawled data I uploaded. (For learning use only; please do not use it commercially!)

Of course, you will also see other crawler code there. If you think it is good, you can give it a star, or fork the project and study along with me if you are interested; it will keep being updated for a long time.

Finally:

Code:

# Created by 10412
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 20:43:00
# Project: V2EX

from pyspider.libs.base_handler import *
import re
import random
import MySQLdb


class Handler(BaseHandler):
    crawl_config = {}

    # Connect to the database
    def __init__(self):
        self.db = MySQLdb.connect('localhost', 'root', 'root', 'wenda', charset='utf8')

    def add_question(self, title, content):
        try:
            cursor = self.db.cursor()
            sql = 'insert into question(title, content, user_id, created_date, comment_count) values ("%s", "%s", %d, %s, 0)' % (title, content, random.randint(1, 10), 'now()')  # SQL statement for inserting into the database
            print sql
            cursor.execute(sql)
            print cursor.lastrowid
            self.db.commit()
        except Exception, e:
            print e
            self.db.rollback()

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('https://www.v2ex.com/', callback=self.index_page, validate_cert=False)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="https://www.v2ex.com/?tab="]').items():
            self.crawl(each.attr.href, callback=self.tab_page, validate_cert=False)

    @config(age=10 * 24 * 60 * 60)
    def tab_page(self, response):
        for each in response.doc('a[href^="https://www.v2ex.com/go/"]').items():
            self.crawl(each.attr.href, callback=self.board_page, validate_cert=False)

    @config(age=10 * 24 * 60 * 60)
    def board_page(self, response):
        for each in response.doc('a[href^="https://www.v2ex.com/t/"]').items():
            url = each.attr.href
            if url.find('#reply') > 0:
                url = url[0:url.find('#')]
            self.crawl(url, callback=self.detail_page, validate_cert=False)
        for each in response.doc('a.page_normal').items():
            self.crawl(each.attr.href, callback=self.board_page, validate_cert=False)  # automatic page turning

    @config(priority=2)
    def detail_page(self, response):
        title = response.doc('h1').text()
        content = response.doc('p.topic_content').html().replace('"', '\\"')
        self.add_question(title, content)  # insert into the database
        return {
            "url": response.url,
            "title": title,
            "content": content,
        }

That covers how a Python (PySpider) crawler crawls V2EX website posts.
