Background: PySpider is a powerful web crawler system with a powerful WebUI, developed in China. It is written in Python, has a distributed architecture, and supports multiple database backends. The WebUI includes a script editor, task monitor, project manager, and result viewer. Online demo: http://demo.pyspider.org/
Official documentation: http://docs.pyspider.org/en/l...
GitHub: https://github.com/binux/pysp...
This crawler's code on GitHub: https://github.com/zhisheng17...
More articles are available on the WeChat public account yuanblog; feel free to follow it.
Now let's get to the article!
Prerequisites:
You have installed PySpider and MySQL-python (used to save the data).
If you haven't installed them yet, read my previous articles first to avoid detours:
Some pitfalls of Pyspider framework learning
HTTP 599: SSL certificate problem: unable to get local issuer certificate error
Some errors I encountered:
Replace the URL in the self.crawl call inside the on_start function:
@every(minutes=24 * 60)
def on_start(self):
    self.crawl('https://www.v2ex.com/', callback=self.index_page, validate_cert=False)
self.crawl tells PySpider to fetch the specified page and parse the result with the given callback function.
The @every decorator indicates that on_start will be executed once a day, so the latest posts are always captured.
validate_cert=False disables certificate validation; without it, the request fails with "HTTP 599: SSL certificate problem: unable to get local issuer certificate".
Home page:
Click the green "run" button, and a red "1" appears on the follows tab. Switch to the follows panel and click the green play button:
On the tab list page, we need to extract the URLs of all topic list pages. You may have noticed that the sample handler has already extracted a large number of URLs.
Code:
@config(age=10 * 24 * 60 * 60)
def index_page(self, response):
    for each in response.doc('a[href^="https://www.v2ex.com/?tab="]').items():
        self.crawl(each.attr.href, callback=self.tab_page, validate_cert=False)
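response.doc is a PyQuery object, and the selector a[href^="https://www.v2ex.com/?tab="] matches anchor tags whose href starts with that prefix. As a rough stdlib-only illustration of the same prefix filtering (PrefixLinkExtractor is a hypothetical helper written for this sketch, not part of PySpider), in Python 3:

```python
from html.parser import HTMLParser

class PrefixLinkExtractor(HTMLParser):
    """Collect href values of <a> tags whose href starts with a given prefix."""
    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href') or ''
            if href.startswith(self.prefix):
                self.links.append(href)

html = '''
<a href="https://www.v2ex.com/?tab=tech">Tech</a>
<a href="https://www.v2ex.com/about">About</a>
<a href="https://www.v2ex.com/?tab=creative">Creative</a>
'''
parser = PrefixLinkExtractor('https://www.v2ex.com/?tab=')
parser.feed(html)
print(parser.links)  # only the two tab links match the prefix
```

In the real handler, PyQuery does this matching for us and .items() yields each matched element.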
Because the post list pages are parsed differently from the tab list page, a new callback, self.tab_page, is set for them.
@config(age=10 * 24 * 60 * 60) indicates that the page is considered valid for 10 days and will not be re-crawled within that period.
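The age parameter is given in seconds, which is why 10 days is written as 10 * 24 * 60 * 60. A quick sanity check of that arithmetic (days_to_seconds is a hypothetical helper for illustration, not a PySpider API):

```python
def days_to_seconds(days):
    # age is in seconds: days * 24 hours * 60 minutes * 60 seconds
    return days * 24 * 60 * 60

print(days_to_seconds(10))  # 864000
```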
Go List page:
Code:
@config(age=10 * 24 * 60 * 60)
def tab_page(self, response):
    for each in response.doc('a[href^="https://www.v2ex.com/go/"]').items():
        self.crawl(each.attr.href, callback=self.board_page, validate_cert=False)
Post detail pages (t/):
As you can see, some reply anchors (#reply) appear in the results; we can strip them since we don't need them.
We also need to make the handler follow pagination links automatically.
Code:
@config(age=10 * 24 * 60 * 60)
def board_page(self, response):
    for each in response.doc('a[href^="https://www.v2ex.com/t/"]').items():
        url = each.attr.href
        if url.find('#reply') > 0:
            url = url[0:url.find('#')]  # drop the reply anchor
        self.crawl(url, callback=self.detail_page, validate_cert=False)
    for each in response.doc('a.page_normal').items():
        self.crawl(each.attr.href, callback=self.board_page, validate_cert=False)  # automatic pagination
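The anchor-stripping step above can be factored into a small pure function, which makes it easy to check on its own (strip_reply_anchor is a hypothetical name introduced for this sketch, not part of the original handler; Python 3):

```python
def strip_reply_anchor(url):
    """Drop a '#reply...' fragment so each topic maps to one canonical URL."""
    if url.find('#reply') > 0:
        return url[0:url.find('#')]
    return url

print(strip_reply_anchor('https://www.v2ex.com/t/123456#reply10'))
# -> https://www.v2ex.com/t/123456
print(strip_reply_anchor('https://www.v2ex.com/t/123456'))
# -> https://www.v2ex.com/t/123456 (unchanged)
```

Without this step, the same topic would be crawled once per reply link, producing duplicate tasks and duplicate rows.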
Run again after stripping the reply anchors:
Code:
@config(priority=2)
def detail_page(self, response):
    title = response.doc('h1').text()
    content = response.doc('p.topic_content').html().replace('"', '\\"')  # escape quotes for the SQL string
    self.add_question(title, content)  # insert into the database
    return {
        "url": response.url,
        "title": title,
        "content": content,
    }
Before inserting into the database, we need to define an add_question function.
# Connect to the database
def __init__(self):
    self.db = MySQLdb.connect('localhost', 'root', 'root', 'wenda', charset='utf8')

def add_question(self, title, content):
    try:
        cursor = self.db.cursor()
        # SQL insert statement
        sql = 'insert into question(title, content, user_id, created_date, comment_count) values ("%s", "%s", %d, %s, 0)' % (title, content, random.randint(1, 10), 'now()')
        print sql
        cursor.execute(sql)
        print cursor.lastrowid
        self.db.commit()
    except Exception, e:
        print e
        self.db.rollback()
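Note that interpolating title and content straight into the SQL string is fragile (a quote in a post title breaks the statement, which is why detail_page escapes quotes) and open to SQL injection. A safer pattern is a parameterized query, where the driver handles quoting. The sketch below is Python 3 and uses the stdlib sqlite3 module purely for illustration, with an in-memory stand-in for the article's MySQL "wenda" schema (the column list is taken from the article, the rest is assumption); MySQLdb/pymysql cursors accept placeholders the same way, using %s instead of ?:

```python
import random
import sqlite3

# In-memory table standing in for the MySQL 'question' table (assumed schema).
db = sqlite3.connect(':memory:')
db.execute('''CREATE TABLE question (
    title TEXT, content TEXT, user_id INTEGER,
    created_date TEXT, comment_count INTEGER)''')

def add_question(title, content):
    """Insert a row using placeholders; no manual quoting or escaping needed."""
    try:
        cursor = db.cursor()
        cursor.execute(
            "INSERT INTO question (title, content, user_id, created_date, comment_count) "
            "VALUES (?, ?, ?, datetime('now'), 0)",
            (title, content, random.randint(1, 10)))
        db.commit()
        return cursor.lastrowid
    except Exception:
        db.rollback()
        raise

# A title containing quotes inserts cleanly, no .replace('"', ...) required.
rowid = add_question('A "quoted" title', 'body text')
print(rowid)  # 1
```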
View crawler running results:
Then open the database in a local GUI client, and you can see that the data has been saved locally.
You can import it as needed.
At the beginning I shared the crawler code; if you look through the project, you will also find the crawled data I uploaded. (For learning use only; do not use it commercially!)
Of course, you will also see other crawler code there. If you think it is good, give it a star, or if you are interested, fork the project and study it with me; it will be maintained for a long time.
Finally:
Code:
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created by 10412
# Created on 20:43:00
# Project: V2EX

from pyspider.libs.base_handler import *
import re
import random
import MySQLdb


class Handler(BaseHandler):
    crawl_config = {}

    def __init__(self):
        self.db = MySQLdb.connect('localhost', 'root', 'root', 'wenda', charset='utf8')

    def add_question(self, title, content):
        try:
            cursor = self.db.cursor()
            sql = 'insert into question(title, content, user_id, created_date, comment_count) values ("%s", "%s", %d, %s, 0)' % (title, content, random.randint(1, 10), 'now()')
            print sql
            cursor.execute(sql)
            print cursor.lastrowid
            self.db.commit()
        except Exception, e:
            print e
            self.db.rollback()

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('https://www.v2ex.com/', callback=self.index_page, validate_cert=False)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="https://www.v2ex.com/?tab="]').items():
            self.crawl(each.attr.href, callback=self.tab_page, validate_cert=False)

    @config(age=10 * 24 * 60 * 60)
    def tab_page(self, response):
        for each in response.doc('a[href^="https://www.v2ex.com/go/"]').items():
            self.crawl(each.attr.href, callback=self.board_page, validate_cert=False)

    @config(age=10 * 24 * 60 * 60)
    def board_page(self, response):
        for each in response.doc('a[href^="https://www.v2ex.com/t/"]').items():
            url = each.attr.href
            if url.find('#reply') > 0:
                url = url[0:url.find('#')]
            self.crawl(url, callback=self.detail_page, validate_cert=False)
        for each in response.doc('a.page_normal').items():
            self.crawl(each.attr.href, callback=self.board_page, validate_cert=False)

    @config(priority=2)
    def detail_page(self, response):
        title = response.doc('h1').text()
        content = response.doc('p.topic_content').html().replace('"', '\\"')
        self.add_question(title, content)
        return {
            "url": response.url,
            "title": title,
            "content": content,
        }
The above shows how to crawl V2EX posts with a Python crawler. For more information, see the PHP Chinese website (www.php1.cn)!