Beginner Scrapy: crawling the wooyun.org website

I am just beginning to learn Python for crawling data from the web, and I am still at the stage of rote-learning other people's code. Without further ado, here is my first crawling trip.
1. Create a project
1) Create the project
scrapy startproject wooyun
This command creates a wooyun folder in the current directory.
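If the command succeeds, the generated project looks roughly like this (the exact files can vary a little between Scrapy versions):

wooyun/
    scrapy.cfg             # deploy configuration
    wooyun/                # the project's Python module
        __init__.py
        items.py           # item definitions (step 2)
        pipelines.py       # item pipelines (step 4)
        settings.py        # project settings (step 3)
        spiders/           # spider code lives here (step 5)
            __init__.py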
2) Define items.py
Scrapy provides the Item class, which is used to hold data crawled from a page. It is somewhat similar to a serializable Java object, except that an Item is a generic container that accesses its data as key/value pairs, like a dictionary. Fields are declared with scrapy.Field(); a declared field can hold values of any type, such as integers, strings, or lists.
import scrapy


class WooyunItem(scrapy.Item):
    commit_date = scrapy.Field()
    bug_name = scrapy.Field()
    author = scrapy.Field()
    # fields populated by the spider in step 5)
    news_url = scrapy.Field()
    news_title = scrapy.Field()
    news_date = scrapy.Field()
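Once declared, an Item behaves much like a dictionary, except that only the declared fields are allowed. A quick illustration (not part of the project files):

item = WooyunItem()
item['news_title'] = u'Some bug title'
print(item['news_title'])
print(dict(item))          # {'news_title': u'Some bug title'}
# item['foo'] = 1          # would raise KeyError: undeclared field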
3) I save the crawled data in a MongoDB database, so the following goes into settings.py
# Disable cookies to avoid getting banned
COOKIES_ENABLED = False

ITEM_PIPELINES = {
    'wooyun.pipelines.WooyunPipeline': 300,    # pipeline priority, 1-1000
}

MONGO_URI = "mongodb://localhost:27017/"
MONGO_DATABASE = "local"
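MONGO_URI and MONGO_DATABASE are not built-in Scrapy settings; they are custom names that the pipeline in the next step reads through crawler.settings. If the site throttles aggressive clients, a couple of optional settings can also help; the values below are only illustrative:

# optional politeness settings (example values, adjust as needed)
DOWNLOAD_DELAY = 1            # pause between requests, in seconds
USER_AGENT = 'Mozilla/5.0'    # send a browser-like User-Agent header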
4) Set up the pipeline in pipelines.py
# -*- coding: utf-8 -*-
import datetime

import pymongo

# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class WooyunPipeline(object):
    now = datetime.datetime.now()
    collection_name = "wooyun_" + now.strftime('%Y%m%d')

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # one MongoDB document per crawled item, in a per-day collection
        self.db[self.collection_name].insert(dict(item))
        return item
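from_crawler() reads the two MONGO_* settings defined above, open_spider() opens the MongoDB connection once per crawl, and process_item() writes every item into a collection named after the current date. Before running the crawl it is worth checking that MongoDB is actually reachable with that URI; a throwaway snippet (not part of the project) could be:

import pymongo

# same URI as in settings.py
client = pymongo.MongoClient("mongodb://localhost:27017/")
print(client.server_info()['version'])   # raises an error if MongoDB is not running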
5) Finally, write the spider that defines the data you want to crawl
# -*- coding: utf-8 -*-
import scrapy

from wooyun.items import WooyunItem


class WooyunSpider(scrapy.Spider):
    name = "wooyun"
    allowed_domains = ["wooyun.org"]
    start_urls = [
        "http://www.wooyun.org/bugs/page/1",
    ]

    def parse(self, response):
        news_page_num = 20   # 20 bug entries per list page
        if response.status == 200:
            for j in range(1, news_page_num + 1):
                item = WooyunItem()
                item['news_url'] = response.xpath(
                    "//div[@class='content']/table[3]/tbody/tr[" + str(j) + "]/td[1]/a/@href").extract()
                item['news_title'] = response.xpath(
                    "//div[@class='content']/table[3]/tbody/tr[" + str(j) + "]/td[1]/a/text()").extract()
                item['news_date'] = response.xpath(
                    "//div[@class='content']/table[3]/tbody/tr[" + str(j) + "]/th[1]/text()").extract()
                yield item

            # follow list pages 2-19 and parse them the same way
            for i in range(2, 20):
                next_page_url = "http://www.wooyun.org/bugs/page/" + str(i)
                yield scrapy.Request(next_page_url, callback=self.parse_news)

    def parse_news(self, response):
        news_page_num = 20
        if response.status == 200:
            for j in range(1, news_page_num + 1):
                item = WooyunItem()
                item['news_url'] = response.xpath(
                    "//div[@class='content']/table[3]/tbody/tr[" + str(j) + "]/td[1]/a/@href").extract()
                item['news_title'] = response.xpath(
                    "//div[@class='content']/table[3]/tbody/tr[" + str(j) + "]/td[1]/a/text()").extract()
                item['news_date'] = response.xpath(
                    "//div[@class='content']/table[3]/tbody/tr[" + str(j) + "]/th[1]/text()").extract()
                yield item
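The XPath expressions are the most fragile part of the spider; they can be tried out interactively with the Scrapy shell before running the whole spider, for example:

scrapy shell "http://www.wooyun.org/bugs/page/1"
>>> response.xpath("//div[@class='content']/table[3]/tbody/tr[1]/td[1]/a/text()").extract()

If an expression comes back empty, one common culprit is the tbody element: browsers add it when rendering, but the raw HTML that Scrapy downloads may not contain it.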
6) Run the crawl command
scrapy crawl wooyun
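If everything worked, the items end up in the local MongoDB database, in a collection named wooyun_ plus the current date (per the pipeline in step 4). A quick way to peek at the stored documents, assuming the same defaults as above:

import datetime

import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["local"]
collection = "wooyun_" + datetime.datetime.now().strftime('%Y%m%d')
for doc in db[collection].find().limit(3):   # show a few crawled items
    print(doc)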
Complete!!!!!!!!!!!
This article is from the "Month Singer" blog; please keep this source when reproducing it: http://727229447.blog.51cto.com/10866573/1744242