A beginner's Scrapy crawl of the wooyun.org website

Source: Internet
Author: User
Tags: xpath

I have just started learning Python for crawling data from the web, and I am still at the stage of hard-coding everything. Without further ado, here is my first crawl.

1. Create a project

1) Create a project command

scrapy startproject wooyun

This command creates a wooyun folder in the current directory.
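If the command succeeds, the generated skeleton looks roughly like this (the files edited in the following steps are items.py, settings.py, pipelines.py, and a new spider module under spiders/):

wooyun/
    scrapy.cfg
    wooyun/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py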

2) Define items.py

Scrapy provides the Item class, which holds the data crawled from a page. It is somewhat similar to Java serialization, except that deserialization turns a byte stream back into a Java object, whereas an Item is a generic container that accesses data as key/value pairs. Fields of an Item are declared with scrapy.Field(), and a declared field can hold any type, such as integers, strings, or lists.

import scrapy


class WooyunItem(scrapy.Item):
    # Fields match what the spider in step 5 stores.
    news_url = scrapy.Field()      # link to the vulnerability report
    news_title = scrapy.Field()    # report title
    news_date = scrapy.Field()     # submission date
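As a quick illustration of the key/value access described above, here is a minimal sketch (the field values are made up):

from wooyun.items import WooyunItem

item = WooyunItem()
item['news_title'] = u'Example bug title'   # declared fields work like dict keys
item['news_date'] = u'2016-01-01'
print(dict(item))                            # -> {'news_title': ..., 'news_date': ...}
# assigning to an undeclared key, e.g. item['author'], would raise KeyError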

3) Since I save the crawled data to MongoDB, I add the following to settings.py:

# Disable cookies to prevent being banned
COOKIES_ENABLED = False

ITEM_PIPELINES = {
    'wooyun.pipelines.WooyunPipeline': 300,   # pipeline priority, any value in the range 1-1000
}

MONGO_URI = "mongodb://localhost:27017/"
MONGO_DATABASE = "local"

4) Set up the pipeline in pipelines.py:

# -*- coding: utf-8 -*-
import datetime

import pymongo

# Define your item pipelines here.
# Don't forget to add your pipeline to the ITEM_PIPELINES setting.
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class WooyunPipeline(object):
    # Store each day's crawl in its own collection, e.g. wooyun_160101
    now = datetime.datetime.now()
    collection_name = "wooyun_" + now.strftime('%y%m%d')

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the MongoDB connection settings from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert(dict(item))
        return item

5) Finally, write the spider that defines the data you want to crawl:

# -*- coding: utf-8 -*-
import scrapy

from wooyun.items import WooyunItem


class WooyunSpider(scrapy.Spider):
    name = "wooyun"
    allowed_domains = ["wooyun.org"]
    start_urls = [
        "http://www.wooyun.org/bugs/page/1",
    ]

    def parse(self, response):
        news_page_num = 20
        if response.status == 200:
            # Each listing page holds 20 rows; extract url, title and date from every row.
            for j in range(1, news_page_num + 1):
                item = WooyunItem()
                item['news_url'] = response.xpath(
                    "//div[@class='content']/table[3]/tbody/tr[" + str(j) + "]/td[1]/a/@href").extract()
                item['news_title'] = response.xpath(
                    "//div[@class='content']/table[3]/tbody/tr[" + str(j) + "]/td[1]/a/text()").extract()
                item['news_date'] = response.xpath(
                    "//div[@class='content']/table[3]/tbody/tr[" + str(j) + "]/th[1]/text()").extract()
                yield item

            # Follow the remaining listing pages.
            for i in range(2, 20):
                next_page_url = "http://www.wooyun.org/bugs/page/" + str(i)
                yield scrapy.Request(next_page_url, callback=self.parse_news)

    def parse_news(self, response):
        news_page_num = 20
        if response.status == 200:
            for j in range(1, news_page_num + 1):
                item = WooyunItem()
                item['news_url'] = response.xpath(
                    "//div[@class='content']/table[3]/tbody/tr[" + str(j) + "]/td[1]/a/@href").extract()
                item['news_title'] = response.xpath(
                    "//div[@class='content']/table[3]/tbody/tr[" + str(j) + "]/td[1]/a/text()").extract()
                item['news_date'] = response.xpath(
                    "//div[@class='content']/table[3]/tbody/tr[" + str(j) + "]/th[1]/text()").extract()
                yield item
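The XPath expressions above are easiest to debug interactively; a quick way to check them (assuming the page is still reachable) is the Scrapy shell:

scrapy shell "http://www.wooyun.org/bugs/page/1"
>>> # the same selector the spider uses for the first row's title
>>> response.xpath("//div[@class='content']/table[3]/tbody/tr[1]/td[1]/a/text()").extract()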

6) Run the crawl command:

scrapy crawl wooyun
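After the crawl finishes, you can confirm that the pipeline actually wrote to MongoDB with a short pymongo check (a minimal sketch, assuming the same local MongoDB settings as in settings.py):

import datetime

import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["local"]
# The pipeline names the collection "wooyun_" + today's date.
collection_name = "wooyun_" + datetime.datetime.now().strftime('%y%m%d')

# Print a few of the stored items.
for doc in db[collection_name].find().limit(5):
    print(doc)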

Complete!


This article is from the "Month Singer" blog; please keep the source when reposting: http://727229447.blog.51cto.com/10866573/1744242
