A beginner's Scrapy crawl of the wooyun.org website

Source: Internet
Author: User
Tags: xpath

I have just started learning Python for crawling data from the web, and I am still at the stage of hard-coding everything. Without further ado, here is my first crawl.

1. Create a project

1) Create a project command

scrapy startproject wooyun

This command creates a wooyun folder in the current directory.
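If the command succeeds, the generated skeleton looks roughly like this (the files edited in the following steps are items.py, settings.py, pipelines.py, and a new spider module under spiders/):

wooyun/
    scrapy.cfg
    wooyun/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py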

2) Define items.py

Scrapy provides the Item class, which holds the data crawled from a page. It is somewhat similar to Java serialization, except that deserialization turns a byte stream back into a Java object, whereas an Item is a generic container that accesses data as key/value pairs. Fields of an Item are declared with scrapy.Field(), and a declared field can hold any type, such as integers, strings, or lists.

import scrapy


class WooyunItem(scrapy.Item):
    # Fields match what the spider in step 5 stores.
    news_url = scrapy.Field()      # link to the vulnerability report
    news_title = scrapy.Field()    # report title
    news_date = scrapy.Field()     # submission date
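As a quick illustration of the key/value access described above, here is a minimal sketch (the field values are made up):

from wooyun.items import WooyunItem

item = WooyunItem()
item['news_title'] = u'Example bug title'   # declared fields work like dict keys
item['news_date'] = u'2016-01-01'
print(dict(item))                            # -> {'news_title': ..., 'news_date': ...}
# assigning to an undeclared key, e.g. item['author'], would raise KeyError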

3) Since I save the crawled data to MongoDB, I add the following to settings.py:

# Disable cookies to prevent being banned
COOKIES_ENABLED = False

ITEM_PIPELINES = {
    'wooyun.pipelines.WooyunPipeline': 300,   # pipeline priority, any value in the range 1-1000
}

MONGO_URI = "mongodb://localhost:27017/"
MONGO_DATABASE = "local"

4) Set up the pipeline in pipelines.py:

# -*- coding: utf-8 -*-
import datetime

import pymongo

# Define your item pipelines here.
# Don't forget to add your pipeline to the ITEM_PIPELINES setting.
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class WooyunPipeline(object):
    # Store each day's crawl in its own collection, e.g. wooyun_160101
    now = datetime.datetime.now()
    collection_name = "wooyun_" + now.strftime('%y%m%d')

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the MongoDB connection settings from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert(dict(item))
        return item

5) Finally, write the spider that defines the data you want to crawl:

# -*- coding: utf-8 -*-
import scrapy

from wooyun.items import WooyunItem


class WooyunSpider(scrapy.Spider):
    name = "wooyun"
    allowed_domains = ["wooyun.org"]
    start_urls = [
        "http://www.wooyun.org/bugs/page/1",
    ]

    def parse(self, response):
        news_page_num = 20
        if response.status == 200:
            # Each listing page holds 20 rows; extract url, title and date from every row.
            for j in range(1, news_page_num + 1):
                item = WooyunItem()
                item['news_url'] = response.xpath(
                    "//div[@class='content']/table[3]/tbody/tr[" + str(j) + "]/td[1]/a/@href").extract()
                item['news_title'] = response.xpath(
                    "//div[@class='content']/table[3]/tbody/tr[" + str(j) + "]/td[1]/a/text()").extract()
                item['news_date'] = response.xpath(
                    "//div[@class='content']/table[3]/tbody/tr[" + str(j) + "]/th[1]/text()").extract()
                yield item

            # Follow the remaining listing pages.
            for i in range(2, 20):
                next_page_url = "http://www.wooyun.org/bugs/page/" + str(i)
                yield scrapy.Request(next_page_url, callback=self.parse_news)

    def parse_news(self, response):
        news_page_num = 20
        if response.status == 200:
            for j in range(1, news_page_num + 1):
                item = WooyunItem()
                item['news_url'] = response.xpath(
                    "//div[@class='content']/table[3]/tbody/tr[" + str(j) + "]/td[1]/a/@href").extract()
                item['news_title'] = response.xpath(
                    "//div[@class='content']/table[3]/tbody/tr[" + str(j) + "]/td[1]/a/text()").extract()
                item['news_date'] = response.xpath(
                    "//div[@class='content']/table[3]/tbody/tr[" + str(j) + "]/th[1]/text()").extract()
                yield item
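The XPath expressions above are easiest to debug interactively; a quick way to check them (assuming the page is still reachable) is the Scrapy shell:

scrapy shell "http://www.wooyun.org/bugs/page/1"
>>> # the same selector the spider uses for the first row's title
>>> response.xpath("//div[@class='content']/table[3]/tbody/tr[1]/td[1]/a/text()").extract()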

6) Run the crawl command:

scrapy crawl wooyun
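After the crawl finishes, you can confirm that the pipeline actually wrote to MongoDB with a short pymongo check (a minimal sketch, assuming the same local MongoDB settings as in settings.py):

import datetime

import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["local"]
# The pipeline names the collection "wooyun_" + today's date.
collection_name = "wooyun_" + datetime.datetime.now().strftime('%y%m%d')

# Print a few of the stored items.
for doc in db[collection_name].find().limit(5):
    print(doc)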

Complete!


This article is from the "Month Singer" blog; please keep the source when reposting: http://727229447.blog.51cto.com/10866573/1744242
