Today we'll use the Scrapy framework to crawl the latest questions (title and URL) from Stack Overflow and save them to MongoDB, where they can be queried directly by clients.
Installation
Before starting today's task, we need to install two packages: Scrapy (1.1.0) and PyMongo (3.2.2).
Scrapy
If you are running OS X or Linux, you can install it directly with pip; Windows needs some additional dependencies, which are not covered here.
$ pip install Scrapy
Once the installation is complete, enter the following command in the Python shell; if no error occurs, the installation succeeded.
>>> import scrapy
>>>
Installing PyMongo and MongoDB
Since we are on OS X, MongoDB can be installed directly with Homebrew:
brew install mongodb
Running MongoDB is also simple; just enter the following command in the terminal:
mongod --dbpath=.
--dbpath specifies the path where the database files are stored; some files will be generated under that path once it is running.
Next we need to install PyMongo, again with pip:
$ pip install pymongo
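As a quick sanity check (a minimal sketch, assuming the mongod started above is still running on the default port 27017), you can verify from the Python shell that PyMongo can reach the server:

import pymongo

# connect to the local mongod started above (default port 27017)
connection = pymongo.MongoClient("localhost", 27017)
# server_info() raises an error if the server cannot be reached
print(connection.server_info()["version"])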
Scrapy Project
Let's create a new Scrapy project by entering the following command in the terminal:
$ scrapy startproject stack
Once the command completes, Scrapy creates the project skeleton with the basic files in place, which we can then modify as needed.
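For reference, the directory generated by scrapy startproject stack typically looks like this in Scrapy 1.1:

stack/
    scrapy.cfg            # deploy configuration
    stack/                # the project's Python module
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spiders live here
            __init__.py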
Defining data
The items.py file is used to define the "containers" that will store the objects we crawl.
A StackItem() class is predefined there and inherits from scrapy.Item:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class StackItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
Here we need to add two fields to hold the crawled title and link:
from scrapy.item import Item, Field

class StackItem(Item):
    # define the fields for your item here like:
    title = Field()
    url = Field()
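An Item behaves like a dictionary, so, as a small illustration with made-up values, the fields can be filled in and read back like this:

from stack.items import StackItem

item = StackItem()
item['title'] = 'Example question title'           # placeholder value
item['url'] = '/questions/12345/example-question'  # placeholder value
print(item['title'], item['url'])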
Creating crawlers
We need to create a stack_spider.py file under the spiders folder, which defines the crawler's behavior: it tells the crawler what we need to crawl and where to find it.
from scrapy import Spider
from scrapy.selector import Selector

from stack.items import StackItem

class StackSpider(Spider):
    name = "stack"
    allowed_domains = ['stackoverflow.com']
    start_urls = [
        "http://stackoverflow.com/questions?pagesize=50&sort=newest",
    ]
- name defines the crawler's name
- allowed_domains specifies the domains the crawler is allowed to crawl
- start_urls defines the URLs the crawler starts crawling from
XPath selection
Scrapy uses XPath to locate the corresponding data in the page source. HTML is a markup language that defines many tags and attributes; for example, given a tag like the one below, we can find it with '//div[@class="content"]' and then extract its attributes or child nodes.
<div class="content">
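As a quick, self-contained sketch (the HTML string here is made up purely for illustration), this is how a Scrapy Selector applies such an XPath expression and then reads a child node and an attribute:

from scrapy.selector import Selector

html = '<div class="content"><a href="/example">hello</a></div>'  # made-up sample markup
sel = Selector(text=html)
div = sel.xpath('//div[@class="content"]')   # match the div by its class attribute
print(div.xpath('a/text()').extract())       # child node text: ['hello']
print(div.xpath('a/@href').extract())        # attribute value: ['/example']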
Let's walk through finding an element's XPath with Chrome. Before doing anything else we need to open the developer tools: from the menu bar choose View > Developer > Developer Tools, or use the keyboard shortcut.
Once it is open, right-click the content you want and choose Inspect from the pop-up menu to jump to the corresponding location in the HTML.
Chrome automatically highlights the location for us. From the analysis we can see that each question title sits inside an h3 tag under a div with the class "summary".
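To preview the extraction before wiring it into the spider, here is a small sketch against a simplified, made-up fragment of the question list markup (the real page contains much more):

from scrapy.selector import Selector

# simplified, made-up fragment of the question list markup
html = '''
<div class="summary">
  <h3><a class="question-hyperlink" href="/questions/1/example">Example question</a></h3>
</div>
'''
sel = Selector(text=html)
for h3 in sel.xpath('//div[@class="summary"]/h3'):
    print(h3.xpath('a[@class="question-hyperlink"]/text()').extract()[0])
    print(h3.xpath('a[@class="question-hyperlink"]/@href').extract()[0])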
Now let's update the stack_spider.py script accordingly:
from scrapy import Spider
from scrapy.selector import Selector

from stack.items import StackItem

class StackSpider(Spider):
    name = "stack"
    allowed_domains = ['stackoverflow.com']
    start_urls = [
        "http://stackoverflow.com/questions?pagesize=50&sort=newest",
    ]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="summary"]/h3')
Extracting data
Now that the selection is in place, we need to populate the StackItem we just defined. Let's continue modifying the stack_spider.py file:
from scrapy import Spider
from scrapy.selector import Selector

from stack.items import StackItem

class StackSpider(Spider):
    name = "stack"
    allowed_domains = ['stackoverflow.com']
    start_urls = [
        "http://stackoverflow.com/questions?pagesize=50&sort=newest",
    ]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="summary"]/h3')
        for question in questions:
            item = StackItem()
            item['title'] = question.xpath(
                'a[@class="question-hyperlink"]/text()').extract()[0]
            item['url'] = question.xpath(
                'a[@class="question-hyperlink"]/@href').extract()[0]
            yield item
We iterate over every element that matches //div[@class="summary"]/h3 and extract the title text and link that we actually want to crawl.
Test
Now let's test it. Just run the following command in the project directory:
scrapy crawl stack
To save everything that is crawled to a file, we can append the -o and -t options:
scrapy crawl stack -o items.json -t json
The saved file contains the title and URL of each crawled question.
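The exact contents depend on whichever questions are newest at the time; with made-up placeholder values, the shape of items.json looks roughly like this:

[
    {"title": "Example question title 1", "url": "/questions/11111111/example-question-1"},
    {"title": "Example question title 2", "url": "/questions/22222222/example-question-2"}
]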
Storing elements in MongoDB
Now we need to save all the crawled items into a MongoDB collection.
Before doing that, we need to register the pipeline in settings.py and add some database parameters:
ITEM_PIPELINES = {
    'stack.pipelines.MongoDBPipeline': 300,
}

MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "stackoverflow"
MONGODB_COLLECTION = "questions"
Pipeline Management
In the previous steps we finished parsing the HTML and extracting the specified data, but at this point everything only exists in memory. We still need to store the crawled data in the database, and that is the job of pipelines.py.
We defined the database parameters above; now they finally come into use.
import pymongo

from scrapy.conf import settings
from scrapy.exceptions import DropItem
from scrapy import log

class MongoDBPipeline(object):

    def __init__(self):
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]
In the code above we create a MongoDBPipeline() class and define an initializer that reads the settings we just added and opens a Mongo connection.
Data processing
Next we need to define a function to process the parsed data:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo

from scrapy.conf import settings
from scrapy.exceptions import DropItem
from scrapy import log

class MongoDBPipeline(object):

    def __init__(self):
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        valid = True
        for data in item:
            if not data:
                valid = False
                raise DropItem('Missing{0}!'.format(data))
        if valid:
            self.collection.insert(dict(item))
            log.msg('Question added to MongoDB database!',
                    level=log.DEBUG, spider=spider)
        return item
This completes both the database connection and the storage of the corresponding data.
Test
Again, run the following command in the stack directory:
$ scrapy crawl stack
Once it finishes running without any errors, congratulations: your data has been stored correctly in MongoDB.
If we now open the database in Robomongo, we can see that a stackoverflow database has been created, that it contains a collection named questions, and that the corresponding data has been stored in it.
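You can also verify this directly from a Python shell with PyMongo (a minimal sketch, assuming the settings used above):

import pymongo

connection = pymongo.MongoClient("localhost", 27017)
db = connection["stackoverflow"]
print(db["questions"].count())     # number of stored questions
print(db["questions"].find_one())  # one stored document with its title and url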
And that's how to use Scrapy and MongoDB to develop a crawler.