Today we'll use the Scrapy framework to crawl the latest questions (title and URL) from Stack Overflow and save them to MongoDB, where they can be queried directly by clients.
Installation
Before starting today's task, we need to install two packages: Scrapy (1.1.0) and PyMongo (3.2.2).
Scrapy
If you are running OS X or Linux, you can install it directly with pip; Windows needs some additional dependencies, which are not covered here.
$ pip install Scrapy
Once the installation is complete, enter the following command in the Python shell; if no error occurs, the installation succeeded.
>>> import scrapy
>>>
Installing PyMongo and MongoDB
Since we are on OS X, MongoDB can be installed directly with Homebrew:
brew install mongodb
Running MongoDB is also simple; just enter the following command in the terminal:
mongod --dbpath=.
--dbpath specifies the path where the database files are stored; some files will be generated under that path once it is running.
Next we need to install PyMongo, again with pip:
$ pip install pymongo
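As a quick sanity check (a minimal sketch, assuming the mongod started above is still running on the default port 27017), you can verify from the Python shell that PyMongo can reach the server:

import pymongo

# connect to the local mongod started above (default port 27017)
connection = pymongo.MongoClient("localhost", 27017)
# server_info() raises an error if the server cannot be reached
print(connection.server_info()["version"])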
Scrapy Project
Let's create a new Scrapy project by entering the following command in the terminal:
$ scrapy startproject stack
Once the command completes, Scrapy creates the project skeleton with the basic files in place, which we can then modify as needed.
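For reference, the directory generated by scrapy startproject stack typically looks like this in Scrapy 1.1:

stack/
    scrapy.cfg            # deploy configuration
    stack/                # the project's Python module
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spiders live here
            __init__.py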
Defining data
The items.py file is used to define the "containers" that will store the objects we crawl.
A StackItem() class is predefined there and inherits from scrapy.Item:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class StackItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
Here we need to add two fields to hold the crawled title and link:
from scrapy.item import Item, Field

class StackItem(Item):
    # define the fields for your item here like:
    title = Field()
    url = Field()
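An Item behaves like a dictionary, so, as a small illustration with made-up values, the fields can be filled in and read back like this:

from stack.items import StackItem

item = StackItem()
item['title'] = 'Example question title'           # placeholder value
item['url'] = '/questions/12345/example-question'  # placeholder value
print(item['title'], item['url'])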
Creating crawlers
We need to create a stack_spider.py file under the spiders folder, which defines the crawler's behavior: it tells the crawler what we need to crawl and where to find it.
from scrapy import Spider
from scrapy.selector import Selector

from stack.items import StackItem

class StackSpider(Spider):
    name = "stack"
    allowed_domains = ['stackoverflow.com']
    start_urls = [
        "http://stackoverflow.com/questions?pagesize=50&sort=newest",
    ]
- name defines the crawler's name
- allowed_domains specifies the domains the crawler is allowed to crawl
- start_urls defines the URLs the crawler starts crawling from
XPath selection
Scrapy uses XPath to locate the corresponding data in the page source. HTML is a markup language that defines many tags and attributes; for example, given a tag like the one below, we can find it with '//div[@class="content"]' and then extract its attributes or child nodes.
<div class="content">
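As a quick, self-contained sketch (the HTML string here is made up purely for illustration), this is how a Scrapy Selector applies such an XPath expression and then reads a child node and an attribute:

from scrapy.selector import Selector

html = '<div class="content"><a href="/example">hello</a></div>'  # made-up sample markup
sel = Selector(text=html)
div = sel.xpath('//div[@class="content"]')   # match the div by its class attribute
print(div.xpath('a/text()').extract())       # child node text: ['hello']
print(div.xpath('a/@href').extract())        # attribute value: ['/example']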
Let's walk through finding an element's XPath with Chrome. Before doing anything else we need to open the developer tools: from the menu bar choose View > Developer > Developer Tools, or use the keyboard shortcut.
Once it is open, right-click the content you want and choose Inspect from the pop-up menu to jump to the corresponding location in the HTML.
Chrome automatically highlights the location for us. From the analysis we can see that each question title sits inside an h3 tag under a div with the class "summary".
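To preview the extraction before wiring it into the spider, here is a small sketch against a simplified, made-up fragment of the question list markup (the real page contains much more):

from scrapy.selector import Selector

# simplified, made-up fragment of the question list markup
html = '''
<div class="summary">
  <h3><a class="question-hyperlink" href="/questions/1/example">Example question</a></h3>
</div>
'''
sel = Selector(text=html)
for h3 in sel.xpath('//div[@class="summary"]/h3'):
    print(h3.xpath('a[@class="question-hyperlink"]/text()').extract()[0])
    print(h3.xpath('a[@class="question-hyperlink"]/@href').extract()[0])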
Now let's update the stack_spider.py script accordingly:
from scrapy import Spider
from scrapy.selector import Selector

from stack.items import StackItem

class StackSpider(Spider):
    name = "stack"
    allowed_domains = ['stackoverflow.com']
    start_urls = [
        "http://stackoverflow.com/questions?pagesize=50&sort=newest",
    ]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="summary"]/h3')
Extracting data
Now that the selection is in place, we need to populate the StackItem we just defined. Let's continue modifying the stack_spider.py file:
from scrapy import Spider
from scrapy.selector import Selector

from stack.items import StackItem

class StackSpider(Spider):
    name = "stack"
    allowed_domains = ['stackoverflow.com']
    start_urls = [
        "http://stackoverflow.com/questions?pagesize=50&sort=newest",
    ]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="summary"]/h3')
        for question in questions:
            item = StackItem()
            item['title'] = question.xpath(
                'a[@class="question-hyperlink"]/text()').extract()[0]
            item['url'] = question.xpath(
                'a[@class="question-hyperlink"]/@href').extract()[0]
            yield item
We iterate over every element that matches //div[@class="summary"]/h3 and extract the title text and link that we actually want to crawl.
Test
Now let's test it. Just run the following command in the project directory:
scrapy crawl stack
To save everything that is crawled to a file, we can append the -o and -t options:
scrapy crawl stack -o items.json -t json
The saved file contains the title and URL of each crawled question.
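The exact contents depend on whichever questions are newest at the time; with made-up placeholder values, the shape of items.json looks roughly like this:

[
    {"title": "Example question title 1", "url": "/questions/11111111/example-question-1"},
    {"title": "Example question title 2", "url": "/questions/22222222/example-question-2"}
]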
Storing elements in MongoDB
Now we need to save all the crawled items into a MongoDB collection.
Before doing that, we need to register the pipeline in settings.py and add some database parameters:
ITEM_PIPELINES = {
    'stack.pipelines.MongoDBPipeline': 300,
}

MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "stackoverflow"
MONGODB_COLLECTION = "questions"
Pipeline Management
In the previous steps we finished parsing the HTML and extracting the specified data, but at this point everything only exists in memory. We still need to store the crawled data in the database, and that is the job of pipelines.py.
We defined the database parameters above; now they finally come into use.
import pymongo

from scrapy.conf import settings
from scrapy.exceptions import DropItem
from scrapy import log

class MongoDBPipeline(object):

    def __init__(self):
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]
In the code above we create a MongoDBPipeline() class and define an initializer that reads the settings we just added and opens a Mongo connection.
Data processing
Next we need to define a function to process the parsed data:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo

from scrapy.conf import settings
from scrapy.exceptions import DropItem
from scrapy import log

class MongoDBPipeline(object):

    def __init__(self):
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        valid = True
        for data in item:
            if not data:
                valid = False
                raise DropItem('Missing{0}!'.format(data))
        if valid:
            self.collection.insert(dict(item))
            log.msg('Question added to MongoDB database!',
                    level=log.DEBUG, spider=spider)
        return item
This completes both the database connection and the storage of the corresponding data.
Test
Again, run the following command in the stack directory:
$ scrapy crawl stack
Once it finishes running without any errors, congratulations: your data has been stored correctly in MongoDB.
If we now open the database in Robomongo, we can see that a stackoverflow database has been created, that it contains a collection named questions, and that the corresponding data has been stored in it.
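You can also verify this directly from a Python shell with PyMongo (a minimal sketch, assuming the settings used above):

import pymongo

connection = pymongo.MongoClient("localhost", 27017)
db = connection["stackoverflow"]
print(db["questions"].count())     # number of stored questions
print(db["questions"].find_one())  # one stored document with its title and url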
And that's how to use Scrapy and MongoDB to develop a crawler.