Installing and Deploying Scrapy
Before installing Scrapy, make sure Python is already installed (Scrapy currently supports Python 2.5, 2.6, and 2.7). The official documentation describes three ways to install it; I used easy_install. First download the Windows version of Setuptools (download address: http://pypi.python.org/pypi/setuptools) and click Next all the way through the installer.
After Setuptools is installed, open cmd and run the command:

easy_install -U Scrapy
You can also choose to install with pip (pip's address: http://pypi.python.org/pypi/pip). The command to install Scrapy with pip is:

pip install Scrapy
If Visual Studio 2008 or Visual Studio 2010 was already installed on your machine, the Scrapy installation is now complete. If instead you get the error "Unable to find vcvarsall.bat", there is a bit more work to do. You can either install Visual Studio, or work around it as follows:
- First install MinGW (MinGW download address: http://sourceforge.net/projects/mingw/files/). Locate the bin folder in the MinGW installation directory, find mingw32-make.exe, and make a copy of it named make.exe.
- Add the MinGW path to the PATH environment variable. For example, if MinGW is installed in D:\MinGW\, add D:\MinGW\bin to PATH.
- Open a command-line window and change into the directory of the package you want to install.
- Run setup.py install build --compiler=mingw32 to install it.
If "Xslt-config" is not an internal or external command, it is not a program or batch file that can be run. "Error, the main reason is lxml installation is not successful, as long as the http://pypi.python.org/simple/lxml/download exe file to install it.
With the environment ready, we can get down to business.
New Project
Let's use a crawler to fetch the Top 250 movies from Douban Movies. Before we begin, we create a new Scrapy project. Since I am on Windows 7, I change into the directory where I want to keep the code in cmd, and then run:
D:\web\python>scrapy startproject doubanmoive
This command creates a new directory doubanmoive in the current directory, with the following directory structure:
D:\web\python\doubanmoive>tree /f
Folder PATH listing for volume Data
Volume serial number is 00000200 34EC:9CB9
D:.
│  scrapy.cfg
│
└─doubanmoive
    │  items.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    └─spiders
            __init__.py
These files are, respectively:
- doubanmoive/items.py: Defines the content field to get, similar to the entity class.
- doubanmoive/pipelines.py: Project pipeline file to handle data crawled by spiders.
- doubanmoive/settings.py: Project configuration file.
- doubanmoive/spiders: The directory where spiders are placed.
Defining the Item
The Item is the container used to hold the scraped data, comparable to an entity class in Java. Open doubanmoive/items.py and you can see the following code, created by default:
from scrapy.item import Item, Field

class DoubanmoiveItem(Item):
    pass
We only need to add the fields we want to scrape to the DoubanmoiveItem class, e.g. name = Field(). The final code, matching our requirements, is below.
from scrapy.item import Item, Field

class DoubanmoiveItem(Item):
    name = Field()            # movie title
    year = Field()            # release year
    score = Field()           # Douban score
    director = Field()        # director
    classification = Field()  # genre
    actor = Field()           # actor
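As a quick illustration of how this Item behaves (a throwaway snippet for demonstration only, not part of the project files): it works much like a dictionary, except that only declared fields may be assigned.

from doubanmoive.items import DoubanmoiveItem

item = DoubanmoiveItem()
item['name'] = [u'The Shawshank Redemption']  # fields hold whatever extract() returns, i.e. a list
print item['name']

try:
    item['rating'] = 9.6      # 'rating' was never declared as a Field above...
except KeyError as e:
    print e                   # ...so Scrapy rejects it with a KeyError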
Writing the Spider
The spider is the core class of the whole project; in it we define the objects to crawl (domains, URLs) and the crawling rules. The tutorials in the official Scrapy documentation are based on BaseSpider, but BaseSpider can only crawl a given list of URLs and cannot expand outward from an initial URL. Besides BaseSpider, however, there are other classes you can inherit from directly, such as scrapy.contrib.spiders.CrawlSpider.
Create a new file named moive_spider.py in the doubanmoive/spiders directory and fill in the following code.
# -*- coding: utf-8 -*-
from scrapy.selector import Selector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from doubanmoive.items import DoubanmoiveItem

class MoiveSpider(CrawlSpider):
    name = "doubanmoive"
    allowed_domains = ["movie.douban.com"]
    start_urls = ["http://movie.douban.com/top250"]
    rules = [
        Rule(SgmlLinkExtractor(allow=(r'http://movie.douban.com/top250\?start=\d+.*'))),
        Rule(SgmlLinkExtractor(allow=(r'http://movie.douban.com/subject/\d+')), callback="parse_item"),
    ]

    def parse_item(self, response):
        sel = Selector(response)
        item = DoubanmoiveItem()
        item['name'] = sel.xpath('//*[@id="content"]/h1/span[1]/text()').extract()
        item['year'] = sel.xpath('//*[@id="content"]/h1/span[2]/text()').re(r'\((\d+)\)')
        item['score'] = sel.xpath('//*[@id="interest_sectl"]/div/p[1]/strong/text()').extract()
        item['director'] = sel.xpath('//*[@id="info"]/span[1]/a/text()').extract()
        item['classification'] = sel.xpath('//span[@property="v:genre"]/text()').extract()
        item['actor'] = sel.xpath('//*[@id="info"]/span[3]/a[1]/text()').extract()
        return item
Code explanation: MoiveSpider inherits from Scrapy's CrawlSpider. The purpose of name, allowed_domains, and start_urls is clear from their names; rules is slightly more involved: it defines the crawling rules for URLs, and links matching the allow regular expression are added to the scheduler. By analyzing the pagination URLs of Douban Movie Top250, such as http://movie.douban.com/top250?start=25&filter=&type=, we arrive at the following rule:
Rule(SgmlLinkExtractor(allow=(r'http://movie.douban.com/top250\?start=\d+.*'))),
The pages we actually want to scrape are the detail pages of each movie. For example, the link for The Shawshank Redemption is http://movie.douban.com/subject/1292052/, and only the number after subject changes from movie to movie, which gives the regular expression in the code below. For links of this type we want to extract content, so we also add the callback attribute, which hands the response to the parse_item function for processing.
Rule(SgmlLinkExtractor(allow=(r'http://movie.douban.com/subject/\d+')), callback="parse_item"),
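To convince yourself that the two patterns divide the work as intended, you can test them against sample URLs outside of Scrapy (a throwaway check using plain re, since the allow patterns are ordinary regular expressions applied to candidate URLs):

# -*- coding: utf-8 -*-
import re

listing = re.compile(r'http://movie.douban.com/top250\?start=\d+.*')
detail = re.compile(r'http://movie.douban.com/subject/\d+')

print bool(listing.match('http://movie.douban.com/top250?start=25&filter=&type='))  # True: followed for more links
print bool(detail.match('http://movie.douban.com/subject/1292052/'))                # True: handed to parse_item
print bool(detail.match('http://movie.douban.com/top250?start=25&filter=&type='))   # False: no callback triggered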
The processing logic in parse_item is very simple: for a link that matched the rule, extract the content according to certain rules, assign it to the item, and return the item, which is then picked up by the item pipeline. To get the contents of most tags we do not need to write complex regular expressions; we can use XPath instead. XPath is a language for finding information in XML documents, but it can also be used with HTML. The table below lists the common expressions.
| Expression | Description |
| --- | --- |
| nodename | Selects all child nodes of the named node. |
| / | Selects from the root node. |
| // | Selects nodes in the document from the current node that match the selection, no matter where they are located. |
| . | Selects the current node. |
| .. | Selects the parent of the current node. |
| @ | Selects attributes. |
For example, //*[@id="content"]/h1/span[1]/text() gets the text content of the first span under the h1 under any element whose id is content. We can use the Chrome developer tools (F12) to obtain an XPath expression for a piece of content: right-click the content you want to scrape and choose Inspect Element; the developer tools open at the bottom with the element located; then right-click the element in the tools and choose Copy XPath.
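A handy way to experiment with such expressions before wiring them into the spider is the Scrapy shell. For example (the selector object is exposed as sel in the Scrapy releases this article targets; older versions call it hxs):

D:\web\python>scrapy shell http://movie.douban.com/subject/1292052/
>>> sel.xpath('//*[@id="content"]/h1/span[1]/text()').extract()   # a list containing the movie title
>>> sel.xpath('//span[@property="v:genre"]/text()').extract()     # a list of genre strings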
Storing data
Once the crawler has acquired the data, we need to store it in a database. As mentioned earlier, this work is handled by the project pipeline (pipeline), which typically performs the following operations:
- Cleaning HTML Data
- Verify the data that is parsed (check whether the project contains the necessary fields)
- Check for duplicate data (and drop duplicates; a small example follows this list)
- Store the parsed data in the database
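As an illustration of the duplicate-check step, a pipeline for it can be as small as the following sketch (not part of this project; the project's MySQL pipeline below does its de-duplication against the database instead):

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
    """Remembers every movie name seen during the run and discards repeats."""
    def __init__(self):
        self.names_seen = set()

    def process_item(self, item, spider):
        name = item['name'][0]
        if name in self.names_seen:
            raise DropItem("Duplicate item found: %s" % name)
        self.names_seen.add(name)
        return item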
Since the data we scrape comes in a variety of formats, some of which are not convenient to store in a relational database, I wrote a MongoDB version of the pipeline after finishing the MySQL one.
MySQL version:
# -*- coding: utf-8 -*-
from scrapy import log
from twisted.enterprise import adbapi
from scrapy.http import Request

import MySQLdb
import MySQLdb.cursors


class DoubanmoivePipeline(object):
    def __init__(self):
        self.dbpool = adbapi.ConnectionPool('MySQLdb',
                db='python',
                user='root',
                passwd='root',
                cursorclass=MySQLdb.cursors.DictCursor,
                charset='utf8',
                use_unicode=False)

    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        query.addErrback(self.handle_error)
        return item

    def _conditional_insert(self, tx, item):
        tx.execute("select * from doubanmoive where m_name= %s", (item['name'][0],))
        result = tx.fetchone()
        log.msg(result, level=log.DEBUG)
        print result
        if result:
            log.msg("Item already stored in db: %s" % item, level=log.DEBUG)
        else:
            classification = actor = ''
            lenClassification = len(item['classification'])
            lenActor = len(item['actor'])
            # NOTE: the original listing was cut off partway through the loop below;
            # the remainder is a reconstruction (join both lists with '/' and insert
            # the row), and the column names other than m_name are assumptions.
            for n in xrange(lenClassification):
                classification += item['classification'][n]
                if n < lenClassification - 1:
                    classification += '/'
            for n in xrange(lenActor):
                actor += item['actor'][n]
                if n < lenActor - 1:
                    actor += '/'
            tx.execute(
                "insert into doubanmoive (m_name, m_year, m_score, m_director, m_classification, m_actor) "
                "values (%s, %s, %s, %s, %s, %s)",
                (item['name'][0], item['year'][0], item['score'][0], item['director'][0], classification, actor))
            log.msg("Item stored in db: %s" % item, level=log.DEBUG)

    def handle_error(self, e):
        log.err(e)
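Note that the MySQL pipeline assumes a doubanmoive table already exists in the python database. The original article does not show the schema, so the one-off script below is only my guess at a table matching the column names used in the queries; adjust the types and lengths to your needs:

# -*- coding: utf-8 -*-
import MySQLdb

# One-off helper to create the table the pipeline writes to.
conn = MySQLdb.connect(db='python', user='root', passwd='root', charset='utf8')
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS doubanmoive (
        m_name           VARCHAR(100),
        m_year           VARCHAR(10),
        m_score          VARCHAR(10),
        m_director       VARCHAR(100),
        m_classification VARCHAR(100),
        m_actor          VARCHAR(100)
    ) DEFAULT CHARSET=utf8
""")
conn.commit()
conn.close()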
MongoDB version:
# -*- coding: utf-8 -*-
import pymongo

from scrapy.exceptions import DropItem
from scrapy.conf import settings
from scrapy import log


class MongoDBPipeline(object):
    # Connect to the MongoDB database
    def __init__(self):
        connection = pymongo.Connection(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # Remove invalid data
        valid = True
        for data in item:
            if not data:
                valid = False
                raise DropItem("Missing %s of blogpost from %s" % (data, item['url']))
        if valid:
            # Insert data into database
            new_moive = [{
                "name": item['name'][0],
                "year": item['year'][0],
                "score": item['score'][0],
                "director": item['director'],
                "classification": item['classification'],
                "actor": item['actor']
            }]
            self.collection.insert(new_moive)
            log.msg("Item wrote to MongoDB database %s/%s" % (settings['MONGODB_DB'], settings['MONGODB_COLLECTION']),
                    level=log.DEBUG, spider=spider)
        return item
As you can see, the basic processing flow is the same, but MySQL is less convenient in that list-typed data has to be joined into a single string with a delimiter, whereas MongoDB supports types such as list and dict directly.
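After a run you can check what landed in MongoDB directly. Here is a quick sketch using the same connection values as the pipeline and settings.py (adjust them if yours differ):

import pymongo

connection = pymongo.Connection('localhost', 27017)
collection = connection['python']['test']

doc = collection.find_one()
if doc is not None:
    # 'classification' and 'actor' come back as native lists,
    # unlike the '/'-joined strings stored in the MySQL table.
    print doc['name'], doc['classification'], doc['actor']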
Configuration file
You will also need to add some configuration information to settings.py before running the crawler.
BOT_NAME = 'doubanmoive'
SPIDER_MODULES = ['doubanmoive.spiders']
NEWSPIDER_MODULE = 'doubanmoive.spiders'
ITEM_PIPELINES = {
    'doubanmoive.mongo_pipelines.MongoDBPipeline': 300,
    'doubanmoive.pipelines.DoubanmoivePipeline': 400,
}
LOG_LEVEL = 'DEBUG'

DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
COOKIES_ENABLED = True

MONGODB_SERVER = 'localhost'
MONGODB_PORT = 27017
MONGODB_DB = 'python'
MONGODB_COLLECTION = 'test'
Both the MySQL and MongoDB pipeline classes are registered in ITEM_PIPELINES; the number after each one represents its execution priority and must be in the range 0~1000. The DOWNLOAD_DELAY and related settings in the middle are there to keep the crawler from being banned by Douban, adding some random delay, a browser user agent, and so on. The last block is the MongoDB configuration; MySQL connection settings could be written here in the same way.
With that, the crawler for Douban movies is complete. Run scrapy crawl doubanmoive on the command line and let the spider crawl!
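Incidentally, if you only want to eyeball the scraped items without setting up any database, Scrapy's built-in feed export can write them to a file instead; just disable the pipelines in settings.py and run something like:

D:\web\python\doubanmoive>scrapy crawl doubanmoive -o items.json -t json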