Shell debugging:
Enter the directory where the project is located and run scrapy shell "url".
For example:
scrapy shell http://www.w3school.com.cn/xml/xml_syntax.asp
Code can then be executed interactively in the terminal session that opens, as shown below:
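A minimal sketch of what can be typed in that shell session, assuming a reasonably recent Scrapy where the shell exposes a response object (the XPath expression here is only an illustration, not taken from the original screenshot):

>>> response.url                                  # the URL that was just fetched
>>> response.xpath('//title/text()').extract()    # extract the page title as a list
>>> view(response)                                # open the downloaded page in a browser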
Related page code:
We now use Scrapy to crawl a specific website, taking the Xunsee site (www.xunsee.com) as an example.
From the content of its homepage we want to extract the list of articles and the corresponding author names.
First, title and author are defined in items.py. The Test1item class here plays a role similar to a Model in Django: it can be seen as a container. This container inherits from scrapy.Item,
and Item in turn inherits from DictItem. Test1item can therefore be thought of as behaving like a dictionary, with title and author as its two fields, i.e. the keys of the dictionary.
# In the Scrapy source, Item is declared roughly as: class Item(DictItem, ...)
import scrapy
from scrapy import Field

class Test1item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()
    author = Field()
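Because Test1item behaves like a dictionary, its fields can be read and written with the usual key syntax. A minimal sketch (the sample title string is only an illustration):

item = Test1item()
item['title'] = ['Sample title']   # assign a field exactly like a dict key
print(item['title'])               # ['Sample title']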
The page-parsing code is then written in test_spider.py, which starts as follows:
from scrapy.spider import Spider
from scrapy.selector import Selector
from test1.items import Test1item   # adjust the package path to match your project

class TestSpider(Spider):
    name = "test1"   # the name here must match the name used when the project was created, otherwise Scrapy reports that the spider cannot be found
    allowed_domains = ['www.xunsee.com']
    start_urls = ["http://www.xunsee.com/"]

    def parse(self, response):
        items = []
        sel = Selector(response)
        # entry point for all the data: each div below holds one article title and its author
        sites = sel.xpath('//*[@id="content_1"]/div')
        for site in sites:
            item = Test1item()
            title = site.xpath('span[@class="title"]/a/text()').extract()
            h = site.xpath('span[@class="title"]/a/@href').extract()   # article link, not used further here
            item['title'] = [t.encode('utf-8') for t in title]
            author = site.xpath('span[@class="author"]/a/text()').extract()
            item['author'] = [a.encode('utf-8') for a in author]
            items.append(item)
        return items
After the title and author content is obtained, it is saved into an item, and all the items are then collected in the items list.
Next, Test1pipeline in pipelines.py is modified as follows. This class handles the items returned by TestSpider; it is where the data is actually stored. Here we write the items to a JSON file.
import json
import codecs

class Test1pipeline(object):
    def __init__(self):
        self.file = codecs.open('xundu.json', 'wb', encoding='utf-8')

    def process_item(self, item, spider):
        # serialize each item as one JSON line and write it to the file
        line = json.dumps(dict(item)) + '\n'
        self.file.write(line.decode("unicode_escape"))
        return item
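Note that Scrapy only invokes a pipeline that has been registered in the project's settings.py. A minimal sketch, assuming the project package is named test1 (the package name and the priority value are assumptions, not shown in the original post; very old Scrapy versions used a plain list instead of a dict):

ITEM_PIPELINES = {
    'test1.pipelines.Test1pipeline': 300,   # lower numbers run earlier
}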
After the project runs, you can see that a xundu.json file has been generated in the project directory, and the run log can be viewed in the log file.
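A sketch of the command used to run the spider and send the log to a file (the spider name follows the code above; the log file name is an assumption):

scrapy crawl test1 --logfile=crawl.log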
From this crawler we can see that the structure of a Scrapy project is fairly simple. The three main steps are:
1. items.py defines the keywords (fields) under which the content is stored.
2. A custom test_spider.py crawls the web page and returns the extracted data.
3. pipelines.py stores the content returned from test_spider.py.