Python web crawler: debugging and crawling web pages with Scrapy

Source: Internet
Author: User
Tags: xpath, python, web crawler

Shell debug:

Change to the directory where the project is located and run scrapy shell "url".

For example:

scrapy shell "http://www.w3school.com.cn/xml/xml_syntax.asp"

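The shell downloads the page and opens an interactive terminal session. There the parsing code can be called directly against the response, for example to try XPath expressions against the related page source.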

We use Scrapy to crawl a specific website, taking the Xunsee website (http://www.xunsee.com/) as an example.

The goal is to extract the list of articles on the homepage and the name of each article's author.

First, define the title and author fields in items.py. Test1Item plays a role similar to a Model in Django: it can be seen as a container for the scraped data. This container inherits from scrapy.Item,

and Item in turn inherits from DictItem. Test1Item can therefore be treated as a dictionary, with title and author as its two keys.

# In the Scrapy source, Item is declared as: class Item(DictItem)

import scrapy
from scrapy import Field


class Test1Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()
    author = Field()
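To make the "item as a dictionary" idea concrete without installing Scrapy, here is a minimal sketch of a dictionary that accepts only declared keys. It is a simplified stand-in, not Scrapy's implementation, and the names FieldRestrictedDict and ArticleRecord are hypothetical. The behaviour it imitates is that assigning to a field not declared on the item raises a KeyError.

```python
# Simplified stand-in for illustration only; this is NOT Scrapy's implementation.
class FieldRestrictedDict(dict):
    """Dictionary that only accepts keys listed in `fields` (hypothetical helper)."""

    fields = ()  # subclasses list their allowed keys here

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("%s does not support field: %s" % (type(self).__name__, key))
        super().__setitem__(key, value)


class ArticleRecord(FieldRestrictedDict):
    # Mirrors the two keys declared on Test1Item
    fields = ("title", "author")


record = ArticleRecord()
record["title"] = ["Example title"]
record["author"] = ["Example author"]

try:
    record["url"] = "http://www.xunsee.com/"  # not declared, so it is rejected
    rejected = False
except KeyError:
    rejected = True

print(dict(record))
print(rejected)
```

Declaring the keys up front means a misspelled field name fails immediately instead of silently creating a new key.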

Next, write the page-parsing code in test_spider.py:

from scrapy import Spider
from scrapy.selector import Selector
from test1.items import Test1Item  # assumes the project package is named test1


class TestSpider(Spider):
    # Here the spider name is the same as the project name. It must match the name
    # passed to "scrapy crawl", otherwise Scrapy reports that the spider cannot be found.
    name = "Test1"
    allowed_domains = ["www.xunsee.com"]  # domain names only, without the scheme
    start_urls = ["http://www.xunsee.com/"]

    def parse(self, response):
        items = []
        sel = Selector(response)
        # Entry point for all the data: each div below it holds one article and its author
        sites = sel.xpath('//*[@id="content_1"]/div')
        for site in sites:
            item = Test1Item()
            title = site.xpath('span[@class="title"]/a/text()').extract()
            h = site.xpath('span[@class="title"]/a/@href').extract()  # article link, not stored
            item['title'] = [t.encode('utf-8') for t in title]
            author = site.xpath('span[@class="author"]/a/text()').extract()
            item['author'] = [a.encode('utf-8') for a in author]
            items.append(item)
        return items
After extracting the title and author values, the spider saves them in an item, then appends each item to the items list, which parse() returns.
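The XPath logic can be tried outside Scrapy as well. The sketch below uses the standard library's xml.etree.ElementTree in place of Scrapy's Selector (it supports only a subset of XPath) and runs against hypothetical markup, since the real page source is not shown here. SAMPLE_PAGE, parse_articles and the link key are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the real page source is not shown in this article.
SAMPLE_PAGE = """
<html><body>
  <div id="content_1">
    <div>
      <span class="title"><a href="/article/1">First article</a></span>
      <span class="author"><a href="/author/a">Author A</a></span>
    </div>
    <div>
      <span class="title"><a href="/article/2">Second article</a></span>
      <span class="author"><a href="/author/b">Author B</a></span>
    </div>
  </div>
</body></html>
"""


def parse_articles(page_source):
    """Return one dict per article block, mirroring the spider's parse() loop."""
    root = ET.fromstring(page_source.strip())
    items = []
    # Same idea as //*[@id="content_1"]/div: every div directly under the container
    for site in root.findall(".//*[@id='content_1']/div"):
        title_links = site.findall("span[@class='title']/a")
        author_links = site.findall("span[@class='author']/a")
        items.append({
            "title": [a.text for a in title_links],
            "link": [a.get("href") for a in title_links],
            "author": [a.text for a in author_links],
        })
    return items


articles = parse_articles(SAMPLE_PAGE)
for article in articles:
    print(article)
```

The relative paths inside the loop have no leading slash, so each lookup is limited to the current div. That is why every title ends up paired with the author from the same block.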

In pipelines.py, modify Test1Pipeline as follows. This class processes the items returned by TestSpider, which means it is where the data gets stored. Here the items are written to a JSON file.

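import codecs
import json


class Test1Pipeline(object):
    def __init__(self):
        self.file = codecs.open('Xundu.json', 'wb', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + '\n'  # one JSON object per line
        # Python 2: turn the \uXXXX escapes from json.dumps back into characters
        self.file.write(line.decode("unicode_escape"))
        return item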
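The pipeline above calls decode() on a str, which exists only in Python 2. Under Python 3 the same idea could be sketched as follows. The class name JsonLinesPipeline, the path parameter and the demo values are hypothetical, and a plain dict stands in for a Scrapy item so the block runs on its own. close_spider is the hook Scrapy calls on a pipeline when the spider finishes.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Write each item as one JSON object per line (hypothetical Python 3 variant)."""

    def __init__(self, path="Xundu.json"):
        # Text mode with an explicit encoding replaces codecs.open()
        self.file = open(path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII text readable instead of \uXXXX escapes
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        self.file.close()


# Stand-alone check with a plain dict in place of a Scrapy item
demo_path = os.path.join(tempfile.mkdtemp(), "Xundu.json")
pipeline = JsonLinesPipeline(demo_path)
pipeline.process_item({"title": ["示例标题"], "author": ["Example author"]}, spider=None)
pipeline.close_spider(spider=None)

with open(demo_path, encoding="utf-8") as handle:
    stored = handle.read()
print(stored)
```

This variant expects the item values to be text strings, so the encode('utf-8') calls in the spider would be dropped with it. In either version, Scrapy only calls a pipeline that is enabled through the ITEM_PIPELINES setting in settings.py, for example {'test1.pipelines.Test1Pipeline': 300} if the project package is named test1 (an assumption).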

After the project runs, a Xundu.json file is generated in the directory. The run log can be viewed in the log file.

As this crawler shows, the structure of a Scrapy project is relatively simple. The three main steps are:

1. items.py defines the keys under which the content is stored

2. The custom test_spider.py crawls the web page and returns the extracted data

3. pipelines.py stores the content returned by test_spider.py
