Python web crawler uses Scrapy to automatically crawl multiple pages


The Scrapy crawler described earlier can only crawl a single page. Suppose we want to crawl multiple pages, for example the chapters of an online novel. The structure looks like the following: the first chapter of the novel provides links to go back to the table of contents or on to the next page.

The corresponding page code:

Looking at later chapters, we can see that a "previous page" link has been added.

The corresponding page code:

Comparing the page source above, we can see that the links for the previous page, the table of contents, and the next page are stored in the href attributes of the <a> elements under a <div>. The difference is that the first chapter has only two <a> elements, while from the second chapter onward there are three. We can therefore decide whether the page has both a previous-page and a next-page link by counting the <a> elements inside the <div>. The code is as follows.
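A minimal sketch of that check, excerpted from the complete spider shown further below (sel is the Selector built from the response in parse()):

# Count the <a> elements under the navigation <div>. Three links mean
# "previous page / table of contents / next page"; two links mean this is
# the first chapter, which has no previous page.
count = len(sel.xpath('//div[@id="nav_1"]/a').extract())
if count > 2:
    # previous / contents / next: the next-page link is the third <a>
    next_link = sel.xpath('//div[@id="nav_1"]/a')[2].xpath('@href').extract()
else:
    # contents / next: the next-page link is the second <a>
    next_link = sel.xpath('//div[@id="nav_1"]/a')[1].xpath('@href').extract()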

Finally, the extracted link is combined into the full URL of the next page, and a Request is issued to fetch that page's data.
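Again excerpted from the complete spider below: the extracted relative href is joined with the article's base URL and handed back to Scrapy as a new Request, so parse() runs again on the next chapter.

# next_link holds the relative href (for example "2.shtml")
for n in next_link:
    url = init_urls + '/' + n
    yield Request(url, callback=self.parse)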

Accordingly, the storage code in pipelines.py also needs to be modified, as follows. Note that JSON is not used here; instead, the text is written directly to a TXT file.

class Test1Pipeline(object):
    def __init__(self):
        self.file = ''

    def process_item(self, item, spider):
        # Append each chapter to the TXT file; opening in 'ab' (append) mode
        # keeps the chapters already written instead of overwriting them.
        self.file = open(r'E:\scrapy_project\xiaoshuo.txt', 'ab')
        self.file.write(item['content'])
        self.file.close()
        return item
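For this pipeline to run, it also has to be enabled in settings.py. A minimal sketch, assuming the Scrapy project is named test1:

# settings.py -- the project name "test1" is an assumption here
ITEM_PIPELINES = {
    'test1.pipelines.Test1Pipeline': 300,
}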

The complete spider code is as follows. Note the use of yield twice: the first yield passes the item to Test1Pipeline for storage, and once storage is finished, the second yield issues a Request for the next page, whose content is then fetched and parsed in the same way.

from scrapy import Spider, Request
from scrapy.selector import Selector

from test1.items import Test1Item   # items module name assumed from the project


class TestSpider(Spider):
    name = "test1"
    allowed_domains = ['www.xunsee.com', 'www.xunread.com']  # both hosts appear in this example
    start_urls = ["http://www.xunread.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/1.shtml"]

    def parse(self, response):
        init_urls = "http://www.xunread.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615"
        sel = Selector(response)
        context = ''
        # Collect the chapter text from the content div
        content = sel.xpath('//div[@id="content_1"]/text()').extract()
        for c in content:
            context = context + c.encode('utf-8')
        items = Test1Item()
        items['content'] = context
        # Two or three <a> elements under nav_1: the last one is always "next page"
        count = len(sel.xpath('//div[@id="nav_1"]/a').extract())
        if count > 2:
            next_link = sel.xpath('//div[@id="nav_1"]/a')[2].xpath('@href').extract()
        else:
            next_link = sel.xpath('//div[@id="nav_1"]/a')[1].xpath('@href').extract()
        # First yield: send the item to the pipeline for storage
        yield items
        # Second yield: request the next chapter and parse it with the same method
        for n in next_link:
            url = init_urls + '/' + n
            print url
            yield Request(url, callback=self.parse)
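With the item class, pipeline, and spider in place, the crawl can be started from the project directory in the usual way (the spider name comes from the name attribute above):

scrapy crawl test1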

Scrapy provides a more convenient way to crawl web pages automatically: CrawlSpider.

The spider in the previous section can only parse the pages in start_urls. Although automatic crawling was implemented above, the approach is somewhat cumbersome. In Scrapy, CrawlSpider can be used to crawl web pages automatically.

The prototype of a crawling rule is as follows:

class scrapy.contrib.spiders.Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)

link_extractor: defines how links are extracted from the crawled pages.

callback points to a function that is called for every link obtained by the link extractor; the function receives the response as its first argument. Note: when using CrawlSpider, do not use parse as the callback, because CrawlSpider uses the parse method to implement its own logic, and overriding it will break the crawl.

follow is a boolean value indicating whether links extracted from the response should themselves be followed.
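Putting these parameters together, here is a minimal sketch of a rule definition (it uses the same pattern as the full spider later in this article; the contrib import path follows the prototype above):

from scrapy.contrib.spiders import Rule
from scrapy.linkextractors import LinkExtractor

# Follow every link whose href matches "<digit>.shtml"; each fetched page is
# passed to the spider's parse_item method, and follow=True keeps applying
# the rule to the pages that are found.
rules = (
    Rule(LinkExtractor(allow=('\d\.shtml',)), callback='parse_item', follow=True),
)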

Here is an example of extracting links from www.sina.com.cn in the scrapy shell.

The allow parameter of LinkExtractor applies to the href attribute only:

For example, for the following link, the regular expression is matched against the href attribute only.

The resulting structure is as follows; each matching link can be obtained.
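A minimal sketch of such an extraction, assuming the www.sina.com.cn front page has already been loaded in the scrapy shell; the 'news' pattern is only an illustrative assumption:

from scrapy.linkextractors import LinkExtractor

# allow= matches a regular expression against each link's href attribute only;
# links whose href does not match the pattern are skipped.
links = LinkExtractor(allow=('news',)).extract_links(response)
for link in links:
    print link.url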

The extraction can also be limited with restrict_xpaths, for example as follows:
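A minimal sketch of restricting the extraction with restrict_xpaths; the XPath expression here is an illustrative assumption rather than the real structure of the page:

from scrapy.linkextractors import LinkExtractor

# restrict_xpaths limits the search for links to the region(s) selected by the
# XPath expression, so only <a> elements inside that div are considered.
links = LinkExtractor(allow=('news',),
                      restrict_xpaths=('//div[@class="nav"]',)).extract_links(response)
for link in links:
    print link.url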

Example 2: take the Xunread novel site from earlier as an example again.

Extract the address of the next chapter from the page:

Web address:

http://www.xunread.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/1.shtml

The relative URL of the next page is 2.shtml.

It can be extracted with the following rule:

>>> item = LinkExtractor(allow=('\d\.shtml')).extract_links(response)
>>> for i in item:
...     print i.url
...
http://www.xunread.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/2.shtml

The links to all chapters can also be obtained directly from the navigation (index) page:

C:\Users\Administrator>scrapy shell http://www.xunread.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/index.shtml

>>> from scrapy.linkextractors import LinkExtractor
>>> item = LinkExtractor(allow=('\d\.shtml')).extract_links(response)
>>> for i in item:
...     print i.url

This returns all the chapter links.

The corresponding CrawlSpider code in Scrapy is then as follows:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector

from test1.items import Test1Item   # items module name assumed from the project


class TestSpider(CrawlSpider):
    name = "test1"
    allowed_domains = ['www.xunsee.com']
    start_urls = ["http://www.xunsee.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/1.shtml"]
    # Follow every link matching "<digit>.shtml" and parse it with parse_item
    rules = (Rule(LinkExtractor(allow=('\d\.shtml')), callback='parse_item', follow=True),)

    def parse_item(self, response):
        print response.url
        sel = Selector(response)
        context = ''
        content = sel.xpath('//div[@id="content_1"]/text()').extract()
        for c in content:
            context = context + c.encode('utf-8')
        items = Test1Item()
        items['content'] = context
        yield items

The key is rules = (Rule(LinkExtractor(allow=('\d\.shtml')), callback='parse_item', follow=True),), which defines the rule for extracting pages. Taking the example above, the crawl proceeds in the following steps:

1. Starting from http://www.xunsee.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/1.shtml, parse_item is called first to extract the page content with XPath; the rule is then applied to the page, and the link to 2.shtml is extracted.

2. The crawler enters 2.shtml and repeats the process from step 1, until the rule no longer extracts any new links.

We can also optimize a little by setting start_urls to the index page:

http://www.xunsee.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/index.shtml

In this way, the rule extracts all the chapter links in one pass, and parse_item is then called for each link to extract the page content. This is much more efficient than starting from 1.shtml and following the chapters one at a time.
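A minimal sketch of that change, using the same assumed spider as above; only start_urls needs to be replaced:

# Start from the chapter index instead of the first chapter; the rule then
# discovers every "<digit>.shtml" link in a single pass.
start_urls = ["http://www.xunsee.com/article/8c39f5a0-ca54-44d7-86cc-148eee4d6615/index.shtml"]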
