Crawling school news reports with Scrapy



Goal: capture all the news and announcements on the official website of the School of Public Management of Sichuan University (http://ggglxy.scu.edu.cn).

Lab Process

1. Determine the capture target.
2. Create capture rules.
3. Write/debug capture rules.
4. Obtain the captured data.

1. Determine the capture target

We need to capture all the news and information of the School of Public Management of Sichuan University, so we first need to understand the layout of its official website.



We find that the news we want cannot be crawled directly from the homepage; we have to click "More" to enter the news section.




Now we can see the news list, but it does not meet our crawling needs: from the news list page we can only capture the time, title, and URL of each article, not the news content itself. So we need to go to the news detail page to capture the full content.

2. Create capture rules

From the analysis in the first part, we know that to capture the full news content we have to go from the news list page into each news detail page. Let's click on one piece of news to try it out.



We find that all the required data can be captured directly on the news detail page: title, time, content, and URL.

Now we have a clear idea of how to capture a single news article. But how can we capture all the news content? That is the next problem to solve.

At the bottom of the news list page there are page-navigation buttons, so we can reach every page of news through the "next page" button.

Now we can state an obvious crawling rule:
Capture all the news links on every page of the news list, then follow each link to its news detail page and capture the content there.
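
The spider code in the next section fills a GgglxyItem imported from ggglxy.items, but the article never shows that file. Here is a minimal sketch, with the field names inferred from the spider code below:

    # ggglxy/items.py -- minimal sketch; field names are inferred from the spider code,
    # not taken from the original article.
    import scrapy

    class GgglxyItem(scrapy.Item):
        date = scrapy.Field()     # publication date of the news article
        href = scrapy.Field()     # URL of the news detail page
        title = scrapy.Field()    # article title
        content = scrapy.Field()  # full article text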

3. Write/debug capture rules

To keep the debugging granularity as small as possible, I will write and debug the crawler together.
In the crawler, I will implement the following functions:

1. Crawl all the news links on one page of the news list.
2. Follow each crawled news link into the news detail page and capture the required data (mainly the news content).
3. Crawl all the news pages in a loop.

The corresponding knowledge points are:

1. Crawl the basic data under a page.
2. Perform secondary crawling through the crawled data.
3. Crawl all the data on the site in a loop.

Let's go through them one by one.

3.1 Crawl all the news links on one page of the news list

By analyzing the source code of the news list page, we find the structure that holds the data we want to capture.



So we only need to point the crawler's selector at the div with class 'newsinfo_box cf' and then crawl each entry in a for loop.

Write code
import scrapy

class News2Spider(scrapy.Spider):
    name = "news_info_2"
    start_urls = [
        "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1",
    ]

    def parse(self, response):
        # Each news entry on the list page sits in a div with class 'newsinfo_box cf'
        for href in response.xpath("//div[@class='newsinfo_box cf']"):
            # Build the absolute URL of the news detail page
            url = response.urljoin(href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first())

Test, pass!
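
If a selector does not behave as expected, scrapy shell is handy for testing the XPath interactively before putting it into the spider. A quick sketch against the first list page:

    # Open an interactive shell on the news list page
    scrapy shell "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1"

    # Inside the shell, try the selectors used in parse():
    >>> boxes = response.xpath("//div[@class='newsinfo_box cf']")
    >>> len(boxes)                      # number of news entries on this page
    >>> boxes[0].xpath("div[@class='news_c fr']/h3/a/@href").extract_first()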


3.2 Go to the news detail page to crawl the required data (mainly the news content)

Now that I have a set of URLs, I need to visit each one and capture the title, time, and content. The implementation is quite simple: whenever the original code extracts a URL, pass that URL on and capture the corresponding data from the detail page. So I only need to write another method that captures the news detail page, and call it through scrapy.Request.

Write code
# Capture method for the news detail page
def parse_dir_contents(self, response):
    item = GgglxyItem()
    item['date'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract_first()
    item['href'] = response.url  # store the URL of the detail page
    item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract_first()
    data = response.xpath("//div[@class='detail_zy_c pb30 mb30']")
    item['content'] = data[0].xpath('string(.)').extract()[0]
    yield item

After integrating it with the original code, we have:

import scrapy
from ggglxy.items import GgglxyItem

class News2Spider(scrapy.Spider):
    name = "news_info_2"
    start_urls = [
        "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1",
    ]

    def parse(self, response):
        for href in response.xpath("//div[@class='newsinfo_box cf']"):
            url = response.urljoin(href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first())
            # Call the news capture method on each detail page
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    # The capture method
    def parse_dir_contents(self, response):
        item = GgglxyItem()
        item['date'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract_first()
        item['href'] = response.url  # store the URL of the detail page
        item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract_first()
        data = response.xpath("//div[@class='detail_zy_c pb30 mb30']")
        item['content'] = data[0].xpath('string(.)').extract()[0]
        yield item

Test, pass!
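
To debug parse_dir_contents on its own, Scrapy's parse command can run a single callback against a single URL. A sketch (the placeholder stands for any real news detail link):

    # Run only the detail-page callback against one URL and print the scraped item
    scrapy parse --spider=news_info_2 -c parse_dir_contents "<news-detail-url>"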



Then we add a loop:

NEXT_PAGE_NUM = 1

        # At the end of parse(): move on to the next list page until page 10
        NEXT_PAGE_NUM = NEXT_PAGE_NUM + 1
        if NEXT_PAGE_NUM < 11:
            next_url = 'http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=%s' % NEXT_PAGE_NUM
            yield scrapy.Request(next_url, callback=self.parse)

Add to the original code:

import scrapy
from ggglxy.items import GgglxyItem

NEXT_PAGE_NUM = 1

class News2Spider(scrapy.Spider):
    name = "news_info_2"
    start_urls = [
        "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1",
    ]

    def parse(self, response):
        for href in response.xpath("//div[@class='newsinfo_box cf']"):
            url = response.urljoin(href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first())
            yield scrapy.Request(url, callback=self.parse_dir_contents)
        # Follow the news list pages 2..10
        global NEXT_PAGE_NUM
        NEXT_PAGE_NUM = NEXT_PAGE_NUM + 1
        if NEXT_PAGE_NUM < 11:
            next_url = 'http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=%s' % NEXT_PAGE_NUM
            yield scrapy.Request(next_url, callback=self.parse)

    def parse_dir_contents(self, response):
        item = GgglxyItem()
        item['date'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract_first()
        item['href'] = response.url  # store the URL of the detail page
        item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract_first()
        data = response.xpath("//div[@class='detail_zy_c pb30 mb30']")
        item['content'] = data[0].xpath('string(.)').extract()[0]
        yield item

Test:



The crawl captured 191 items, but the official website shows 193 published news articles: we are two short.
Why? Looking at the log, we notice two errors.
Locating the problem: it turns out that the school's news section contains two hidden second-level columns.
For example:




The corresponding URLs look different from ordinary article links (they contain a 'type' parameter instead of pointing straight at an article). No wonder the crawler cannot catch them!
So we have to set a special rule for the URLs of these two second-level columns: when a captured link points to one of them, treat it as another list page and parse it again:

            if url.find('type') != -1:
                yield scrapy.Request(url, callback=self.parse)

Assembled into the original spider:

import scrapy
from ggglxy.items import GgglxyItem

NEXT_PAGE_NUM = 1

class News2Spider(scrapy.Spider):
    name = "news_info_2"
    start_urls = [
        "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1",
    ]

    def parse(self, response):
        for href in response.xpath("//div[@class='newsinfo_box cf']"):
            url = response.urljoin(href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first())
            if url.find('type') != -1:
                # Links to the two hidden second-level columns are list pages, so parse them again
                yield scrapy.Request(url, callback=self.parse)
            else:
                yield scrapy.Request(url, callback=self.parse_dir_contents)
        # Follow the news list pages 2..10
        global NEXT_PAGE_NUM
        NEXT_PAGE_NUM = NEXT_PAGE_NUM + 1
        if NEXT_PAGE_NUM < 11:
            next_url = 'http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=%s' % NEXT_PAGE_NUM
            yield scrapy.Request(next_url, callback=self.parse)

    def parse_dir_contents(self, response):
        item = GgglxyItem()
        item['date'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract_first()
        item['href'] = response.url  # store the URL of the detail page
        item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract_first()
        data = response.xpath("//div[@class='detail_zy_c pb30 mb30']")
        item['content'] = data[0].xpath('string(.)').extract()[0]
        yield item

Test:



We find that the number of captured records has increased from the previous 191 to 238, and there are no errors in the log, which shows that our crawling rule works!
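
As a side note, the global NEXT_PAGE_NUM counter works because we happen to know there are ten list pages, but it is fragile. A more robust variant would follow the "next page" link found on the page itself. A sketch of the end of parse(), using a hypothetical XPath for the pager link:

        # Instead of a global page counter, follow the pager's own "next page" link.
        # The XPath below is hypothetical -- adjust it to the real pager markup.
        next_href = response.xpath("//a[contains(text(), '下页')]/@href").extract_first()
        if next_href:
            yield scrapy.Request(response.urljoin(next_href), callback=self.parse)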

4. Obtain captured data
     scrapy crawl news_info_2 -o 0016.json
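
To double-check that all 238 records made it into the file, a quick sketch that loads the JSON produced by the command above and counts the items:

    import json

    # 0016.json is the file written by "scrapy crawl news_info_2 -o 0016.json"
    with open('0016.json', encoding='utf-8') as f:
        items = json.load(f)

    print(len(items))                        # expected: 238
    print(items[0]['title'], items[0]['date'])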

If you have any questions during the learning process or want to get learning resources, join the learning exchange group 626062078. Let's learn Python together!
