Scrapy Example: Crawling a College News Site

Source: Internet
Author: User
Tags: xpath

Goal: crawl all the news bulletins from the website of the School of Public Administration, Sichuan University (http://ggglxy.scu.edu.cn).

Experimental process

1. Determine the crawl target.
2. Develop the crawl rules.
3. Write and debug the crawl rules.
4. Export the crawled data.

1. Determine the crawl target

The target this time is all the news from the School of Public Administration at Sichuan University, so first we need to understand the layout of the school's official website.


[Screenshot: front page of the School of Public Administration website]

Here we find that the news cannot all be crawled directly from the homepage; we have to click "More" to enter the full news column.


[Screenshot: the news column (list) page]


Now we can see the news column itself, but it still does not meet our needs: from the list page we can only get each item's time, title, and URL, not the news body. So we have to go into the news detail pages to grab the full content.

2. Develop crawl rules

From the analysis in part 1, it is clear that to crawl the full information of a piece of news we have to click through from the list page into its detail page. Let's try it on one piece of news.


[Screenshot: a news detail page]

We find that the detail page contains all the data we need: title, time, content, and URL.

Now we know how to grab a single piece of news. But how do we crawl all the news?
That obviously won't stump us.


At the bottom of the news column there are page navigation buttons, so we can crawl all the news by following the "next page" button.

Tidying up these ideas gives an obvious crawl rule:
traverse all the news links on every page of the news column, follow each link into its detail page, and grab the news content. A structural sketch of this rule is given below; the working code is developed step by step in section 3.
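The rule amounts to a spider with two callbacks: one that walks the list pages and one that scrapes a single detail page. The outline below only shows that structure; the class name is made up for illustration, and the real selectors and pagination logic come later.

    # Structural sketch only: parse() walks the news-column list pages,
    # parse_dir_contents() scrapes one news detail page.
    import scrapy

    class NewsSpiderSketch(scrapy.Spider):
        name = "news_sketch"
        start_urls = ["http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1"]

        def parse(self, response):
            # 1) for every news entry on this list page, yield a Request for its
            #    detail URL with callback=self.parse_dir_contents
            # 2) yield a Request for the next list page with callback=self.parse
            pass

        def parse_dir_contents(self, response):
            # extract title / date / content from the detail page and yield an item
            pass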

3. Write and debug the crawl rules

To keep the debugging granularity as small as possible, writing and debugging are done together. The crawler implements the following functional points:

1. Crawl all the news links from one page of the news column.
2. Follow the crawled links into the news detail pages and grab the required data (mainly the news content).
3. Loop over the pages to crawl all the news.

The corresponding points of knowledge are:

1. Crawling basic data from a page.
2. Making a second-level request from data crawled on the first pass.
3. Looping over pages to crawl all the data.

Enough talk, let's get to work.

3.1 Crawl all the news links from one page of the news column


[Screenshot: HTML source of the news column page]

Analysing the source code of the news column page, we find that the data we want has the following structure:


[Screenshot: the HTML structure of a single news entry]

So we only need a selector that locates each news entry's container (the div with class newsinfo_box cf), and then crawl the entries in a for loop.
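Before committing the selector to code, it can be checked interactively in the Scrapy shell. The snippet below is just a quick sketch of that check; what it returns depends on the live page:

    scrapy shell "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1"

    # inside the shell:
    >>> entries = response.xpath("//div[@class='newsinfo_box cf']")
    >>> len(entries)   # one match per news entry on the page
    >>> entries[0].xpath("div[@class='news_c fr']/h3/a/@href").extract_first()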

Writing code

    import scrapy

    class News2Spider(scrapy.Spider):
        name = "news_info_2"
        start_urls = [
            "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1",
        ]

        def parse(self, response):
            # one div.newsinfo_box.cf per news entry on the list page
            for href in response.xpath("//div[@class='newsinfo_box cf']"):
                # build the absolute URL of the news detail page
                url = response.urljoin(href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first())

Test, pass!


[Screenshot: test output showing the captured news links]

3.2 Follow the crawled links into the news detail pages and grab the required data (mainly the news content)

Now we have a set of URLs, and we need to enter each one to grab the title, time, and content. The implementation is simple: wherever the original code captures a URL, issue a request for that URL and fetch the corresponding data. In other words, we only need to write a scraping method for the news detail page and call it through scrapy.Request.

Writing code

    # method that scrapes the news detail page
    def parse_dir_contents(self, response):
        item = GgglxyItem()
        item['date'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract_first()
        item['href'] = response.url  # URL of the news detail page
        item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract_first()
        data = response.xpath("//div[@class='detail_zy_c pb30 mb30']")
        item['content'] = data[0].xpath('string(.)').extract()[0]
        yield item
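This method fills a GgglxyItem imported from ggglxy.items. The article never shows its items.py, so the following is only a minimal sketch consistent with the fields the spider uses:

    # items.py -- assumed minimal definition, matching the fields filled by the spider
    import scrapy

    class GgglxyItem(scrapy.Item):
        date = scrapy.Field()     # publication date
        href = scrapy.Field()     # URL of the news detail page
        title = scrapy.Field()    # news title
        content = scrapy.Field()  # full news body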

Integrated into the original code:

    import scrapy
    from ggglxy.items import GgglxyItem

    class News2Spider(scrapy.Spider):
        name = "news_info_2"
        start_urls = [
            "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1",
        ]

        def parse(self, response):
            for href in response.xpath("//div[@class='newsinfo_box cf']"):
                url = response.urljoin(href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first())
                # call the news-scraping method on the detail page
                yield scrapy.Request(url, callback=self.parse_dir_contents)

        # method that scrapes the news detail page
        def parse_dir_contents(self, response):
            item = GgglxyItem()
            item['date'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract_first()
            item['href'] = response.url
            item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract_first()
            data = response.xpath("//div[@class='detail_zy_c pb30 mb30']")
            item['content'] = data[0].xpath('string(.)').extract()[0]
            yield item

Test, pass!


[Screenshot: test output showing captured news items]

Then we add a loop:

    next_page_num = 1   # module-level page counter

        # appended at the end of parse(): request the next list page, up to page 10
        next_page_num = next_page_num + 1
        if next_page_num < 11:
            next_url = 'http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=%s' % next_page_num
            yield scrapy.Request(next_url, callback=self.parse)

Added to the original code:

    import scrapy
    from ggglxy.items import GgglxyItem

    next_page_num = 1

    class News2Spider(scrapy.Spider):
        name = "news_info_2"
        start_urls = [
            "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1",
        ]

        def parse(self, response):
            for href in response.xpath("//div[@class='newsinfo_box cf']"):
                url = response.urljoin(href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first())
                yield scrapy.Request(url, callback=self.parse_dir_contents)
            # page through the news list, up to page 10
            global next_page_num
            next_page_num = next_page_num + 1
            if next_page_num < 11:
                next_url = 'http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=%s' % next_page_num
                yield scrapy.Request(next_url, callback=self.parse)

        def parse_dir_contents(self, response):
            item = GgglxyItem()
            item['date'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract_first()
            item['href'] = response.url
            item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract_first()
            data = response.xpath("//div[@class='detail_zy_c pb30 mb30']")
            item['content'] = data[0].xpath('string(.)').extract()[0]
            yield item

Test:


[Screenshot: test output]

The number of items caught is 191, but the official site has 193 news items; we are two short.
Why? Looking at the log, we notice two errors.
Locating the problem: it turns out that the school's news column contains two hidden second-level columns.
For example:


[Screenshot: a hidden second-level column on the news list page]


The corresponding URL is


[Screenshot: the URL of the second-level column]


These URLs have a different format from the ordinary news links, so no wonder they couldn't be caught!
So we have to add a special rule for these two second-level column URLs: just check whether a link points to a second-level column:

            if url.find('type') != -1:
                # the link is a hidden second-level column: parse it as another list page
                yield scrapy.Request(url, callback=self.parse)

Assembled into the original code:

    import scrapy
    from ggglxy.items import GgglxyItem

    next_page_num = 1

    class News2Spider(scrapy.Spider):
        name = "news_info_2"
        start_urls = [
            "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1",
        ]

        def parse(self, response):
            for href in response.xpath("//div[@class='newsinfo_box cf']"):
                url = response.urljoin(href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first())
                if url.find('type') != -1:
                    # hidden second-level column: parse it as another list page
                    yield scrapy.Request(url, callback=self.parse)
                yield scrapy.Request(url, callback=self.parse_dir_contents)
            global next_page_num
            next_page_num = next_page_num + 1
            if next_page_num < 11:
                next_url = 'http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=%s' % next_page_num
                yield scrapy.Request(next_url, callback=self.parse)

        def parse_dir_contents(self, response):
            item = GgglxyItem()
            item['date'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract_first()
            item['href'] = response.url
            item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract_first()
            data = response.xpath("//div[@class='detail_zy_c pb30 mb30']")
            item['content'] = data[0].xpath('string(.)').extract()[0]
            yield item

Test:


[Screenshot: test output after the fix]

We find that the number of captured items has increased from the previous 193 to 238, and the log shows no errors, so our crawl rule is OK!

4. Export the crawled data

    scrapy crawl news_info_2 -o 0016.json
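Since the news content is in Chinese, the exported JSON may show escaped unicode by default. If that matters, one option (assuming a reasonably recent Scrapy version) is to set the feed export encoding in settings.py:

    # settings.py -- optional: write readable UTF-8 instead of \uXXXX escapes in the JSON feed
    FEED_EXPORT_ENCODING = 'utf-8'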
