Crawl all of the news from the website of the School of Public Administration, Sichuan University (http://ggglxy.scu.edu.cn/).
Experimental process
1. Determine the crawl target.
2. Work out the crawl rules.
3. Write and debug the crawl rules.
4. Export the crawled data.
1. Determine the crawl target
The target this time is all of the news published by the School of Public Administration of Sichuan University, so first we need to understand the layout of the school's official website.
Here we find that the homepage does not let us crawl all of the news directly; we have to click "More" to enter the general news column.
Now we can see the actual news column, but it still does not meet our needs: the news list page only exposes the date, title and URL of each item, not the body of the news. So we have to go into each news detail page and grab the details there.
2. Work out the crawl rules
From the analysis in part 1, we know that to get the full information of a news item we have to click through from the news list page into its detail page and grab the content there. Let's try it on a single news item.
We find that on the news detail page we can capture all the data we need directly: title, date, content and URL.
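These fields map onto an Item class. The spider code later imports GgglxyItem from ggglxy.items; the original post does not show that file, so here is a minimal sketch of what ggglxy/items.py is assumed to look like, with one field per piece of data we want:

import scrapy

class GgglxyItem(scrapy.Item):
    # one field for each piece of data captured on the detail page
    date = scrapy.Field()
    href = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()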
Good, now we know how to grab a single news item. But how do we crawl all of the news?
That obviously won't stump us: at the bottom of the news column there is a set of page-jump buttons, so we can reach all of the news by following the "next page" button.
Tidying up these ideas, an obvious crawl rule emerges:
Crawl all the news links on every page of the news column, then follow each link into its news detail page and grab the full content of the news.
3. Write and debug the crawl rules
To keep the debugging granularity as small as possible, I mix writing and debugging of the modules together.
In the spider, I will implement the following functional points:
1. Crawl all the news links on one page of the news column.
2. Follow the crawled links into the news detail pages and grab the required data (mainly the news content).
3. Loop over the pages to crawl all of the news.
The corresponding knowledge points are:
1. Crawling the base data from a page.
2. Making a second-stage crawl from the crawled data.
3. Crawling all of the data by looping over the pages.
Enough talk, let's get to work.
3.1 Crawl all the news links on one page of the news column
By analyzing the source code of the news column page, we find that each item we want lives in a div with class "newsinfo_box cf", and the link to its detail page sits under div.news_c.fr > h3 > a.
So we only need to point the spider's selector at these "newsinfo_box cf" blocks and extract the links in a for loop.
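Before writing the spider, the selector can be checked interactively in the Scrapy shell (a quick sketch; the exact results depend on the live page):

scrapy shell "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1"

# inside the shell:
>>> boxes = response.xpath("//div[@class='newsinfo_box cf']")
>>> len(boxes)   # should equal the number of news items listed on the page
>>> boxes[0].xpath("div[@class='news_c fr']/h3/a/@href").extract_first()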
Write the code:
import scrapy


class News2Spider(scrapy.Spider):
    name = "news_info_2"
    start_urls = [
        "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1",
    ]

    def parse(self, response):
        for href in response.xpath("//div[@class='newsinfo_box cf']"):
            url = response.urljoin(
                href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first())
Test, pass!
3.2 Follow the crawled links into the news detail pages and grab the required data (mainly the news content)
Now I have a set of URLs, and I need to enter each one to grab the title, date and content I need. The implementation is simple: whenever the original code captures a URL, enter it and fetch the corresponding data. So I only have to write a crawl method for the news detail page and invoke it through scrapy.Request.
Write the code:
# Crawl method for the news detail page
def parse_dir_contents(self, response):
    item = GgglxyItem()
    item['date'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract_first()
    item['href'] = response.url
    item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract_first()
    data = response.xpath("//div[@class='detail_zy_c pb30 mb30']")
    # string(.) concatenates all the text nodes inside the content div
    item['content'] = data[0].xpath('string(.)').extract()[0]
    yield item
After integrating it into the original code:
import scrapy
from ggglxy.items import GgglxyItem


class News2Spider(scrapy.Spider):
    name = "news_info_2"
    start_urls = [
        "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1",
    ]

    def parse(self, response):
        for href in response.xpath("//div[@class='newsinfo_box cf']"):
            url = response.urljoin(
                href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first())
            # Hand the news URL over to the detail-page crawl method
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    # Crawl method for the news detail page
    def parse_dir_contents(self, response):
        item = GgglxyItem()
        item['date'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract_first()
        item['href'] = response.url
        item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract_first()
        data = response.xpath("//div[@class='detail_zy_c pb30 mb30']")
        item['content'] = data[0].xpath('string(.)').extract()[0]
        yield item
Test, pass!
Then we add a loop:
next_page_num = 1

next_page_num = next_page_num + 1
if next_page_num < 11:
    next_url = 'http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=%s' % next_page_num
    yield scrapy.Request(next_url, callback=self.parse)
Added to the original code:
import scrapy
from ggglxy.items import GgglxyItem

next_page_num = 1


class News2Spider(scrapy.Spider):
    name = "news_info_2"
    start_urls = [
        "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1",
    ]

    def parse(self, response):
        for href in response.xpath("//div[@class='newsinfo_box cf']"):
            url = response.urljoin(
                href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first())
            yield scrapy.Request(url, callback=self.parse_dir_contents)
        # Move on to the next page of the news column (pages 1..10)
        global next_page_num
        next_page_num = next_page_num + 1
        if next_page_num < 11:
            next_url = 'http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=%s' % next_page_num
            yield scrapy.Request(next_url, callback=self.parse)

    # Crawl method for the news detail page
    def parse_dir_contents(self, response):
        item = GgglxyItem()
        item['date'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract_first()
        item['href'] = response.url
        item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract_first()
        data = response.xpath("//div[@class='detail_zy_c pb30 mb30']")
        item['content'] = data[0].xpath('string(.)').extract()[0]
        yield item
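As an aside: the module-level counter works here, but it is easy to get wrong because Scrapy schedules requests concurrently and the global state is shared by every callback. A sketch of an alternative that derives the next page from the URL of the response being parsed (same URL pattern and imports as the spider above, no global variable needed):

from urllib.parse import urlparse, parse_qs

    def parse(self, response):
        # ... yield the detail-page requests as above ...
        # read the current page number out of the response URL
        page = int(parse_qs(urlparse(response.url).query).get('page', ['1'])[0])
        if page < 10:
            next_url = 'http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=%s' % (page + 1)
            yield scrapy.Request(next_url, callback=self.parse)

For the rest of this walkthrough we keep the counter-based version shown in the original code above.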
Test:
The number of items captured is 191, but browsing the official site we count 193 news items, two fewer than expected.
Why? Looking at the log, we notice two errors:
Locating the problem: it turns out that the school's news column contains two hidden second-level columns.
For example, the link of such a hidden column points to a URL whose format differs from that of an ordinary news item (it carries a 'type' parameter), which is why our rule could not catch it!
So we have to add a special rule for the URLs of these two second-level columns: we only need to check whether a captured link is a second-level column, and if so, parse it as another list page:
if url.find('type') != -1:
    yield scrapy.Request(url, callback=self.parse)
Assembled into the original function:
import scrapy
from ggglxy.items import GgglxyItem

next_page_num = 1


class News2Spider(scrapy.Spider):
    name = "news_info_2"
    start_urls = [
        "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1",
    ]

    def parse(self, response):
        for href in response.xpath("//div[@class='newsinfo_box cf']"):
            url = response.urljoin(
                href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first())
            # Second-level column pages are fed back into parse as list pages
            if url.find('type') != -1:
                yield scrapy.Request(url, callback=self.parse)
            yield scrapy.Request(url, callback=self.parse_dir_contents)
        global next_page_num
        next_page_num = next_page_num + 1
        if next_page_num < 11:
            next_url = 'http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=%s' % next_page_num
            yield scrapy.Request(next_url, callback=self.parse)

    # Crawl method for the news detail page
    def parse_dir_contents(self, response):
        item = GgglxyItem()
        item['date'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract_first()
        item['href'] = response.url
        item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract_first()
        data = response.xpath("//div[@class='detail_zy_c pb30 mb30']")
        item['content'] = data[0].xpath('string(.)').extract()[0]
        yield item
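One thing worth noting about the assembled parse: when a link is a second-level column, the code yields a request back into parse and then still yields a detail-page request for the same URL. If you would rather make the intent explicit and skip the detail request for column pages, a small variant of that loop body (same selectors, only a continue added) might look like this:

        for href in response.xpath("//div[@class='newsinfo_box cf']"):
            url = response.urljoin(
                href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first())
            if url.find('type') != -1:
                # second-level column: treat it as another list page, not a news detail page
                yield scrapy.Request(url, callback=self.parse)
                continue
            yield scrapy.Request(url, callback=self.parse_dir_contents)

In practice the two versions should behave the same, because Scrapy's default duplicate filter fingerprints requests by URL, method and body (not by callback), so the second request for the same column URL is dropped anyway.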
Test:
We find that the number of captured items has grown from the earlier 193 to 238, and the log contains no errors, which shows that our crawl rules are OK!
4. Export the crawled data
scrapy crawl news_info_2 -o 0016.json
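With the -o flag Scrapy exports the scraped items as a JSON array. As a quick sanity check (a small sketch, assuming the 0016.json produced above), you can count the exported items and compare against the 238 reported in the crawl log:

import json

with open('0016.json', encoding='utf-8') as f:
    items = json.load(f)

print(len(items))        # expected to match the item count in the crawl log
print(items[0].keys())   # date, href, title, content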