Crawling School News Reports with Scrapy
Capture all the news and announcements on the official website of the School of Public Administration, Sichuan University (http://ggglxy.scu.edu.cn).
Lab Process
1. Determine the capture target.
2. Create capture rules.
3. Write/debug capture rules.
4. Obtain the captured data.
1. Determine the capture target
We need to capture all the news and announcements of the School of Public Administration of Sichuan University, so first we need to understand the layout of its official website.
We find that the news we want cannot be crawled directly from the homepage of the official website; we need to click "More" to enter the news column.
Now we can see the news column itself, but this obviously does not meet our crawling needs: the news list page only gives us each article's time, title, and URL, not its content. So we need to go to the news details page to capture the full news content.
2. Create capture rules
From the analysis in the first part, we know that to capture the full news content we have to follow the links on the news list page through to the news details pages. Let's click on a piece of news and try it out.
We find that the news details page contains all the data we need: title, time, content, and URL.
Well, now we know how to capture a single news article. But how can we capture all of the news?
That turns out not to be difficult: there are page-jump buttons at the bottom of the news list page, so we can reach every article by following the "next page" button.
Now, we can think of an obvious crawling rule:
Capture all the news links on the news list pages, follow each link to its news details page, and capture the news content there.
3. Write/debug capture rules
To reduce the granularity of crawler debugging, I will combine the writing and debugging steps.
In the crawler, I will implement the following functions:
1. Crawl all the news links on a news list page.
2. Follow each crawled news link to the news details page and crawl the required data (mainly the news content).
3. Crawl all the news by looping over the pages.
The corresponding knowledge points are:
1. Crawl the basic data on a single page.
2. Perform a second-level crawl based on the data already crawled.
3. Loop to crawl all the data on the website.
Enough talk; let's get started.
3.1 Crawl all the news links on a news list page
By analyzing the source code of the news list page, we find that each news entry is wrapped in a div with class 'newsinfo_box cf', and the link to its details page sits in the nested div with class 'news_c fr' (inside an h3/a element). So we only need to locate that selector and then crawl the entries in a for loop.
Write code
import scrapy

class News2Spider(scrapy.Spider):
    name = "news_info_2"
    start_urls = [
        "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1",
    ]

    def parse(self, response):
        for href in response.xpath("//div[@class='newsinfo_box cf']"):
            url = response.urljoin(href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first())
Test, pass!
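A quick way to run this kind of test is Scrapy's interactive shell, pointed at the same list page and reusing the selectors from the code above. This is only an illustrative session, not part of the final spider:

# Start the shell against the news list page:
#   scrapy shell "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1"
# Then, inside the shell, check that the selectors match what we expect:
boxes = response.xpath("//div[@class='newsinfo_box cf']")
print(len(boxes))  # number of news entries on this page
for box in boxes:
    # absolute URL of each news details page
    print(response.urljoin(box.xpath("div[@class='news_c fr']/h3/a/@href").extract_first()))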
3.2 Go to the news details page and crawl the required data (mainly news content)
Now I have a set of URLs, and I need to visit each one to capture the title, time, and content I need. The implementation is quite simple: once the original code has extracted a URL, I just request that URL and capture the corresponding data. So I only need to write an extra method that captures the news details page and call it with scrapy.Request.
Write code
# Capture method
def parse_dir_contents(self, response):
    item = GgglxyItem()
    item['date'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract_first()
    item['href'] = response.url  # URL of the news details page
    item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract_first()
    data = response.xpath("//div[@class='detail_zy_c pb30 mb30']")
    item['content'] = data[0].xpath('string(.)').extract()[0]
    yield item
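The method above fills a GgglxyItem, which is defined in the project's items.py (not shown in this post). A minimal sketch of that item class, with the field names inferred from the spider code, might look like this:

# items.py -- minimal sketch; field names inferred from the spider above
import scrapy

class GgglxyItem(scrapy.Item):
    date = scrapy.Field()     # publication date of the article
    href = scrapy.Field()     # URL of the news details page
    title = scrapy.Field()    # article title
    content = scrapy.Field()  # full article text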
After integrating it with the original code, we have:
import scrapy
from ggglxy.items import GgglxyItem

class News2Spider(scrapy.Spider):
    name = "news_info_2"
    start_urls = [
        "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1",
    ]

    def parse(self, response):
        for href in response.xpath("//div[@class='newsinfo_box cf']"):
            url = response.urljoin(href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first())
            # Call the news capture method
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    # Capture method
    def parse_dir_contents(self, response):
        item = GgglxyItem()
        item['date'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract_first()
        item['href'] = response.url  # URL of the news details page
        item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract_first()
        data = response.xpath("//div[@class='detail_zy_c pb30 mb30']")
        item['content'] = data[0].xpath('string(.)').extract()[0]
        yield item
Test, pass!
Then we add a loop:
NEXT_PAGE_NUM = 1

NEXT_PAGE_NUM = NEXT_PAGE_NUM + 1
if NEXT_PAGE_NUM < 11:
    next_url = 'http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=%s' % NEXT_PAGE_NUM
    yield scrapy.Request(next_url, callback=self.parse)
Add to the original code:
import scrapy
from ggglxy.items import GgglxyItem

NEXT_PAGE_NUM = 1

class News2Spider(scrapy.Spider):
    name = "news_info_2"
    start_urls = [
        "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1",
    ]

    def parse(self, response):
        for href in response.xpath("//div[@class='newsinfo_box cf']"):
            URL = response.urljoin(href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first())
            yield scrapy.Request(URL, callback=self.parse_dir_contents)
        global NEXT_PAGE_NUM
        NEXT_PAGE_NUM = NEXT_PAGE_NUM + 1
        if NEXT_PAGE_NUM < 11:
            next_url = 'http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=%s' % NEXT_PAGE_NUM
            yield scrapy.Request(next_url, callback=self.parse)

    def parse_dir_contents(self, response):
        item = GgglxyItem()
        item['date'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract_first()
        item['href'] = response.url  # URL of the news details page
        item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract_first()
        data = response.xpath("//div[@class='detail_zy_c pb30 mb30']")
        item['content'] = data[0].xpath('string(.)').extract()[0]
        yield item
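The global NEXT_PAGE_NUM counter works because the site has a fixed number of list pages, but it makes the spider stateful. As an aside, a possible alternative (only a sketch under the same URL pattern, not the code used in this post) is to carry the page number on the request itself via Request.meta; the counter version above is the one tested below.

# Sketch: pagination carried on the request instead of a module-level global.
# This would replace the parse method of News2Spider above; the detail-link
# extraction is unchanged.
def parse(self, response):
    page = response.meta.get('page', 1)  # current list-page number, defaults to 1
    for href in response.xpath("//div[@class='newsinfo_box cf']"):
        url = response.urljoin(href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first())
        yield scrapy.Request(url, callback=self.parse_dir_contents)
    if page < 10:  # the site has 10 list pages, as in the counter version
        next_url = 'http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=%s' % (page + 1)
        yield scrapy.Request(next_url, callback=self.parse, meta={'page': page + 1})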
Test:
The number of captured items is 191, but the official website shows 193 published news articles: two are missing.
Why? We notice two errors in the crawl log.
Locating the problem: it turns out that there are two hidden second-level columns in the school's news section.
For example, some entries on the news list do not link to ordinary articles but to these second-level column pages, and their URLs follow a different pattern from the article URLs. No wonder they could not be captured!
So we have to set a special rule for the URLs of these two second-level columns. We only need to add a check for whether a link points to one of them (their URLs contain a 'type' parameter) and, if it does, parse it as another list page:
if URL.find('type') != -1:
    yield scrapy.Request(URL, callback=self.parse)
Assembled into the original code:
import scrapy
from ggglxy.items import GgglxyItem

NEXT_PAGE_NUM = 1

class News2Spider(scrapy.Spider):
    name = "news_info_2"
    start_urls = [
        "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1",
    ]

    def parse(self, response):
        for href in response.xpath("//div[@class='newsinfo_box cf']"):
            URL = response.urljoin(href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first())
            if URL.find('type') != -1:
                # Link to a second-level column: parse it as another list page
                yield scrapy.Request(URL, callback=self.parse)
            # For ordinary links this requests the details page; for a column URL
            # the duplicate request is dropped by Scrapy's default dupefilter
            yield scrapy.Request(URL, callback=self.parse_dir_contents)
        global NEXT_PAGE_NUM
        NEXT_PAGE_NUM = NEXT_PAGE_NUM + 1
        if NEXT_PAGE_NUM < 11:
            next_url = 'http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=%s' % NEXT_PAGE_NUM
            yield scrapy.Request(next_url, callback=self.parse)

    def parse_dir_contents(self, response):
        item = GgglxyItem()
        item['date'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract_first()
        item['href'] = response.url  # URL of the news details page
        item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract_first()
        data = response.xpath("//div[@class='detail_zy_c pb30 mb30']")
        item['content'] = data[0].xpath('string(.)').extract()[0]
        yield item
Test:
We find that the number of captured items has increased from the previous 191 to 238, and there are no errors in the log, which shows that our crawling rules work!
4. Obtain the captured data
scrapy crawl news_info_2 -o 0016.json
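This exports every yielded item to 0016.json as a JSON array. As a quick sanity check (a sketch assuming Scrapy's default JSON feed export and the field names used above), the file can be loaded back in Python:

# Quick check of the exported items
import json

with open('0016.json', encoding='utf-8') as f:
    items = json.load(f)

print(len(items))          # total number of captured news articles
print(items[0]['title'])   # title of the first article
print(items[0]['date'])    # its publication date
print(items[0]['href'])    # URL of its details page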