We want to crawl the recommended news from the Toutiao ("Today's Headlines") homepage at https://www.toutiao.com/. Opening the URL shows the following page:
If you view the page source, you will find that it is almost entirely JavaScript, which shows that Toutiao's content is generated dynamically by JS.
Open the Firefox developer tools (F12) and inspect the network requests:
This reveals the API endpoint for Toutiao's featured news: https://www.toutiao.com/api/pc/focus/
Requesting this address directly returns the data; the response format is JSON.
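As a sketch of what the spider will do with this response, the snippet below parses a hypothetical sample payload that mirrors the structure described later in this article (a `data` field containing a `pc_feed_focus` list, whose items carry `title` and `display_url`); the sample values are illustrative, not real API output.

```python
import json

# Hypothetical sample mirroring the structure of the
# https://www.toutiao.com/api/pc/focus/ response described in this article.
sample = '''
{
  "data": {
    "pc_feed_focus": [
      {"title": "Example headline", "display_url": "/group/6574248586484122126/"}
    ]
  }
}
'''

payload = json.loads(sample)
for item in payload["data"]["pc_feed_focus"]:
    # Prepend the scheme and host to build a full article link
    link = "https://www.toutiao.com" + item["display_url"]
    print(item["title"], link)
```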
We use Scrapy + Selenium + PhantomJS to fetch Toutiao's recommended news.
The core Scrapy code lives in spiders/toutiao_example.py:
```python
# -*- coding: utf-8 -*-
import scrapy
import json
import time
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities


class ToutiaoExampleSpider(scrapy.Spider):
    name = 'toutiao_example'
    allowed_domains = ['toutiao.com']
    start_urls = ['https://www.toutiao.com/api/pc/focus/']  # API endpoint for Toutiao's featured news

    def parse(self, response):
        conten_json = json.loads(response.text)  # parse the JSON response
        # The 'data' field contains 'pc_feed_focus', whose items hold
        # each news entry's title, link URL, and other information.
        conten_news = conten_json['data']
        for aa in conten_news['pc_feed_focus']:
            title = aa['title']
            # Writing 'www.toutiao.com' + aa['display_url'] raises an error;
            # with the https:// prefix added, it works.
            link_url = 'https://www.toutiao.com' + aa['display_url']
            # Pasting https://www.toutiao.com/group/6574248586484122126/ into a
            # browser redirects to https://www.toutiao.com/a6574248586484122126/,
            # so we replace 'group/' with 'a'.
            link_url_new = link_url.replace('group/', 'a')
            yield scrapy.Request(link_url_new, callback=self.next_parse)

    def next_parse(self, response):
        dcap = dict(DesiredCapabilities.PHANTOMJS)
        # Set the User-Agent; adjust the browser information as needed.
        dcap['phantomjs.page.settings.userAgent'] = (
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) '
            'Gecko/20100101 Firefox/25.0'
        )
        # On Windows you may need to pass the executable path, e.g.
        # webdriver.PhantomJS(r"D:\phantomjs-2.1.1-windows\bin\phantomjs.exe", ...)
        driver = webdriver.PhantomJS(desired_capabilities=dcap)
        # driver.set_page_load_timeout(5)  # optionally set a page-load timeout
        driver.get(response.url)  # request the page in the headless browser
        time.sleep(3)  # wait 3 seconds for all data to load
        # Locate elements by class name; .text returns the element's text content.
        title = driver.find_element_by_class_name('title').text  # the article title
        content1 = driver.find_element_by_class_name('abstract-index').text
        content2 = driver.find_element_by_class_name('abstract').text
        content = content1 + content2  # the article content
        print(title, content)
        driver.close()  # close the browser
        # data = driver.page_source           # get the page source
        # driver.save_screenshot('1.jpg')     # save a screenshot
```
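As a quick sanity check, the group/ → a link rewrite used in `parse` can be verified on its own (the article ID below is the example from the comment above):

```python
# Rewrite a /group/<id>/ link into the /a<id>/ form that the browser
# redirects to, exactly as done in the spider's parse method.
link_url = "https://www.toutiao.com/group/6574248586484122126/"
link_url_new = link_url.replace("group/", "a")
print(link_url_new)  # https://www.toutiao.com/a6574248586484122126/
```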
Running the code, we get the results presented as title plus content, like this:
Using Scrapy to crawl the featured news on Toutiao's homepage (Scrapy + Selenium + PhantomJS)