We want to crawl the recommended news from the Toutiao ("Today's Headlines") homepage at https://www.toutiao.com/. Opening the URL shows the following page:
If you view the page source, you will find that it is almost entirely JavaScript, which shows that Toutiao's content is generated dynamically by JS.
Open the Firefox developer tools (F12) and inspect the network requests:
This reveals the API endpoint for Toutiao's featured news: https://www.toutiao.com/api/pc/focus/
Requesting this address directly returns the data; the response format is JSON.
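As a sketch of what the spider will do with this response, the snippet below parses a hypothetical sample payload that mirrors the structure described later in this article (a `data` field containing a `pc_feed_focus` list, whose items carry `title` and `display_url`); the sample values are illustrative, not real API output.

```python
import json

# Hypothetical sample mirroring the structure of the
# https://www.toutiao.com/api/pc/focus/ response described in this article.
sample = '''
{
  "data": {
    "pc_feed_focus": [
      {"title": "Example headline", "display_url": "/group/6574248586484122126/"}
    ]
  }
}
'''

payload = json.loads(sample)
for item in payload["data"]["pc_feed_focus"]:
    # Prepend the scheme and host to build a full article link
    link = "https://www.toutiao.com" + item["display_url"]
    print(item["title"], link)
```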
We use Scrapy + Selenium + PhantomJS to fetch Toutiao's recommended news.
The core Scrapy code lives in spiders/toutiao_example.py:
```python
# -*- coding: utf-8 -*-
import scrapy
import json
import time
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities


class ToutiaoExampleSpider(scrapy.Spider):
    name = 'toutiao_example'
    allowed_domains = ['toutiao.com']
    start_urls = ['https://www.toutiao.com/api/pc/focus/']  # API endpoint for Toutiao's featured news

    def parse(self, response):
        conten_json = json.loads(response.text)  # parse the JSON response
        # The 'data' field contains 'pc_feed_focus', whose items hold
        # each news entry's title, link URL, and other information.
        conten_news = conten_json['data']
        for aa in conten_news['pc_feed_focus']:
            title = aa['title']
            # Writing 'www.toutiao.com' + aa['display_url'] raises an error;
            # with the https:// prefix added, it works.
            link_url = 'https://www.toutiao.com' + aa['display_url']
            # Pasting https://www.toutiao.com/group/6574248586484122126/ into a
            # browser redirects to https://www.toutiao.com/a6574248586484122126/,
            # so we replace 'group/' with 'a'.
            link_url_new = link_url.replace('group/', 'a')
            yield scrapy.Request(link_url_new, callback=self.next_parse)

    def next_parse(self, response):
        dcap = dict(DesiredCapabilities.PHANTOMJS)
        # Set the User-Agent; adjust the browser information as needed.
        dcap['phantomjs.page.settings.userAgent'] = (
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) '
            'Gecko/20100101 Firefox/25.0'
        )
        # On Windows you may need to pass the executable path, e.g.
        # webdriver.PhantomJS(r"D:\phantomjs-2.1.1-windows\bin\phantomjs.exe", ...)
        driver = webdriver.PhantomJS(desired_capabilities=dcap)
        # driver.set_page_load_timeout(5)  # optionally set a page-load timeout
        driver.get(response.url)  # request the page in the headless browser
        time.sleep(3)  # wait 3 seconds for all data to load
        # Locate elements by class name; .text returns the element's text content.
        title = driver.find_element_by_class_name('title').text  # the article title
        content1 = driver.find_element_by_class_name('abstract-index').text
        content2 = driver.find_element_by_class_name('abstract').text
        content = content1 + content2  # the article content
        print(title, content)
        driver.close()  # close the browser
        # data = driver.page_source           # get the page source
        # driver.save_screenshot('1.jpg')     # save a screenshot
```
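As a quick sanity check, the group/ → a link rewrite used in `parse` can be verified on its own (the article ID below is the example from the comment above):

```python
# Rewrite a /group/<id>/ link into the /a<id>/ form that the browser
# redirects to, exactly as done in the spider's parse method.
link_url = "https://www.toutiao.com/group/6574248586484122126/"
link_url_new = link_url.replace("group/", "a")
print(link_url_new)  # https://www.toutiao.com/a6574248586484122126/
```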
Running the code, we get the results presented as title plus content, like this:
Using Scrapy to crawl the featured news on Toutiao's homepage (Scrapy + Selenium + PhantomJS)