Using a Scrapy crawler to scrape the featured news on the Toutiao homepage (Scrapy + Selenium + PhantomJS)


The goal is to crawl the recommended news on the Toutiao homepage, https://www.toutiao.com/. Open that URL in a browser to see the page.

View the page source and you will find that it is almost entirely JavaScript, which shows that Toutiao's content is generated dynamically by JS.
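As a quick sanity check (a minimal sketch using the requests library; the User-Agent value is just a placeholder), you can fetch the homepage with a plain HTTP client and confirm that the raw response is dominated by script tags rather than article markup:

import requests

# Fetch the homepage without executing any JavaScript
html = requests.get(
    'https://www.toutiao.com/',
    headers={'User-Agent': 'Mozilla/5.0'},  # placeholder desktop User-Agent
).text

# The news items themselves are not in the static HTML; they only
# appear after the page's JS runs.
print(html.count('<script'))
print(len(html))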

Open Firefox's developer tools (F12) and inspect the network requests. There you will find the interface address for Toutiao's featured news: https://www.toutiao.com/api/pc/focus/

Requesting this address on its own shows that the interface returns JSON data.
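Before wiring this into Scrapy, you can inspect the shape of that JSON with a short script. This is only a sketch; the 'data' and 'pc_feed_focus' field names are the ones the spider below relies on:

import requests

resp = requests.get(
    'https://www.toutiao.com/api/pc/focus/',
    headers={'User-Agent': 'Mozilla/5.0'},  # placeholder User-Agent
)
payload = resp.json()

# payload['data']['pc_feed_focus'] is a list of featured stories,
# each carrying at least a 'title' and a 'display_url'.
for item in payload['data']['pc_feed_focus']:
    print(item['title'], item['display_url'])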

We use Scrapy + Selenium + PhantomJS to fetch Toutiao's featured news.

The core code of the Scrapy project is the spider below, located at spiders/toutiao_example.py:

# -*- coding: utf-8 -*-
import json
import time

import scrapy
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities


class ToutiaoExampleSpider(scrapy.Spider):
    name = 'toutiao_example'
    allowed_domains = ['toutiao.com']
    start_urls = ['https://www.toutiao.com/api/pc/focus/']  # API endpoint for Toutiao's featured news

    def parse(self, response):
        conten_json = json.loads(response.text)  # parse the JSON data
        # The 'data' field contains 'pc_feed_focus', whose items hold each
        # story's title, link URL, and other information.
        conten_news = conten_json['data']
        for aa in conten_news['pc_feed_focus']:
            title = aa['title']
            # Writing 'www.toutiao.com' + aa['display_url'] raises an error;
            # adding the 'https://' prefix does not.
            link_url = 'https://www.toutiao.com' + aa['display_url']
            # Opening a link such as
            # https://www.toutiao.com/group/6574248586484122126/ in a browser
            # redirects to https://www.toutiao.com/a6574248586484122126/,
            # so replace 'group/' with 'a'.
            link_url_new = link_url.replace('group/', 'a')
            yield scrapy.Request(link_url_new, callback=self.next_parse)

    def next_parse(self, response):
        dcap = dict(DesiredCapabilities.PHANTOMJS)
        # Set the User-Agent; adjust the browser details as needed.
        dcap['phantomjs.page.settings.userAgent'] = (
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) '
            'Gecko/20100101 Firefox/25.0')
        # On Windows, pass the binary path as the first argument, e.g.
        # webdriver.PhantomJS(r"D:\phantomjs-2.1.1-windows\bin\phantomjs.exe", ...)
        driver = webdriver.PhantomJS(desired_capabilities=dcap)  # apply the browser settings
        # driver.set_page_load_timeout(5)  # optionally set a page-load timeout
        driver.get(response.url)  # request the page through the browser
        time.sleep(3)  # wait 3 seconds so all data can load
        # Locate elements by class name; .text returns the element's text.
        title = driver.find_element_by_class_name('title').text  # the headline
        content1 = driver.find_element_by_class_name('abstract-index').text
        content2 = driver.find_element_by_class_name('abstract').text
        content = content1 + content2  # the article content
        print(title, content, 6666666666666666)
        driver.close()  # close the browser
        # data = driver.page_source  # get the page source
        # driver.save_screenshot('1.jpg')  # save a screenshot
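Assuming a standard Scrapy project layout and a PhantomJS binary on your PATH, the spider is run from the project root with:

scrapy crawl toutiao_example

Note that this code targets Selenium 3.x: both webdriver.PhantomJS and the find_element_by_class_name helpers were removed in Selenium 4, so with a newer Selenium you would need a headless Chrome or Firefox driver and the find_element(By.CLASS_NAME, ...) API instead.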

Running the code, each result is printed as the title followed by the article content.

