Toutiao (Today's Headlines) is a data-driven site: the page content arrives through a data interface and is then styled with CSS, so crawling it is different from crawling an ordinary web page. What we need to fetch is the JSON data the interface returns. First, look at the source structure of a Toutiao page. We want to grab each article's title and the image links on its details page. Let's try it:
Looking at the source above, the HTML we crawled is of no use, so let's look at the data the page loads in the background:
All the data is delivered as JSON in the background, so we need to crawl it through the interface.
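Requesting the interface directly could look roughly like the sketch below. It uses only the standard library. The helper names `build_url` and `fetch_json` are made up for illustration, and the interface address and its query parameters are not given here: they are assumed to be copied from the request shown in the browser's network panel.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_url(base_url, params=None):
    """Append query parameters (if any) to the interface address."""
    return f"{base_url}?{urlencode(params)}" if params else base_url


def fetch_json(base_url, params=None, timeout=10):
    """Request the interface and decode the JSON body into Python objects."""
    # Assumption: a browser-like User-Agent; some sites reject the default one.
    request = Request(build_url(base_url, params),
                      headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

The caller passes in whatever address and parameters the site actually uses; nothing about the real interface is hard-coded in the sketch.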
Extracting the JSON data from the web page
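As a minimal sketch of the extraction step, the function below pulls titles and image links out of an already decoded response. The key names `data`, `title`, `image_list` and `url` are assumptions for illustration, as is the sample string; check them against the JSON the interface really returns.

```python
import json


def extract_articles(payload):
    """Return (title, image_urls) pairs from a decoded interface response."""
    articles = []
    # Assumed key names: "data", "title", "image_list", "url".
    for item in payload.get("data") or []:
        title = item.get("title")
        if not title:
            continue  # skip entries that carry no title, such as ads
        images = [img["url"] for img in item.get("image_list") or []
                  if img.get("url")]
        articles.append((title, images))
    return articles


# Made-up sample response, only to show the shape the function expects.
raw = ('{"data": [{"title": "Example", '
       '"image_list": [{"url": "http://example.com/1.jpg"}]}, {"ad": true}]}')
print(extract_articles(json.loads(raw)))
# prints [('Example', ['http://example.com/1.jpg'])]
```

Using `.get()` with a fallback keeps the loop from failing when an entry lacks one of the expected keys.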
Run the function and check the result. If you want to crawl a lot of pages, remember to enable multiprocessing and store the data in a database:
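One way to combine multiprocessing with database storage is sketched below. It is not the original code: `sqlite3` stands in for whichever database is actually used, `fetch_page` is a placeholder worker that returns dummy rows, and the table layout and the `toutiao.db` file name are assumptions.

```python
import sqlite3
from multiprocessing import Pool


def fetch_page(offset):
    """Placeholder worker: return the rows for one page of the interface.

    Hypothetical body; replace it with a real request plus JSON parsing
    that returns a list of (title, image_url) tuples.
    """
    return [(f"title at offset {offset}", f"http://example.com/{offset}.jpg")]


def save_rows(conn, rows):
    """Insert (title, image_url) rows, creating the table on first use."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles (title TEXT, image_url TEXT)"
    )
    conn.executemany("INSERT INTO articles VALUES (?, ?)", rows)
    conn.commit()


def crawl_many(offsets, db_path="toutiao.db", processes=4):
    """Fetch pages in worker processes, then write them from the parent."""
    with Pool(processes) as pool:
        pages = pool.map(fetch_page, offsets)
    conn = sqlite3.connect(db_path)
    try:
        for rows in pages:
            save_rows(conn, rows)
    finally:
        conn.close()
```

A call such as `crawl_many(range(0, 100, 20))` should be placed under an `if __name__ == "__main__":` guard so that worker processes do not re-run it on import. Writing from the parent process only avoids several processes contending for the same database file.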
Look at the results:
Summary: Many of the Toutiao crawlers found online first fetch the list page, collect each article's URL, and then visit and crawl every details page. Toutiao, however, works differently: the interface data behind the list page already carries the details-page data, and clicking through to an article simply passes that data to the details page's template. Reading the data straight from the interface therefore saves time and reduces the amount of code.
Site Crawling, Case Three: Crawling Toutiao (Today's Headlines) via Ajax-loaded JSON data