Crawler "6": Parsing Ajax content, with Toutiao (Today's Headlines) galleries as the example

Source: Internet
Author: User
Tags urlencode

Ajax Technology

AJAX stands for Asynchronous JavaScript and XML.
Ajax is not a new programming language but a new way of using existing standards. It is not really that new either: Microsoft had already invented the key Ajax technologies around 1997, but they were never widely promoted. Ajax only became popular with the widespread use of Google Earth, Google Suggest, and Gmail.
The biggest advantage of Ajax is that without reloading the entire page, you can exchange data with the server and update portions of the Web page.
Ajax does not require any browser plugins, but requires the user to allow JavaScript execution.

Application of Ajax

Use XHTML + CSS to present information;
Use JavaScript to manipulate the DOM (Document Object Model) for dynamic effects;
Use XML and XSLT to manipulate data;
Use XMLHttpRequest or the newer Fetch API to exchange data asynchronously with the web server.
Note: Ajax is distinct from RIA technologies such as Flash, Silverlight, and Java applets.

Many applications use Ajax, for example Sina Weibo, Google Maps, and Toutiao (Today's Headlines).
Today we use Toutiao to see how Ajax-loaded content can be analyzed and how to crawl sites of this kind.

Toutiao's search feature

Toutiao would not open for the past few days, presumably because regulatory oversight has been strict; many news services were down ...
Today it finally opens again, so let's quickly go through the Ajax material.
"Insert images, search features for today's headlines"

As the image above shows, searching Toutiao for a keyword returns a lot of content under four tabs: General, Video, Gallery (the "atlas"), and User. Today we focus on the Gallery tab; how to select it in a request will become clear later.

Source Code for index page

Let's say we search for the keyword "football" (haha, not "beauties" ...) and look at the source code of the resulting page.
"Insert picture, index source code"

Apart from some basic HTML tags, the source is all JavaScript; it contains none of the URLs or gallery information we want.
As mentioned above, Toutiao loads its content with Ajax. Given how Ajax works, the data must be sent from the server to our browser at some point, otherwise the page could not display these galleries.
So where is this data?

Where is the Ajax-loaded data?

Open the browser's developer tools, select the Network tab, and filter by XHR; the following files appear.
"Insert picture, how to open Ajax loaded content"

Looking at the types of these files, they are all JSON.
Look at the search_content requests: apart from the offset value, which keeps changing, everything else is the same. Since we have scrolled through several pages and each page shows exactly 20 items, it is easy to see what offset is: the parameter that controls which page is loaded.
Let's look at the headers of one search_content request:
"Insert Picture, message header of JSON"

This is a GET request, so we can use the get method of the requests library directly to fetch the JSON. But what goes into the URL?
Let's look at the parameters, especially the last one, cur_tab, which is set to 3: as mentioned above, 3 selects the gallery tab, 1 is general, and 2 is video.
By changing only the offset parameter we can fetch multiple pages, 20 items per page.
Let's take a look at the response:
"Insert Picture, JSON response information"

Because the content is JSON, it consists of key: value pairs. We mainly care about the 20 entries under data: each entry contains an article_url field, which is the URL of one gallery, and we can open the specific gallery through it.

That is all for the site analysis today. Now let's look at the code that fetches the URL of each gallery.

1. Get the JSON content of index page
    import requests
    from urllib.parse import urlencode

    def get_page_index(offset):
        # cur_tab must be set correctly: 3 selects the gallery tab, this is important
        data = {
            'offset': offset,
            'format': 'json',
            'keyword': 'football',
            'autoload': 'true',
            'count': '20',
            'cur_tab': '3'
        }
        url = 'https://www.toutiao.com/search_content/?' + urlencode(data)
        try:
            response = requests.get(url)
            if response.status_code == 200:
                # print(response.text)
                return response.text
            else:
                return None
        except Exception:
            print('request index page error!')
            return None

We take offset as a parameter so that we can control which page to fetch; this replaces scrolling down in the browser.
data holds the query parameters of the GET request; we write them as a dictionary and encode them with urlencode.
The request went through without any additional headers.
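To make the roles of offset and urlencode concrete, here is a minimal sketch that only builds the request URLs for the first few pages and sends nothing. The helper name build_search_url, the TAB_IDS mapping, and the page_size default are assumptions made for this illustration; only the parameter names and the tab numbers 1, 2, 3 come from the analysis above.

```python
from urllib.parse import urlencode

# Tab numbers as described in the article; the names are labels chosen for this sketch.
TAB_IDS = {'general': 1, 'video': 2, 'gallery': 3}

BASE_URL = 'https://www.toutiao.com/search_content/?'


def build_search_url(keyword, page, tab='gallery', page_size=20):
    """Return the search_content URL for a zero-based page number (hypothetical helper)."""
    params = {
        'offset': page * page_size,   # 0, 20, 40, ... one step per page
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': str(page_size),
        'cur_tab': str(TAB_IDS[tab]),
    }
    # urlencode turns the dict into "key=value&key=value", escaping where needed
    return BASE_URL + urlencode(params)


if __name__ == '__main__':
    for page in range(3):
        print(build_search_url('football', page))
```

Keeping the URL construction separate from the request makes it easy to check the query string by eye before any traffic is sent.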

2. Parse the JSON content
    import json

    def parse_page_index(html):
        data = json.loads(html)
        result = []
        # only parse when the response actually contains a 'data' key
        if data and 'data' in data.keys():
            for item in data.get('data'):
                article_url = item.get('article_url')
                # keep only URLs that contain "group", i.e. gallery pages
                if article_url and ('group' in article_url):
                    result.append(article_url)
        return result

Because we are parsing JSON, we import the json library.
We want the article_url values under the data key of the JSON, so we add some filtering: parsing only happens if the response contains the 'data' key.
Since we want galleries, and some of the returned URLs are still irregular even with cur_tab set to 3, we require the URL to contain the string group, which marks a viewable gallery.
Each matching URL is then added to the result list.
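The filtering rules can be tried offline on a hand-written response. The sketch below is an illustration under assumptions: SAMPLE_RESPONSE and its example.com URLs are made up, the real response carries many more fields per entry, and extract_gallery_urls is a hypothetical stand-alone variant of the parsing step rather than the function used in this post.

```python
import json

# Hand-written sample; the real response has many more fields per item.
SAMPLE_RESPONSE = json.dumps({
    'data': [
        {'article_url': 'https://example.com/group/1001/'},
        {'article_url': 'https://example.com/item/2002/'},   # no "group": skipped
        {'title': 'entry without article_url'},               # missing key: skipped
        {'article_url': 'https://example.com/group/3003/'},
    ]
})


def extract_gallery_urls(text):
    """Return the article_url values that contain 'group' from a JSON string."""
    payload = json.loads(text)
    urls = []
    # "or []" covers both a missing 'data' key and an explicit null
    for item in payload.get('data') or []:
        url = item.get('article_url')
        if url and 'group' in url:
            urls.append(url)
    return urls


if __name__ == '__main__':
    for url in extract_gallery_urls(SAMPLE_RESPONSE):
        print(url)
```

Only the two entries whose URL contains group survive; the entry with a different path and the entry without article_url are dropped.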

3. Run with multiple processes
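    from multiprocessing import Pool

    def main(offset):
        html = get_page_index(offset)
        # get_page_index returns None on failure; skip the page in that case
        if html is None:
            return
        for url in parse_page_index(html):
            print(url)

    if __name__ == '__main__':
        p = Pool()
        # offsets 0, 20, 40: the first three pages
        p.map(main, [i * 20 for i in range(3)])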

We try the first 3 pages to start with. Using multiple processes can make this faster; the code is short, but the idea is worth mastering.
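One possible variation, sketched here under assumptions, is to have each worker return its URLs instead of printing them, so the parent process can gather them into a single list. fake_fetch is a made-up stand-in that needs no network access, and collect is a hypothetical helper; neither is part of the original script.

```python
from multiprocessing import Pool


def fake_fetch(offset):
    """Stand-in for the fetch-and-parse step: returns made-up URLs for one page."""
    return ['https://example.com/group/%d/' % (offset + i) for i in range(2)]


def collect(offsets, processes=2):
    """Run fake_fetch over the offsets in a process pool and flatten the results."""
    with Pool(processes) as pool:
        # Pool.map returns one result per input, in the same order as the inputs
        pages = pool.map(fake_fetch, offsets)
    return [url for page in pages for url in page]


if __name__ == '__main__':
    for url in collect([i * 20 for i in range(3)]):
        print(url)
```

With the real fetch and parse functions in place of fake_fetch, the same pattern would yield the gallery URLs page by page.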
"Insert Picture, url result"

OK, that's it for today. Tomorrow we'll talk about how to get the pictures behind these URLs.
