Python web crawler (iii)

Source: Internet
Author: User
Tags: python, web crawler

Ajax Learning

Ajax stands for Asynchronous JavaScript and XML. In plain terms, Ajax is a technique for creating fast, dynamic web pages: by exchanging small amounts of data with the server in the background, parts of a page can be updated without reloading the entire page.

Sending a request and receiving the server's response

To send a request, use the open() and send() methods of the XMLHttpRequest object.

Method: Description
open(method, url, async): Specifies the type of request, the URL, and whether the request should be handled asynchronously. method: the request type, GET or POST; url: the location of the file on the server; async: true (asynchronous) or false (synchronous)
send(string): Sends the request to the server. string: used only for POST requests
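For comparison, the same kind of request can be prepared in Python. Below is a minimal sketch using the standard library's urllib.request; the URL and form body are placeholder values chosen to mirror this article's example, not a real endpoint:

```python
import urllib.request

# Placeholder values for illustration only; not a real server.
url = "http://example.com/ajax/demo.asp"
data = "Fname=bill&lname=gates".encode("ascii")  # request body, used only for POST

# Building the Request object plays the role of open(method, url, async)
# plus send(string): supplying a body makes urllib treat it as a POST.
req = urllib.request.Request(
    url,
    data=data,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

# The request is not sent here; urllib.request.urlopen(req) would send it.
print(req.get_method())  # POST, because data is set
```

Unlike the browser's XMLHttpRequest, a Python crawler has no asynchronous/synchronous switch to worry about here: urlopen() simply blocks until the response arrives.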

To read the server's response, use the responseText or responseXML property of the XMLHttpRequest object.

Property: Description
responseText: Gets the response data as a string.
responseXML: Gets the response data as XML.
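The Python analogue of these two properties is reading the response body as a string versus parsing it into a document tree. A small sketch with the standard library, using an invented response body in place of a real server reply:

```python
import xml.etree.ElementTree as ET

# Made-up response body for illustration; a real one would come from the server.
body = "<note><to>Bill</to><from>Steve</from></note>"

# responseText analogue: the raw response as a string.
text = body

# responseXML analogue: the response parsed into an XML tree.
root = ET.fromstring(body)
print(root.find("to").text)  # Bill
```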

Here is a simple example:

<html>
<head>
<script type="text/javascript">
function loadXMLDoc() {
  var xmlhttp;
  if (window.XMLHttpRequest) {
    xmlhttp = new XMLHttpRequest();
  }
  xmlhttp.onreadystatechange = function() {
    if (xmlhttp.readyState == 4 && xmlhttp.status == 200) {
      document.getElementById("myDiv").innerHTML = xmlhttp.responseText;
    }
    if (xmlhttp.status == 404) {
      document.write("The file doesn't exist.");
    }
  };
  xmlhttp.open("POST", "/ajax/demo.asp", true);
  xmlhttp.setRequestHeader("Content-type", "application/x-www-form-urlencoded");
  xmlhttp.send("Fname=bill&lname=gates");
}
</script>
</head>
<body>
<h2>One of my test cases</h2>
<button type="button" onclick="loadXMLDoc()">Make a request</button>
<div id="myDiv"></div>
</body>
</html>

You can try this usage out on the W3Schools test site.
In the code, xmlhttp.onreadystatechange = function() {...} assigns an event handler: whenever readyState changes, the onreadystatechange event fires automatically and the function is called. The following are three important properties of the XMLHttpRequest object:

Property: Description
onreadystatechange: Holds a function (or function name) to be called whenever the readyState property changes.
readyState: The state of the XMLHttpRequest, varying from 0 to 4. 0: request not initialized; 1: server connection established; 2: request received; 3: processing request; 4: request finished and response ready
status: The HTTP status code. 200: "OK"; 404: page not found

Analyzing Ajax from the Python crawler's perspective

Let us now think about Ajax from the point of view of a Python crawler. The process can be divided into three steps: sending the request, parsing the content, and rendering the web page.

Sending the request is implemented in JavaScript: the code creates a new XMLHttpRequest object, sets a listener through the onreadystatechange property, and then sends the request with the open() and send() methods. In Python, we can obtain a response directly by issuing the request ourselves, for example with the requests library.

Parsing the content happens once the request has been sent and the server's response arrives: the onreadystatechange event fires again, and the body of the response can be read from the responseText property. This is much like using requests in Python to send a request to the server and then reading the response. The returned content may be HTML or may be JSON, and the JavaScript handler only needs to process it further.

The last step is rendering the web page. As we all know, JavaScript can change the content of a page: code like document.getElementById().innerHTML replaces the source code inside an element, so that what the page displays changes. This is an operation on the DOM, the Document Object Model of the page. In Python we likewise use a library such as BeautifulSoup to parse the page, find a node, and extract its content. Because Ajax only changes the affected nodes to whatever the server returned, the original page source can be very small; the final appearance is produced by rendering the server's responses.
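The extraction step described above can be sketched with the standard library's html.parser (the article mentions BeautifulSoup, which expresses the same idea with less code). The page snippet and the element id are invented for illustration, mirroring the div that innerHTML updates:

```python
from html.parser import HTMLParser

# Minimal parser that collects the text inside a <div> with a given id:
# the Python-side analogue of reading what innerHTML would display.
class DivTextExtractor(HTMLParser):
    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == self.target_id:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div" and self.inside:
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text += data

# Invented page source; a real crawler would feed in fetched HTML.
html = '<body><h2>Title</h2><div id="myDiv">Hello Ajax</div></body>'
parser = DivTextExtractor("myDiv")
parser.feed(html)
print(parser.text)  # Hello Ajax
```

This sketch does not handle nested divs; BeautifulSoup's find(id=...) would be the more robust choice in a real crawler.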

Ajax analysis

First, open the browser's developer tools with F12, switch to the Network tab, and refresh the page (here the author uses Firefox with Weibo open as the example). A long list of requests appears: these are records of every request sent and every response received between the browser and the server while the page was loading.

Ajax requests have the type XHR. Filtering by XHR shows the following three requests; let's open one of them and take a look:


The first of all the requests is the source code of the web page itself, which basically just lays out the nodes, with no rendered content. When new content is loaded, new Ajax requests appear; for example, the author obtained the following three request addresses:

https://m.weibo.cn/api/container/getIndex?type=uid&value=2145291155&containerid=1005052145291155
https://m.weibo.cn/api/container/getIndex?type=uid&value=2145291155&containerid=1076032145291155
https://m.weibo.cn/api/container/getIndex?type=uid&value=2145291155&containerid=1076032145291155&page=2

Each update produces a new request. So we need to simulate these Ajax requests in Python; then the data can be crawled.

Python Extracts Ajax data

First, look at the three request addresses listed above. Each contains the parameters type, value, containerid, and page (the first two addresses can also be requested with page=1 appended; the browser simply omits it). Observing these requests, their type, value, and containerid are consistent. type is always uid, and value is the number in the page's URL, which is in fact the user's ID. As for containerid, on closer inspection it is 107603 followed by the user ID. The only value that changes is page: this parameter obviously controls paging, with page=1 for the first page, page=2 for the second, and so on.

Then we only need to send requests with these parameters and fetch the JSON data.
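The steps above can be sketched as follows. The helper name build_page_url is my own; the parameter layout follows the addresses captured above (type is always uid, value is the user ID, containerid is 107603 plus the user ID, and page selects the page):

```python
from urllib.parse import urlencode

BASE = "https://m.weibo.cn/api/container/getIndex"

def build_page_url(uid, page):
    """Build the Ajax URL for one page of a user's posts."""
    params = {
        "type": "uid",
        "value": uid,
        "containerid": "107603{}".format(uid),  # 107603 + user ID, as observed above
        "page": page,  # page=1 for the first page, page=2 for the second, ...
    }
    return "{}?{}".format(BASE, urlencode(params))

# Fetching would then be e.g. requests.get(build_page_url(uid, page)).json()
# (requests is a third-party library; any HTTP client works here).
print(build_page_url("2145291155", 2))
```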
