Python web crawler (iii)

Source: Internet
Author: User
Tags: python, web crawler

Ajax Learning

Ajax stands for Asynchronous JavaScript and XML. In plain terms, Ajax is a technique for creating fast, dynamic web pages: by exchanging small amounts of data with the server in the background, parts of a page can be updated without reloading the entire page.

Sending a request and receiving the server's response

To send a request, use the open() and send() methods of the XMLHttpRequest object.

Method: Description
open(method, url, async): Specifies the type of request, the URL, and whether the request should be handled asynchronously. method: the request type, GET or POST; url: the location of the file on the server; async: true (asynchronous) or false (synchronous)
send(string): Sends the request to the server. string: used only for POST requests
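For comparison, the same kind of request can be prepared in Python. Below is a minimal sketch using the standard library's urllib.request; the URL and form body are placeholder values chosen to mirror this article's example, not a real endpoint:

```python
import urllib.request

# Placeholder values for illustration only; not a real server.
url = "http://example.com/ajax/demo.asp"
data = "Fname=bill&lname=gates".encode("ascii")  # request body, used only for POST

# Building the Request object plays the role of open(method, url, async)
# plus send(string): supplying a body makes urllib treat it as a POST.
req = urllib.request.Request(
    url,
    data=data,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

# The request is not sent here; urllib.request.urlopen(req) would send it.
print(req.get_method())  # POST, because data is set
```

Unlike the browser's XMLHttpRequest, a Python crawler has no asynchronous/synchronous switch to worry about here: urlopen() simply blocks until the response arrives.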

To read the server's response, use the responseText or responseXML property of the XMLHttpRequest object.

Property: Description
responseText: Gets the response data as a string.
responseXML: Gets the response data as XML.
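The Python analogue of these two properties is reading the response body as a string versus parsing it into a document tree. A small sketch with the standard library, using an invented response body in place of a real server reply:

```python
import xml.etree.ElementTree as ET

# Made-up response body for illustration; a real one would come from the server.
body = "<note><to>Bill</to><from>Steve</from></note>"

# responseText analogue: the raw response as a string.
text = body

# responseXML analogue: the response parsed into an XML tree.
root = ET.fromstring(body)
print(root.find("to").text)  # Bill
```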

Here is a simple example:

<html>
<head>
<script type="text/javascript">
function loadXMLDoc() {
  var xmlhttp;
  if (window.XMLHttpRequest) {
    xmlhttp = new XMLHttpRequest();
  }
  xmlhttp.onreadystatechange = function() {
    if (xmlhttp.readyState == 4 && xmlhttp.status == 200) {
      document.getElementById("myDiv").innerHTML = xmlhttp.responseText;
    }
    if (xmlhttp.status == 404) {
      document.write("The file doesn't exist.");
    }
  };
  xmlhttp.open("POST", "/ajax/demo.asp", true);
  xmlhttp.setRequestHeader("Content-type", "application/x-www-form-urlencoded");
  xmlhttp.send("Fname=bill&lname=gates");
}
</script>
</head>
<body>
<h2>One of my test cases</h2>
<button type="button" onclick="loadXMLDoc()">Make a request</button>
<div id="myDiv"></div>
</body>
</html>

You can try this usage out on the W3Schools test site.
In the code, xmlhttp.onreadystatechange = function() {...} assigns an event handler: whenever readyState changes, the onreadystatechange event fires automatically and the function is called. The following are three important properties of the XMLHttpRequest object:

Property: Description
onreadystatechange: Holds a function (or function name) to be called whenever the readyState property changes.
readyState: The state of the XMLHttpRequest, varying from 0 to 4. 0: request not initialized; 1: server connection established; 2: request received; 3: processing request; 4: request finished and response ready
status: The HTTP status code. 200: "OK"; 404: page not found

Analyzing Ajax from the Python crawler's perspective

Let us now think about Ajax from the point of view of a Python crawler. The process can be divided into three steps: sending the request, parsing the content, and rendering the web page.

Sending the request is implemented in JavaScript: the code creates a new XMLHttpRequest object, sets a listener through the onreadystatechange property, and then sends the request with the open() and send() methods. In Python, we can obtain a response directly by issuing the request ourselves, for example with the requests library.

Parsing the content happens once the request has been sent and the server's response arrives: the onreadystatechange event fires again, and the body of the response can be read from the responseText property. This is much like using requests in Python to send a request to the server and then reading the response. The returned content may be HTML or may be JSON, and the JavaScript handler only needs to process it further.

The last step is rendering the web page. As we all know, JavaScript can change the content of a page: code like document.getElementById().innerHTML replaces the source code inside an element, so that what the page displays changes. This is an operation on the DOM, the Document Object Model of the page. In Python we likewise use a library such as BeautifulSoup to parse the page, find a node, and extract its content. Because Ajax only changes the affected nodes to whatever the server returned, the original page source can be very small; the final appearance is produced by rendering the server's responses.
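The extraction step described above can be sketched with the standard library's html.parser (the article mentions BeautifulSoup, which expresses the same idea with less code). The page snippet and the element id are invented for illustration, mirroring the div that innerHTML updates:

```python
from html.parser import HTMLParser

# Minimal parser that collects the text inside a <div> with a given id:
# the Python-side analogue of reading what innerHTML would display.
class DivTextExtractor(HTMLParser):
    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == self.target_id:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div" and self.inside:
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text += data

# Invented page source; a real crawler would feed in fetched HTML.
html = '<body><h2>Title</h2><div id="myDiv">Hello Ajax</div></body>'
parser = DivTextExtractor("myDiv")
parser.feed(html)
print(parser.text)  # Hello Ajax
```

This sketch does not handle nested divs; BeautifulSoup's find(id=...) would be the more robust choice in a real crawler.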

Ajax analysis

First, open the browser's developer tools with F12, switch to the Network tab, and refresh the page (here the author uses Firefox with Weibo open as the example). A long list of requests appears: these are records of every request sent and every response received between the browser and the server while the page was loading.

Ajax requests have the type XHR. Filtering by XHR shows the following three requests; let's open one of them and take a look:


The first of all the requests is the source code of the web page itself, which basically just lays out the nodes, with no rendered content. When new content is loaded, new Ajax requests appear; for example, the author obtained the following three request addresses:

https://m.weibo.cn/api/container/getIndex?type=uid&value=2145291155&containerid=1005052145291155
https://m.weibo.cn/api/container/getIndex?type=uid&value=2145291155&containerid=1076032145291155
https://m.weibo.cn/api/container/getIndex?type=uid&value=2145291155&containerid=1076032145291155&page=2

Each update produces a new request. So we need to simulate these Ajax requests in Python; then the data can be crawled.

Python Extracts Ajax data

First, look at the three request addresses listed above. Each contains the parameters type, value, containerid, and page (the first two addresses can also be requested with page=1 appended; the browser simply omits it). Observing these requests, their type, value, and containerid are consistent. type is always uid, and value is the number in the page's URL, which is in fact the user's ID. As for containerid, on closer inspection it is 107603 followed by the user ID. The only value that changes is page: this parameter obviously controls paging, with page=1 for the first page, page=2 for the second, and so on.

Then we only need to send requests with these parameters and fetch the JSON data.
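The steps above can be sketched as follows. The helper name build_page_url is my own; the parameter layout follows the addresses captured above (type is always uid, value is the user ID, containerid is 107603 plus the user ID, and page selects the page):

```python
from urllib.parse import urlencode

BASE = "https://m.weibo.cn/api/container/getIndex"

def build_page_url(uid, page):
    """Build the Ajax URL for one page of a user's posts."""
    params = {
        "type": "uid",
        "value": uid,
        "containerid": "107603{}".format(uid),  # 107603 + user ID, as observed above
        "page": page,  # page=1 for the first page, page=2 for the second, ...
    }
    return "{}?{}".format(BASE, urlencode(params))

# Fetching would then be e.g. requests.get(build_page_url(uid, page)).json()
# (requests is a third-party library; any HTTP client works here).
print(build_page_url("2145291155", 2))
```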
