Crawling dynamic content with a spider (pages reached through JavaScript)

Source: Internet
Author: User
Tags: set cookie
I'm a PHP beginner. Following links is not difficult when writing a crawler, but that approach is useless on a dynamic page.

Maybe I could analyze the protocol (but how?), or simulate the execution of the JavaScript (but how would I do that?), ...

Also, writing a general-purpose spider that crawls AJAX pages may be complicated. I have not heard of one, or of any relevant open-source project.

The problem is described below:

For example, take the "next page" link on a page (the ajax function includes a step that places the data returned for the requested URL into the content element):

Javascript: Next page

The corresponding JavaScript code may be:

    function Down(index) {
        $("#pageindex").val(parseInt(index) + 1);
        ajaxpage(parseInt(index) + 1);
    }

    function ajaxpage(index) {
        $.ajax({
            type: "post",
            url: "class.aspx",
            data: "Option=select&cid=" + $("#classid").val() +
                  "&asc=" + $("#orderselect>option:selected").val() +
                  "&keyword=" + escape($("#textfield").val()) +
                  "&PI=" + index,
            // On success, the returned fragment replaces the content element.
            success: function (data) {
                $("#content").html(data);
            },
            error: function (data) {
                alert("Connection timed out. Try again later!");
            }
        });
    }

PS: I'm also digging through Stack Overflow and hope to make progress there, but I may get an answer faster here.

Reply content:

PHP has no such extension (at least I haven't come across one), but when I was working in Java I saw many HTML engine implementations. You can look for one. For example:

http://lobobrowser.org/cobra.jsp

I don't know PHP; I have always done my scraping in Java, so I'll describe my approach.
AJAX requests generally return data in JSON or XML format. Open the page and use Firebug to inspect the AJAX requests sent in the background, then set the same request headers in your program. Some websites also require cookies and return no data without them; the cookie values can be found with Firebug as well. With those in place, the request generally succeeds.
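To make that concrete, here is a minimal sketch in Python (not the Java the answer mentions) of rebuilding the POST that ajaxpage(index) from the question would send. The field names come from the question's JavaScript; the base URL, the User-Agent string, the cookie value and the function name build_page_request are all placeholders I made up, and the real header and cookie values would have to be copied from what Firebug shows.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_page_request(base_url, class_id, order, keyword, page_index, cookie=None):
    """Build (but do not send) the POST request that ajaxpage(index) would issue."""
    # Field names (Option, cid, asc, keyword, PI) are taken from the question's script.
    # Caveat: the page encodes the keyword with JavaScript escape(), which handles
    # non-ASCII text differently from urlencode, so such keywords may need extra care.
    body = urlencode({
        "Option": "select",
        "cid": class_id,
        "asc": order,
        "keyword": keyword,
        "PI": page_index,
    }).encode("utf-8")

    headers = {
        # jQuery's default content type for form-style POST data.
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        # jQuery adds this header to same-origin AJAX calls; some servers check for it.
        "X-Requested-With": "XMLHttpRequest",
        # Placeholder value; copy the real one from the browser if the site cares.
        "User-Agent": "Mozilla/5.0 (compatible; example-spider)",
    }
    if cookie:
        # Cookie string copied from the browser session, e.g. "name=value; other=value".
        headers["Cookie"] = cookie

    return Request(base_url + "class.aspx", data=body, headers=headers, method="POST")
```

Passing the returned object to urllib.request.urlopen would send it. If the server accepts the request, the response body should be the same fragment that the page inserts into the content element, which can then be parsed for links like any static page.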
Some websites use technical measures to limit how often requests can be sent, so pay attention to your request frequency.
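One simple way to keep the request frequency down is to enforce a minimum gap between consecutive requests. The sketch below is only an illustration in Python: the class name Throttle and the interval you pass in are my own choices, and a suitable interval depends on the site being crawled.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock               # injectable so the class can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous request, if any

    def wait(self):
        """Sleep until min_interval has passed since the last call; return the delay."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

Call wait() on a shared Throttle instance right before each request is sent; the first call returns immediately and later calls pause only when the previous request was too recent.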
