Scraping AJAX (asynchronously loaded) content with PHP cURL
Scraping a page whose content is loaded through AJAX is not much different from scraping an ordinary page. An AJAX call is just an asynchronous HTTP request. Use a tool such as Firebug to find the backend service URL that the request is sent to and the parameters it passes, then request that URL yourself with the same parameters.
Locating the request with Firebug's Net panel
Code
// Step 1: request the ordinary page first so the server issues its cookies.
$cookie_file = tempnam('./temp', 'cookier');
$ch = curl_init();
$url1 = "http://www.cdut.edu.cn/default.html";
curl_setopt($ch, CURLOPT_URL, $url1);
curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_ENCODING, 'gzip'); // accept and decode gzip responses
// File in which cookies are saved once the connection ends
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file);
$content = curl_exec($ch);
curl_close($ch);

// Step 2: call the AJAX backend URL directly, sending the parameters seen in Firebug.
$ch3 = curl_init();
$url3 = "http://www.cdut.edu.cn/xww/dwr/call/plaincall/portalAjax.getNewsXml.dwr";
// POST body as it appears in the source; replace the page path and session IDs
// with the values from your own captured request.
$curlPost = "callCount=1&page=/xww/type/custom20118.html&httpSessionId=Role&scriptSessionId=Role&c0-scriptName=portalAjax&c0-methodName=getNewsXml&c0-id=0&c0-param0=string:10000201&c0-param1=string:1000020118&c0-param2=string:news_&c0-param3=number:5969&c0-param4=number:1&c0-param5=null:null&c0-param6=null:null&batchId=0";
curl_setopt($ch3, CURLOPT_URL, $url3);
curl_setopt($ch3, CURLOPT_POST, 1);
curl_setopt($ch3, CURLOPT_POSTFIELDS, $curlPost);
curl_setopt($ch3, CURLOPT_RETURNTRANSFER, 1); // needed so curl_exec() returns the response
// Send the cookies that the first request saved
curl_setopt($ch3, CURLOPT_COOKIEFILE, $cookie_file);
$content1 = curl_exec($ch3);
curl_close($ch3);
Question: PHP cURL stops getting a response after scraping AJAX data for a while
Answer: Try forging the request header information: Host, Referer, User-Agent, and so on.
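The header-forging advice above can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the original post: the function names buildBrowserHeaders and postLikeBrowser are hypothetical, and the timeout values are arbitrary. The first function derives the Host header from the target URL and pairs it with a Referer and a User-Agent. The second sends a POST with those headers through cURL.

```php
<?php
// Hypothetical helper (not from the original post): builds the headers a
// browser would normally send along with an AJAX request.
function buildBrowserHeaders(string $url, string $referer, string $userAgent): array
{
    // The Host header is taken from the URL being requested.
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        throw new InvalidArgumentException("URL has no host: $url");
    }
    return [
        'Host: ' . $host,
        'Referer: ' . $referer,
        'User-Agent: ' . $userAgent,
    ];
}

// Hypothetical wrapper: POSTs $postBody to $url with the given headers.
// Returns the response body as a string, or false if the transfer fails.
function postLikeBrowser(string $url, string $postBody, array $headers)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $postBody);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5); // assumed value
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);       // assumed value
    return curl_exec($ch);
}
```

A call might look like postLikeBrowser($url3, $curlPost, buildBrowserHeaders($url3, $url1, 'Mozilla/5.0')), reusing the AJAX URL, POST body, and page URL captured in Firebug. Whether this is enough to get responses again depends on what the target server actually checks.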
Question: PHP cURL is rejected when scraping a website's content
Answer: I just wrote this. I hope it is useful.
<?php
// Pool of User-Agent strings to pass as $userinfo
$binfo = array(
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; InfoPath.2; AskTbPTV/5.17.0.25589; Alexa Toolbar)',
    'Mozilla/5.0 (Windows NT 5.1; rv:22.0) Gecko/20100101 Firefox/22.0',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET4.0C; Alexa Toolbar)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)',
    $_SERVER['HTTP_USER_AGENT'],
);
// 218.242.124.16*
// 125.90.88.*
// Randomize the last octet of the IP addresses reported in the headers
$cip = '218.242.124.' . mt_rand(0, 254);
$xip = '218.242.124.' . mt_rand(0, 254);
$header = array(
    'CLIENT-IP: ' . $cip,
    'X-FORWARDED-FOR: ' . $xip,
);

// POST $data to $url with the forged headers, Referer, and User-Agent.
function getimgs($url, $data, $userinfo, $header)
{
    $ch = curl_init();
    $timeout = 5;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
    curl_setopt($ch, CURLOPT_REFERER, "www.sgs.gov.cn/lz/etpsInfo.do?method=index");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
    curl_setopt($ch, CURLOPT_USERAGENT, $userinfo);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $contents = curl_exec($ch);
    curl_close($ch);
    return $contents;
}
// $url = '...... (the rest of the code is truncated in the source)