About following redirects when fetching with curl

Source: Internet
Author: User

Today at work I ran into a bug: an MP3 audio file that we had always been able to download came back empty. Requesting the same URL in a browser returned a file with a non-zero size, but the copy I downloaded with curl was 0 KB, and at first I could not find the cause. After some digging I worked it out. I fetch the recording data from a third-party interface, and today the recording address started redirecting: the first request to the address returns a 302.

This was the previous code:

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
// curl_setopt($ch, CURLINFO_HEADER_OUT, TRUE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
$info = curl_exec($ch);

In other words, the server answered curl's first request with a 302, which is a redirect. curl does not follow redirects by default, so $info was always empty.

After the fix:

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
// curl_setopt($ch, CURLINFO_HEADER_OUT, TRUE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
$info = curl_exec($ch);

This version adds CURLOPT_FOLLOWLOCATION, which tells curl to follow redirects. Now $info contains data!
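The bug above went unnoticed because an empty response was accepted silently. As a rough sketch (not the original code), the download could check the result before using it. The function names redirect_options and download are hypothetical; only the curl_* functions and CURLOPT_*/CURLINFO_* constants are PHP's own. Unlike the code above, this sketch leaves certificate verification at its defaults.

```php
<?php
// Hypothetical helper: build the option set that follows redirects,
// with a cap on how many redirects are allowed.
function redirect_options(string $url, int $maxRedirects = 20): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,          // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,          // follow 3xx redirects
        CURLOPT_MAXREDIRS      => $maxRedirects, // stop after this many redirects
    ];
}

// Hypothetical helper: download $url and fail loudly instead of
// returning an empty body.
function download(string $url): string
{
    $ch = curl_init();
    curl_setopt_array($ch, redirect_options($url));

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('curl error: ' . curl_error($ch));
    }

    // Status code of the final response, after any redirects.
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($status !== 200 || $body === '') {
        throw new RuntimeException(
            "Unexpected response: HTTP $status, " . strlen($body) . ' bytes'
        );
    }
    return $body;
}
```

With this in place, $info = download($url); either returns the audio data or throws, so a 0 KB file is never written without an error.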

PS: About the redirect options

curl_setopt($ch, CURLOPT_MAXREDIRS, 20);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

CURLOPT_FOLLOWLOCATION enables automatic redirect following, and CURLOPT_MAXREDIRS sets the maximum number of redirects allowed.
However, note that CURLOPT_FOLLOWLOCATION can only be used when safe mode is off and open_basedir is not set. open_basedir is a php.ini setting that restricts the files PHP can access to a given directory. (This restriction applies to older PHP versions; safe mode itself was removed in PHP 5.4.)
If safe mode is on, or open_basedir is set, automatic redirect following is unavailable, but you can still reach the final page by following the redirects yourself, one request at a time. To speed this up and avoid unnecessary overhead, set the following options when requesting the intermediate (non-target) pages:

curl_setopt($rch, CURLOPT_HEADER, TRUE);
curl_setopt($rch, CURLOPT_NOBODY, TRUE);

This fetches only the response headers, not the page content. Check the status code in the headers: if it is a redirect (301 or 302), take the address from the Location header and request that, repeating until the status code is 200. Then fetch the target page itself.
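The manual approach described above can be sketched roughly as follows. The function names (fetch_headers, parse_status_code, parse_location, resolve_final_url) are hypothetical. The sketch assumes that the server sends absolute URLs in the Location header (relative ones would need resolving against the current URL) and that it answers header-only requests the same way as normal ones.

```php
<?php
// Hypothetical helper: fetch only the response headers for $url,
// without following redirects.
function fetch_headers(string $url): string
{
    $rch = curl_init();
    curl_setopt($rch, CURLOPT_URL, $url);
    curl_setopt($rch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($rch, CURLOPT_HEADER, true);  // include headers in the output
    curl_setopt($rch, CURLOPT_NOBODY, true);  // skip the body
    $raw = curl_exec($rch);
    return $raw === false ? '' : $raw;
}

// Extract the status code from a status line such as "HTTP/1.1 302 Found".
// Returns 0 if no status line is found.
function parse_status_code(string $rawHeaders): int
{
    if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#m', $rawHeaders, $m)) {
        return (int) $m[1];
    }
    return 0;
}

// Extract the Location header value, or null if there is none.
function parse_location(string $rawHeaders): ?string
{
    if (preg_match('/^Location:\s*(.+?)\s*$/mi', $rawHeaders, $m)) {
        return $m[1];
    }
    return null;
}

// Follow 301/302 responses by hand until a non-redirect response appears.
// $fetch can be replaced with another header fetcher; it defaults to
// fetch_headers above.
function resolve_final_url(string $url, int $maxRedirects = 20, ?callable $fetch = null): string
{
    $fetch = $fetch ?? 'fetch_headers';
    $redirects = 0;
    while (true) {
        $headers  = $fetch($url);
        $status   = parse_status_code($headers);
        $location = parse_location($headers);

        if (!in_array($status, [301, 302], true) || $location === null) {
            return $url; // not a redirect: this is the target address
        }
        if (++$redirects > $maxRedirects) {
            throw new RuntimeException('Too many redirects');
        }
        $url = $location;
    }
}
```

Once resolve_final_url($url) returns, the target page can be fetched from that address with an ordinary request (CURLOPT_NOBODY off) to get the actual content.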
