To crawl remote Web content through a local proxy, I use the following code:
<?php
$options = array(
    'http' => array(
        'proxy' => 'tcp://192.168.1.108:8087',
        'request_fulluri' => true,
        'method' => 'GET',
        'timeout' => 2,
    ),
);
$context = stream_context_create($options);
$fp = stream_socket_client("tcp://www.bigxu.com:80", $errno, $errstr, 30,
    STREAM_CLIENT_CONNECT, $context);
// print_r(stream_context_get_options($fp)); exit;
if (!$fp) {
    echo "$errstr ($errno)\n";
} else {
    fwrite($fp, "GET / HTTP/1.0\r\nHost: www.bigxu.com\r\nAccept: */*\r\n\r\n");
    while (!feof($fp)) {
        echo fgets($fp, 1024);
    }
    fclose($fp);
}
?>
php file.php
The Nginx access log is:
15.196.206.102 [26/apr/2014:12:04:45 +0800] http://www.bigxu.com/200 20630 0.241 "-" "-"
15.196.206.102 is my local machine's IP.
The $context had no effect. The proxy itself is definitely working, because with the code below the $context does take effect:
$options = array(
    'http' => array(
        'proxy' => 'tcp://192.168.1.108:8087',
        'request_fulluri' => true,
        'method' => 'GET',
        'timeout' => 2,
    ),
);
$context = stream_context_create($options);
if ($fp = fopen("http://www.bigxu.com", 'r', false, $context)) {
    print "well done";
    while (!feof($fp)) {
        echo fgets($fp, 1024);
    }
}
php file.php
The bigxu.com Nginx access log is:
8.35.201.32 [26/apr/2014:12:03:03 +0800] http://www.bigxu.com/200 7070 0.122 "-" "Appengine-google; (+http://code.google.com/appengine; appid:s~goagent0527) "
8.35.201.32 is my proxy IP.
For a large-scale crawler project I would definitely prefer stream_socket_client for the connections. Can you help me figure out what I am doing wrong with this function?
Reply content:
Solved. Hard to believe I actually found the answer on a Japanese website, so don't be so quick to boycott Japanese knowledge, haha! After testing, it works completely. See:
Original address: http://pe5974.sakura.ne.jp/contents/proxy-https.php
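The gist of the fix: the 'http' context options (including 'proxy' and 'request_fulluri') only apply to PHP's http:// stream wrapper, i.e. to fopen()/file_get_contents(). stream_socket_client() opens a raw TCP socket, so the context's proxy setting is silently ignored and the connection goes out directly. To route through a proxy with stream_socket_client, connect to the proxy's address yourself and put the absolute URI in the request line. A minimal sketch, reusing the proxy address from the question (the build_proxy_request helper is my own illustration, not a built-in):

```php
<?php
// Build an HTTP/1.0 request whose request line carries the absolute URI --
// this is by hand what 'request_fulluri' => true does for the http:// wrapper.
function build_proxy_request($method, $url) {
    $parts = parse_url($url);
    $host = $parts['host'];
    return "$method $url HTTP/1.0\r\nHost: $host\r\nAccept: */*\r\n\r\n";
}

// Connect to the PROXY itself, not to the target host:
$fp = stream_socket_client("tcp://192.168.1.108:8087", $errno, $errstr, 30);
if (!$fp) {
    die("$errstr ($errno)\n");
}
fwrite($fp, build_proxy_request("GET", "http://www.bigxu.com/"));
while (!feof($fp)) {
    echo fgets($fp, 1024);
}
fclose($fp);
?>
```

For HTTPS targets (which the linked Japanese page covers) you would instead send a CONNECT request to the proxy and then enable TLS on the tunneled socket with stream_socket_enable_crypto().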
When it comes to scraping, the first thing to reach for is PHP's cURL extension.
The best approach is to simulate a crawler (e.g., the Baidu spider or Google spider), and cURL also supports proxy configuration.
By simulating a browser's request headers with cURL, there is almost nothing you can't fetch (in theory, any data a browser can request, whether logged in or not, can be captured).
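A sketch of that suggestion: a cURL handle configured with a proxy and a spider User-Agent. The proxy address and target URL are taken from the question, and the Baiduspider UA string and the make_crawler_handle helper are illustrative assumptions:

```php
<?php
// Create a cURL handle that goes through a proxy and presents a
// crawler User-Agent. Adjust $proxy/$userAgent to your own setup.
function make_crawler_handle($url, $proxy, $userAgent) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);           // route the request via the proxy
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);   // present a spider UA
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);    // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);    // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    return $ch;
}

$ch = make_crawler_handle(
    "http://www.bigxu.com",
    "192.168.1.108:8087",
    "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
);
// $html = curl_exec($ch);  // uncomment to actually fetch through the proxy
curl_close($ch);
```

Extra request headers (Referer, Cookie, etc.) can be added the same way with CURLOPT_HTTPHEADER when a site checks for them.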
Maybe that particular proxy has a problem? Try configuring a proxy of your own and see.
Following this thread. stream_socket_client — go ahead and make it work.