Curl is supposed to be powerful, but it cannot fetch this web page. How can this be solved?
Source: Internet
Author: User
I have always used curl to fetch pages; it is convenient and has served me well for a long time. However, the seemingly simple task of fetching the homepage of Huaban (huaban.com) does not succeed.
The basic code is as follows:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://huaban.com/');
// Simulate a search-engine spider
// curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)');
// Simulate a normal browser
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727)');
// No cookies are used; without logging in, the request should still return the home page
// curl_setopt($ch, CURLOPT_USERAGENT, '');
// Referer (entering the address directly also works)
curl_setopt($ch, CURLOPT_REFERER, 'http://huaban.com/');
// curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
// curl_setopt($ch, CURLOPT_HEADER, 0); // whether to output the response header
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 0); // 0: print the response directly instead of returning it
curl_exec($ch);
curl_close($ch);
I have repeatedly tried different combinations of cookies, headers, and user agents, but none of them returns the visible page that a browser shows. I even tried file_get_contents('http://huaban.com/'). The vast majority of the returned content is JavaScript code. Yet the pages I crawled successfully before, on websites of all sizes and with JavaScript of their own, could be fetched and displayed remotely without trouble. After a day of trying I was still puzzled, so I raised it in the CSDN QQ group, where some people said that curl cannot run JS. But hardly any website today is free of JS, and I have crawled plenty of JS-based sites before.
I really don't know how to solve this, so I would like to ask the experts: is it a problem with curl, is this website unusual, or is my method wrong?
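One way to confirm the observation that the response is mostly JavaScript is to measure how much of the fetched HTML sits inside script tags. The following is only a minimal sketch: the function name scriptShare is hypothetical, and the regular expression is a rough heuristic rather than a full HTML parser.

```php
<?php
// Hypothetical helper (not from the original post): returns the fraction of
// the document, between 0.0 and 1.0, occupied by <script>...</script> elements.
function scriptShare(string $html): float
{
    $total = strlen($html);
    if ($total === 0) {
        return 0.0;
    }
    // Match every script element, including its opening and closing tags.
    preg_match_all('~<script\b[^>]*>.*?</script>~is', $html, $matches);
    $scriptBytes = array_sum(array_map('strlen', $matches[0]));
    return $scriptBytes / $total;
}

// Usage with curl: set CURLOPT_RETURNTRANSFER to true so that curl_exec()
// returns the body as a string, then pass that string in:
// $html = curl_exec($ch);
// printf("%.0f%% of the page is script\n", scriptShare($html) * 100);
```

A value close to 1.0 suggests that the page is a shell whose visible content is filled in later by JavaScript.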
------ Solution --------------------
How could a stylish, trendy website like this survive in such a fiercely competitive market without JS?
------ Solution --------------------
What is special about this website is that most of its content is generated dynamically by JS, and new content keeps being produced through interaction between the JS and the backend.
Therefore, curl captures only the initial code, which is one large block of JS.
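If the content comes from requests that the page's JS makes to the backend, one option is to have curl imitate such a request instead of a normal page load. The sketch below is an assumption, not something stated in this thread: it sends the X-Requested-With header that many JS libraries attach to their requests and asks for JSON. Whether huaban.com answers such a request with JSON is not confirmed here, and the helper names ajaxHeaders and fetchJson are hypothetical.

```php
<?php
// Hypothetical helper: headers that a browser-side XMLHttpRequest commonly sends.
function ajaxHeaders(): array
{
    return [
        'X-Requested-With: XMLHttpRequest',
        'Accept: application/json',
    ];
}

// Hypothetical helper: fetches $url with the headers above and decodes the
// body as JSON. Returns null if the request fails or the body is not a JSON
// object or array.
function fetchJson(string $url): ?array
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HTTPHEADER, ajaxHeaders());
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    $body = curl_exec($ch);
    if ($body === false) {
        return null;
    }
    $data = json_decode($body, true);
    return is_array($data) ? $data : null;
}

// Usage (performs a network request, so it is left commented out):
// $data = fetchJson('http://huaban.com/');
// var_dump($data === null ? 'no JSON returned' : array_keys($data));
```

If this returns null, the browser's network panel shows which URLs and headers the page's JS actually uses, and those can then be reproduced with curl.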
------ Solution --------------------
Is this the data you want? I don't know how you were capturing the packets.
{"filter": "pin:category:all", "pins": [{"pin_id": 8447271, "user_id": 394332, "board_id": 1146189, "file_id": 3483249, "file": {"farm": "farm1", "bucket": "hbimg", "key": "a1524741e8fae0916ba04c8d231f8ad23173ddb5baeff-rNFCpP", "type": "image/jpeg", "width": 440, "height": 5779, "frames": 1}, "media_type": 0, "source": "weibo.com", "link": "http://weibo.com/2134919185/yoVlDsGWs", "raw_text": "A small bulb is transformed. you can do it with your own hands ~", "text_meta": {}, "via": 2, "via_user_id": 0, "original": null, "created_at": 1340276725, "like_count": 0, "comment_count": 0, "repin_count": 0, "is_private": 0, "orig_source": "http://ww4.sinaimg.cn/bmiddle/7f404811jw1du5vv6dpnij.jpg", "user": {"user_id": 394332, "username": "Havetogo", "urlname": "shouji132136652610", "created_at": 1338984624, "avatar": {"id": 3061779, "farm": "farm1", "bucket": "hbimg", "key": "69d6d7842159946de9ca070c22da1714f259010afb4-WcVdOr", "type": "image/jpeg", "width": 100, "height": 100, "frames": 1}}, "board": {"board_id": 1146189, "user_id": 394332, "title": "Power of Innovation", "description": "", "category_id": null, "seq": 6, "pin_count": 1, "follow_count": 0, "created_at": 1340276719, "updated_at": 1340276725, "is_private": 0}}, {"pin_id": 8447272, "user_id": 444560, "board_id": 1146190, "file_id": 2064511, "file": {"farm": "farm1", "bucket": "hbimg", "key": "aa4fab086fe5887299cf17df48a250f9df25e375c95b-M4izBs", "type": "image/jpeg", "width": 440, "height": 566, "frames": 1}, "media_type": 0, "source": "weibo.com", "link": "http://weibo.com/2596178104/ycTQfusRg", "raw_text": "violet coloring reason: # Popularity of jadeite knowledge # (61) it is generally believed that the native jadeite mine contains a trace of manganese, due to the diversity of manganese and other trace elements such as iron infiltration degree is different, its purple also has different shades of shaving, such as pink purple, eggplant purple, basket purple various Violet. ten spring bamboo, because the jadeite ore contains manganese is a probability event, so the relative number of purple jadeite is very small, coupled with better water, less.", "text_meta": {"tags":
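Once a response like the one above has been obtained, the pins can be read with json_decode instead of scraping HTML. This is a minimal sketch under some assumptions: the function name summarizePins is hypothetical, the key names are taken in lowercase (pins, pin_id, raw_text, file, user), and the sample string is a shortened version of the data shown above, since the full response is cut off.

```php
<?php
// Hypothetical helper: extracts a few fields from each entry of the "pins" array.
// Returns an empty array if the input is not JSON or has no "pins" list.
function summarizePins(string $json): array
{
    $data = json_decode($json, true);
    if (!is_array($data) || !isset($data['pins']) || !is_array($data['pins'])) {
        return [];
    }
    $summary = [];
    foreach ($data['pins'] as $pin) {
        $summary[] = [
            'pin_id'   => $pin['pin_id'] ?? null,
            'text'     => $pin['raw_text'] ?? '',
            'file_key' => $pin['file']['key'] ?? null,
            'username' => $pin['user']['username'] ?? null,
        ];
    }
    return $summary;
}

// Shortened sample built from the first pin in the response above.
$sample = '{"filter":"pin:category:all","pins":[{"pin_id":8447271,'
    . '"raw_text":"A small bulb is transformed.",'
    . '"file":{"key":"a1524741e8fae0916ba04c8d231f8ad23173ddb5baeff-rNFCpP"},'
    . '"user":{"username":"Havetogo"}}]}';

$pins = summarizePins($sample);
foreach ($pins as $pin) {
    echo $pin['pin_id'], ' by ', $pin['username'], ': ', $pin['text'], "\n";
}
```

How a file key maps to an image URL is not shown in the thread, so the sketch leaves the key as it is.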