Curl is powerful, but it cannot even capture the homepage of Huaban (huaban.com)!

Source: Internet
Author: User
I have always used curl to capture pages; it is convenient and has served me well for a long time. However, in a seemingly simple task, crawling the homepage of Huaban (huaban.com), I found that it does not succeed.

The basic code is as follows:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://huaban.com/');
// Simulate a spider
// curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)');
// Simulate a normal browser
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727)');
// No cookies are used; the homepage should be returned without logging in.
// curl_setopt($ch, CURLOPT_USERAGENT, '');
// Referer (the address can also be entered directly)
curl_setopt($ch, CURLOPT_REFERER, 'http://huaban.com/');
// curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
// curl_setopt($ch, CURLOPT_HEADER, 0); // output header
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 0); // 0 = print the response directly instead of returning it
curl_exec($ch);
curl_close($ch);

I have repeatedly tried different cookies, headers, and user agents, and none of them returns the visible page that a browser shows. I even tried file_get_contents('http://huaban.com/'). The vast majority of the returned content is JavaScript code. Yet the pages I crawled successfully before, on websites of all sizes and with plenty of JavaScript, could always be fetched and displayed remotely. After a day of trying I was still puzzled, so I raised it in the CSDN QQ group. Some people said that curl cannot run JS. But nowadays hardly any website is free of JS, and I have crawled quite a few JS-heavy sites before without this problem.

I really don't know how to solve this, so I am asking the experts here. Is curl not up to it? Is this website unusual? Or is my method wrong?


Reply to discussion (solution)

Could an elegant, "fresh"-style website like this survive in such a fiercely competitive market without JS?

What is special about this website is that most of its content is generated dynamically by JS, and new content keeps being produced through interaction between the JS and the backend.
Therefore, curl captures only the initial markup, which is one large block of JS.
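To make the point above concrete, here is a minimal sketch that measures how much of a fetched document is inline script and how much is text a reader would actually see. The helper name summarizeMarkup() and the sample markup (including the app.page variable) are made up for illustration and are not taken from Huaban's real pages; it assumes PHP's DOM extension is available.

```php
<?php
// Hypothetical helper (not from the thread): reports how much of an HTML
// document is inline script versus text that a reader would actually see.
function summarizeMarkup(string $html): array
{
    libxml_use_internal_errors(true);   // tolerate imperfect real-world markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Copy the script nodes into a plain array first, because removing
    // nodes while iterating the live node list would skip some of them.
    $scripts = [];
    $scriptBytes = 0;
    foreach ($doc->getElementsByTagName('script') as $node) {
        $scripts[] = $node;
        $scriptBytes += strlen($node->textContent);
    }
    foreach ($scripts as $node) {
        $node->parentNode->removeChild($node);
    }

    // What is left in <body> approximates the text visible without running JS.
    $body = $doc->getElementsByTagName('body')->item(0);
    $visible = $body ? trim(preg_replace('/\s+/', ' ', $body->textContent)) : '';

    return ['script_bytes' => $scriptBytes, 'visible_text' => $visible];
}

// Made-up sample of a JS-driven page: an empty container plus inline data.
$sample = '<html><head><title>demo</title></head><body>'
        . '<div id="page"></div>'
        . '<script>app.page = {"pins": []};</script>'
        . '</body></html>';

$summary = summarizeMarkup($sample);
var_dump($summary);
```

If the visible text comes back empty while the script size is large, the page is most likely assembled in the browser, and what curl returned is complete but not yet rendered.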

"JS interacts with the backend to keep generating new content"?
If so, a packet capture tool should be able to see it. Why can't the Ajax content be captured?
I found no such request when capturing packets. Surely there must always be some address that is being accessed.

Is this the data you want? I don't know how you captured packets.
{"filter": "pin:category:all", "pins": [{"pin_id": 8447271, "user_id": 394332, "board_id": 1146189, "file_id": 3483249, "file": {"farm": "farm1", "bucket": "hbimg", "key": "a1524741e8fae0916ba04c8d231f8ad23173ddb5baeff-rNFCpP", "type": "image/jpeg", "width": 440, "height": 5779, "frames": 1}, "media_type": 0, "source": "weibo.com", "link": "http://weibo.com/2134919185/yoVlDsGWs", "raw_text": "A small bulb is transformed. you can do it with your own hands ~", "text_meta": {}, "via": 2, "via_user_id": 0, "original": null, "created_at": 1340276725, "like_count": 0, "comment_count": 0, "repin_count": 0, "is_private": 0, "orig_source": "http://ww4.sinaimg.cn/bmiddle/7f404811jw1du5vv6dpnij.jpg", "user": {"user_id": 394332, "username": "Havetogo", "urlname": "shouji132136652610", "created_at": 1338984624, "avatar": {"id": 3061779, "farm": "farm1", "bucket": "hbimg", "key": "69d6d7842159946de9ca070c22da1714f259010afb4-WcVdOr", "type": "image/jpeg", "width": 100, "height": 100, "frames": 1}}, "board": {"board_id": 1146189, "user_id": 394332, "title": "Power of Innovation", "description": "", "category_id": null, "seq": 6, "pin_count": 1, "follow_count": 0, "created_at": 1340276719, "updated_at": 1340276725, "is_private": 0}}, {"pin_id": 8447272, "user_id": 444560, "board_id": 1146190, "file_id": 2064511, "file": {"farm": "farm1", "bucket": "hbimg", "key": "aa4fab086fe5887299cf17df48a250f9df25e375c95b-M4izBs", "type": "image/jpeg", "width": 440, "height": 566, "frames": 1}, "media_type": 0, "source": "weibo.com", "link": "http://weibo.com/2596178104/ycTQfusRg", "raw_text": "violet coloring reason: # Popularity of jadeite knowledge # (61) it is generally believed that the native jadeite mine contains a trace of manganese, due to the diversity of manganese and other trace elements such as iron infiltration degree is different, its purple also has different shades of shaving, such as pink purple, eggplant purple, basket purple various Violet. ten spring bamboo, because the jadeite ore contains manganese is a probability event, so the relative number of purple jadeite is very small, coupled with better water, less.", "text_meta": {"tags":
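Data of this shape can be consumed directly, without rendering any page. The following is only a sketch: the sample is a trimmed copy of the first pin quoted above with lowercase key names assumed, extractPins() is a hypothetical helper, and the address that serves this JSON is not given in the thread, so the code starts from a string that has already been fetched.

```php
<?php
// Trimmed sample modelled on the JSON quoted above (first pin only).
// Lowercase key names such as "raw_text" are an assumption.
$json = <<<'JSON'
{"filter": "pin:category:all", "pins": [
  {"pin_id": 8447271, "user_id": 394332, "board_id": 1146189,
   "raw_text": "A small bulb is transformed. you can do it with your own hands ~",
   "orig_source": "http://ww4.sinaimg.cn/bmiddle/7f404811jw1du5vv6dpnij.jpg",
   "file": {"farm": "farm1", "bucket": "hbimg", "type": "image/jpeg",
            "width": 440, "height": 5779}}
]}
JSON;

// Hypothetical helper: reduces each pin to the few fields needed for display.
function extractPins(string $json): array
{
    $data = json_decode($json, true);   // true = decode objects as arrays
    if (!is_array($data) || !isset($data['pins']) || !is_array($data['pins'])) {
        return [];                       // invalid JSON or unexpected shape
    }

    $rows = [];
    foreach ($data['pins'] as $pin) {
        $rows[] = [
            'pin_id' => $pin['pin_id'] ?? null,
            'text'   => $pin['raw_text'] ?? '',
            'image'  => $pin['orig_source'] ?? null,
            'width'  => $pin['file']['width'] ?? null,
            'height' => $pin['file']['height'] ?? null,
        ];
    }
    return $rows;
}

$pins = extractPins($json);
foreach ($pins as $pin) {
    printf("#%d %s\n    %s\n", $pin['pin_id'], $pin['text'], $pin['image']);
}
```

Turning these rows into HTML is then ordinary templating, which is roughly what the site's own JS does in the browser.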

....

It seems I have run into the same problem. Take a look at this page, which simulates a POST to submit data; the address is http://mixiaba.com/diy/iphoneok.asp?sid=null&pov=5

To the poster above: what you are doing is simulating a data submission to QQ Zone. That is no problem and not hard to achieve.

But the key point is that I am talking about this homepage. Never mind submitting data; even the most basic page cannot be opened, and the page for posting data behaves the same way. This is very strange. It seems that every page of this website has been specially processed.

Packet capture does capture data. The data posted in the fourth reply can be captured, but it is not a visible page. The result is totally different from opening the site in a browser. When data is posted, the response comes back immediately, saying that the page does not exist.

Isn't that data exactly the content displayed on the homepage? The site returns the data and uses JS to produce the final result. What HTML did you expect to capture?

Then let me ask: by your logic, search engines would not be able to capture snapshot content either. Run a site: query for Huaban on Baidu, Google, or Soso and see whether static, visible content is shown. It clearly is!

It stands to reason that if the page is visible in a local browser, there should be a way to make it visible remotely.

There is no technical problem in the browser.

The search engine results clearly have it, the snapshots have it, and the images too. Run a site: query and then we can talk.

I want to know how search engines capture the content. Since they can capture it, there must be a way to do it.

I have not read Huaban's code carefully, but the data I posted is the content of the homepage (it should be; otherwise why would I have posted it?), so it basically works the way I described. And how do you compare with Baidu and Google? How many excellent programmers do they take on every year? Do you think they crawl with curl too? If you insist that the homepage must be captured with curl alone, I have nothing more to say.

I wanted to come in and play LM ....

Now, to answer the question on the 12th floor:
Search engines also capture data such as HTML, JS, CSS, and JSON,
including what was provided to you on the 4th floor. The actual page is generated when a browser parses that data.

As for how Google generates snapshots: Google has its own browser. Do you think it is difficult for them to parse the captured HTML, JS, and CSS into pages?

Similarly, if necessary, you can also use the captured JS, HTML, and CSS to generate the page yourself. However, curl alone is not enough.

There is always a solution for everything; the principle alone does not get me there.
For now, simulated login is not working yet, and data cannot be submitted.

Done. Thank you!

How did you solve this problem? I have run into the same one.
