Why can't scrapy crawl the CCDI website?

Source: Internet
Author: User

Reply content:

No matter what device you use, the first time you visit the site you get a 521 status code, and a cookie is returned along with it.
The browser accepts that status code, stores the cookie from the Set-Cookie header, and requests the page again, so the second request carries the cookie it just received.
That second request succeeds.

This anti-crawler technique is very basic: it exploits the fact that a typical crawler and a browser handle that status code differently.

To crawl the site, you only need to request it once, save the cookie you get back, and then attach that saved cookie to every subsequent request.
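A minimal sketch of that flow in Python, assuming (as described above) that the cookie really does come back in the Set-Cookie header of the first 521 response; the URL is a placeholder, not the real address:

```python
import requests

session = requests.Session()

# First request: expect a 521 status code plus a Set-Cookie header.
first = session.get("http://example.gov.cn/", timeout=10)
print(first.status_code)           # typically 521 on the first visit
print(session.cookies.get_dict())  # the cookie the server handed back

# Second request: the Session re-sends the saved cookie automatically,
# exactly as a browser would, so this one should come back 200.
second = session.get("http://example.gov.cn/", timeout=10)
print(second.status_code)
```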

The trick with crawlers is always to mimic, as closely as you can, what a user does in a browser. I had to think for a while before answering, because the question had been tagged as dead... ╭(°A°')╮
---------------------------------------------------------------
Long live Chrome, long live Python.



@Lin Canbin's method is correct, but this site's cookies usually stay valid for quite a long time, so you can simply copy the cookie value out of the request headers and drop it into the headers of your simulated request. Open the URL in your browser yourself and look at the request headers to find it. As for the various anti-scraping mechanisms, the main ones are the following (a sketch that puts them together comes after the list):
1. Cookie: needless to say, just send your cookie along. This also takes care of sites that require a simulated login before you can crawl them, such as Weibo;
2. Host: also self-explanatory; set it correctly and it never changes;
3. Referer: some sites are nastier and check not only whether you send a Referer but also whether that Referer is legitimate. No names mentioned, I'd rather not get a knock on the door;
4. User-Agent: this describes your environment, such as browser and operating system; fill it in as completely as possible. Best of all, keep your own table of UA strings and pick one at random each time you build a request.
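Here is a rough sketch of how points 1-4 might be assembled into request headers with the requests library; the cookie string, Host, Referer, and UA list are all placeholders you would replace with values copied from your own browser session:

```python
import random
import requests

# A small UA table to rotate through; extend it with real browser strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

def build_headers(cookie_value: str, referer: str) -> dict:
    """Assemble headers that mimic a normal browser visit."""
    return {
        "Cookie": cookie_value,                     # 1. copied from your browser
        "Host": "example.gov.cn",                   # 2. fixed per site (placeholder)
        "Referer": referer,                         # 3. a legitimate in-site page
        "User-Agent": random.choice(USER_AGENTS),   # 4. rotate the UA per request
    }

resp = requests.get(
    "http://example.gov.cn/some/page",
    headers=build_headers("name=value; other=value", "http://example.gov.cn/"),
    timeout=10,
)
print(resp.status_code)
```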

Beyond that, the most important work is on the IP side.

1. Forging the X-Forwarded-For header: this is the easiest way to fake an IP, and of course also the easiest to detect;
2. TCP packet injection: scrapy has this wrapped up; check the relevant documentation for the details. It can also be used to spoof the source IP for a SYN flood denial-of-service attack (okay, that's drifting off topic...);
3. Using a proxy IP pool: this is the most reliable approach. The downside is that proxy quality tends to be uneven, so you also need a script to maintain the pool, fetching new proxy IPs and kicking out the dead ones (I usually test them by trying to reach Baidu); a rough sketch follows.
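One possible way to maintain such a pool, probing each candidate against Baidu as described in point 3; the proxy addresses below are made-up placeholders:

```python
import requests

CANDIDATES = [
    "http://1.2.3.4:8080",
    "http://5.6.7.8:3128",
]

def alive(proxy: str, test_url: str = "https://www.baidu.com", timeout: int = 5) -> bool:
    """Return True if the proxy can fetch the test page within the timeout."""
    try:
        r = requests.get(
            test_url,
            proxies={"http": proxy, "https": proxy},
            timeout=timeout,
        )
        return r.status_code == 200
    except requests.RequestException:
        return False

# Keep only the proxies that still respond; rerun this periodically.
pool = [p for p in CANDIDATES if alive(p)]
print(f"{len(pool)} usable proxies:", pool)
```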

If the content you want to crawl is fetched via AJAX, then your task is to find the request that the JS sends and simulate it with the same routine as above. Or bring out the big gun:
http://jeanphix.me/ghost.py/

The crawler is only a little bug; it fears the CCDI and dares not crawl the CCDI website. For a site with this few pages, I'd suggest writing your own crawler rather than reaching straight for scrapy; it's more flexible. There isn't much content here, so bs4/re + requests + gevent/threading is enough to grab it all. I've always thought the hard part of crawling isn't getting the pages down, but what you do with them afterwards.
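For what that combination might look like in practice, here is a minimal bs4 + requests + threading sketch; the URLs and the CSS selector are placeholders and would have to be adapted to the real page structure:

```python
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

# Placeholder list-page URLs; a handful of pages is enough for a site this size.
PAGE_URLS = [f"http://example.gov.cn/list_{i}.html" for i in range(1, 6)]

def fetch_titles(url: str) -> list:
    """Download one list page and pull out the article titles."""
    resp = requests.get(url, timeout=10)
    resp.encoding = resp.apparent_encoding  # many .gov.cn pages are GBK-encoded
    soup = BeautifulSoup(resp.text, "html.parser")
    return [a.get_text(strip=True) for a in soup.select("ul.list a")]

# A small thread pool keeps the few requests concurrent without much ceremony.
with ThreadPoolExecutor(max_workers=5) as pool:
    for titles in pool.map(fetch_titles, PAGE_URLS):
        for title in titles:
            print(title)
```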
When you can't figure out how the page navigation works, the magic tool is Wireshark.