Why can't scrapy crawl the Central Commission for Discipline Inspection website?

Reply content:

No matter what device you use, the first time you access the site it returns a 521 status code along with a Set-Cookie header.
When a browser receives that status code and cookie, it makes the request again; because it honored the Set-Cookie header, the second request carries the previously received cookie in its Request Headers. That second request succeeds.

This anti-crawler technique is very basic: it exploits the fact that browsers and naive crawlers handle the status code and Set-Cookie differently.

To crawl the site, you only need one initial request to obtain the cookie and save it; then attach the saved cookie to every subsequent request.
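A minimal sketch of that flow with the requests library, assuming the cookie really is delivered in a plain Set-Cookie header on the 521 response (some 521 challenges generate the cookie with JavaScript instead, which this sketch does not handle):

```python
import requests

URL = "http://www.ccdi.gov.cn/"  # the site from the question

session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

# First request: comes back 521, but the Session stores the Set-Cookie.
first = session.get(URL)
print(first.status_code, session.cookies.get_dict())

# Second request: the Session attaches the saved cookie automatically,
# so this one should return 200 with the real page.
second = session.get(URL)
print(second.status_code)
```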

The best approach for a crawler is to imitate a real user's browser behavior as closely as possible. I hesitated for a while over whether to answer, given the label attached to this question ...... (°A°')
---------------------------------------------------------------
Chrome is better than Python.



@LinCanbin's method is correct, but on sites like this the cookie usually stays valid for a long time, so you can simply copy the Cookie value from the Request Headers in your browser and put it into the request you construct. Open the URL in your browser and look at the Request Headers. The main anti-crawling checks are:
1. Cookie: you can copy your own cookie over by hand; for sites that require login, such as Weibo, you can also obtain it through a simulated login;
2. Host: nothing to say here; just include it, it never changes;
3. Referer: some sites are pickier and will not only check that a Referer is present but also verify that it points to a valid page (I won't name names, for fear of getting into trouble);
4. User-Agent: information about the user's environment, such as browser and operating system; include as much as possible. Ideally keep a UA table and pick a random entry each time you construct a request (see the sketch after this list).
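One way those four checks might be covered when building a request; the host, cookie value, and UA strings here are placeholders rather than the real site's values:

```python
import random
import requests

# A tiny UA table; in practice keep a longer one and refresh it occasionally.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]

def build_headers(referer):
    """Cover the four checks above: Cookie, Host, Referer, User-Agent."""
    return {
        "Cookie": "name=value-copied-from-browser",  # 1. paste from your browser
        "Host": "www.example.com",                   # 2. fixed per site
        "Referer": referer,                          # 3. must look like a real page
        "User-Agent": random.choice(USER_AGENTS),    # 4. rotate per request
    }

resp = requests.get("http://www.example.com/list",
                    headers=build_headers("http://www.example.com/"))
```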

Beyond the headers, though, the most important thing is to work on the IP address.

1. Forge the X-Forwarded-For header: this is the easiest way to fake a source IP, and also the easiest to see through;
2. Use TCP packet injection: this is what the Scapy packet-crafting library (not Scrapy) is built for; check its documentation. You can also use it to forge a source IP for a SYN-flood denial-of-service attack (getting off topic ......);
3. Use a proxy IP pool: this is the most reliable approach. The drawback is that proxy quality is often uneven, so you need a script to maintain the pool, fetching new proxies and removing dead ones from the database (I usually test them one by one against Baidu; see the sketch after this list).
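A sketch of the kind of maintenance script meant in point 3, probing each proxy against Baidu as suggested (the proxy addresses are made up):

```python
import requests

CHECK_URL = "https://www.baidu.com"  # probe target suggested in the answer

def is_alive(proxy, timeout=5):
    """Return True if the proxy can still fetch the probe URL."""
    try:
        r = requests.get(CHECK_URL,
                         proxies={"http": proxy, "https": proxy},
                         timeout=timeout)
        return r.status_code == 200
    except requests.RequestException:
        return False

def prune_pool(pool):
    """Drop dead proxies; run this periodically and also add fresh ones."""
    return [p for p in pool if is_alive(p)]

pool = ["http://10.0.0.1:8080", "http://10.0.0.2:3128"]  # made-up entries
pool = prune_pool(pool)
print(pool)
```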

If the content you want is loaded through AJAX, your task is to find the request that the JS sends and simulate it following the same routine above; a sketch of replaying such a request follows. If that fails, bring out the artifact: http://jeanphix.me/Ghost.py/
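The endpoint and parameters below are hypothetical; find the real ones in the browser's Network panel (filter by XHR) and copy them over:

```python
import requests

# Hypothetical JSON endpoint discovered in the browser's Network panel.
api = "http://www.example.com/api/articles"
params = {"page": 1, "size": 20}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "http://www.example.com/",
    "X-Requested-With": "XMLHttpRequest",  # many AJAX endpoints check this
}

resp = requests.get(api, params=params, headers=headers)
print(resp.json())  # AJAX endpoints usually return JSON, not HTML
```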
---------------------------------------------------------------
Crawlers (爬虫, literally "crawling worms") are worms: they fear the Central Commission for Discipline Inspection and dare not crawl its website.
---------------------------------------------------------------
For a site with only a few pages, I recommend writing the crawler yourself rather than using scrapy directly; it's more flexible. When there aren't many pages, bs4/re + requests + gevent/threading is enough. Sites like this always somehow feel hard to crawl.
---------------------------------------------------------------
If you can't figure out how the page redirects, bring out Wireshark.
