Why can't scrapy crawl the Central Commission for Discipline Inspection website?

Reply content:

No matter what device you use, the first time you access the site it returns a 521 status code along with a Set-Cookie header.
When a browser receives that status code and cookie, it makes the request again; because it honored the Set-Cookie header, the second request carries the previously received cookie in its Request Headers. That second request succeeds.

This anti-crawler technique is very basic: it exploits the fact that browsers and naive crawlers handle the status code and Set-Cookie differently.

To crawl the site, you only need one initial request to obtain the cookie and save it; then attach the saved cookie to every subsequent request.
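A minimal sketch of that flow with the requests library, assuming the cookie really is delivered in a plain Set-Cookie header on the 521 response (some 521 challenges generate the cookie with JavaScript instead, which this sketch does not handle):

```python
import requests

URL = "http://www.ccdi.gov.cn/"  # the site from the question

session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

# First request: comes back 521, but the Session stores the Set-Cookie.
first = session.get(URL)
print(first.status_code, session.cookies.get_dict())

# Second request: the Session attaches the saved cookie automatically,
# so this one should return 200 with the real page.
second = session.get(URL)
print(second.status_code)
```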

The best approach for a crawler is to imitate a real user's browser behavior as closely as possible. I hesitated for a while over whether to answer, given the label attached to this question ...... (°A°')
---------------------------------------------------------------
Chrome is better than Python.



@LinCanbin's method is correct, but on sites like this the cookie usually stays valid for a long time, so you can simply copy the Cookie value from the Request Headers in your browser and put it into the request you construct. Open the URL in your browser and look at the Request Headers. The main anti-crawling checks are:
1. Cookie: you can copy your own cookie over by hand; for sites that require login, such as Weibo, you can also obtain it through a simulated login;
2. Host: nothing to say here; just include it, it never changes;
3. Referer: some sites are pickier and will not only check that a Referer is present but also verify that it points to a valid page (I won't name names, for fear of getting into trouble);
4. User-Agent: information about the user's environment, such as browser and operating system; include as much as possible. Ideally keep a UA table and pick a random entry each time you construct a request (see the sketch after this list).
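One way those four checks might be covered when building a request; the host, cookie value, and UA strings here are placeholders rather than the real site's values:

```python
import random
import requests

# A tiny UA table; in practice keep a longer one and refresh it occasionally.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]

def build_headers(referer):
    """Cover the four checks above: Cookie, Host, Referer, User-Agent."""
    return {
        "Cookie": "name=value-copied-from-browser",  # 1. paste from your browser
        "Host": "www.example.com",                   # 2. fixed per site
        "Referer": referer,                          # 3. must look like a real page
        "User-Agent": random.choice(USER_AGENTS),    # 4. rotate per request
    }

resp = requests.get("http://www.example.com/list",
                    headers=build_headers("http://www.example.com/"))
```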

Beyond the headers, though, the most important thing is to work on the IP address.

1. Forge the X-Forwarded-For header: this is the easiest way to fake a source IP, and also the easiest to see through;
2. Use TCP packet injection: this is what the Scapy packet-crafting library (not Scrapy) is built for; check its documentation. You can also use it to forge a source IP for a SYN-flood denial-of-service attack (getting off topic ......);
3. Use a proxy IP pool: this is the most reliable approach. The drawback is that proxy quality is often uneven, so you need a script to maintain the pool, fetching new proxies and removing dead ones from the database (I usually test them one by one against Baidu; see the sketch after this list).
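A sketch of the kind of maintenance script meant in point 3, probing each proxy against Baidu as suggested (the proxy addresses are made up):

```python
import requests

CHECK_URL = "https://www.baidu.com"  # probe target suggested in the answer

def is_alive(proxy, timeout=5):
    """Return True if the proxy can still fetch the probe URL."""
    try:
        r = requests.get(CHECK_URL,
                         proxies={"http": proxy, "https": proxy},
                         timeout=timeout)
        return r.status_code == 200
    except requests.RequestException:
        return False

def prune_pool(pool):
    """Drop dead proxies; run this periodically and also add fresh ones."""
    return [p for p in pool if is_alive(p)]

pool = ["http://10.0.0.1:8080", "http://10.0.0.2:3128"]  # made-up entries
pool = prune_pool(pool)
print(pool)
```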

If the content you want is loaded through AJAX, your task is to find the request that the JS sends and simulate it following the same routine above; a sketch of replaying such a request follows. If that fails, bring out the artifact: http://jeanphix.me/Ghost.py/
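The endpoint and parameters below are hypothetical; find the real ones in the browser's Network panel (filter by XHR) and copy them over:

```python
import requests

# Hypothetical JSON endpoint discovered in the browser's Network panel.
api = "http://www.example.com/api/articles"
params = {"page": 1, "size": 20}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "http://www.example.com/",
    "X-Requested-With": "XMLHttpRequest",  # many AJAX endpoints check this
}

resp = requests.get(api, params=params, headers=headers)
print(resp.json())  # AJAX endpoints usually return JSON, not HTML
```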
---------------------------------------------------------------
Crawlers (爬虫, literally "crawling worms") are worms: they fear the Central Commission for Discipline Inspection and dare not crawl its website.
---------------------------------------------------------------
For a site with only a few pages, I recommend writing the crawler yourself rather than using scrapy directly; it's more flexible. When there aren't many pages, bs4/re + requests + gevent/threading is enough. Sites like this always somehow feel hard to crawl.
---------------------------------------------------------------
If you can't figure out how the page redirects, bring out Wireshark.
