A summary of website anti-crawler strategies and Python countermeasures

Source: Internet
Author: User

This article introduces common website anti-crawler strategies. Here I summarize the strategies I have run into while writing crawlers, along with the countermeasures for each.

Functionally speaking, a crawler is generally divided into three parts: data acquisition, processing, and storage. Here we only discuss the data acquisition part.

Websites generally defend against crawlers from three angles: the request headers, user behavior, and the site's directory structure and data loading method. The first two are the most common, and most sites counter crawlers from these angles. The third applies to sites that load data with Ajax, which makes crawling harder (dynamically loading a page with Ajax defeats a purely static crawler).

1. Anti-crawling based on request headers is the most common strategy.

Disguise the headers. Many sites check the User-Agent header, and some also check the Referer (some sites' hotlink protection for resources works by checking the Referer). If you run into this kind of mechanism, add headers directly in the crawler: copy the browser's User-Agent into the crawler's headers, or set the Referer to the target site's domain. [Comment: this is easily overlooked. Capture the request with a packet analyzer, work out the Referer, and add it to the simulated request headers in your program.] For header-based detection, modifying or adding headers in the crawler is usually enough to bypass it.
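A minimal sketch of this approach with the requests library (the URL and header values below are placeholders, not taken from any particular site):

import requests

# Hypothetical target page; replace with the URL you actually want to fetch.
url = 'http://example.com/page'

# Copy a real browser's User-Agent and set the Referer to the target domain
# so header-based checks see what looks like a normal browser visit.
headers = {
    'User-Agent': ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36'),
    'Referer': 'http://example.com/',
}

response = requests.get(url, headers=headers, timeout=10)
print(response.status_code)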

2. Anti-crawling based on user behavior

Some sites detect user behavior instead, for example the same IP visiting the same page many times in a short period, or the same account performing the same operation repeatedly in a short period. [To deal with this kind of anti-crawling you need enough IPs.]

(1) Most sites fall into the former case, which can be resolved with IP proxies. You can write a dedicated crawler that scrapes publicly available proxy IPs, tests them, and saves the ones that work. With a large pool of proxy IPs you can switch to a new IP every few requests, which is easy to do with requests or urllib, so the first type of anti-crawling is easily bypassed.

Using a proxy in a crawler with urllib:

Steps:

1. Create a ProxyHandler; its parameter is a dictionary of the form {'type': 'proxy IP:port'}:
   proxy_support = urllib.request.ProxyHandler({})
2. Build a custom opener from it:
   opener = urllib.request.build_opener(proxy_support)
3a. Install the opener globally:
   urllib.request.install_opener(opener)
3b. Or call the opener directly:
   opener.open(url)

Use a pool of proxies and request the target site through a randomly chosen one to counter the anti-crawler:

#!/usr/bin/env python3.4
# -*- coding: utf-8 -*-
# __author__ = "Tyomcat"

import urllib.request
import random
import re

url = 'http://www.whatismyip.com.tw'
iplist = ['121.193.143.249:80', '112.126.65.193:80', '122.96.59.104:82',
          '115.29.98.139:9999', '117.131.216.214:80', '116.226.243.166:8118',
          '101.81.22.21:8118', '122.96.59.107:843']

# Route the request through a randomly chosen proxy from the pool.
proxy_support = urllib.request.ProxyHandler({'http': random.choice(iplist)})
opener = urllib.request.build_opener(proxy_support)
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36')]
urllib.request.install_opener(opener)

response = urllib.request.urlopen(url)
html = response.read().decode('utf-8')
# The page reports the IP it sees inside an <h1> tag.
pattern = re.compile('<h1>(.*?)</h1>', re.S)
print(re.findall(pattern, html))

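The same rotation idea is easy to express with requests as well; here is a rough sketch (the proxy addresses are the same public examples as above and may no longer be live; requests takes them through the proxies parameter):

import random
import requests

# Example proxy pool; in practice, crawl and verify public proxies first.
iplist = ['121.193.143.249:80', '112.126.65.193:80', '122.96.59.104:82']

url = 'http://www.whatismyip.com.tw'

for _ in range(3):
    proxy = random.choice(iplist)
    try:
        # Each request goes out through a different randomly chosen proxy.
        response = requests.get(url, proxies={'http': 'http://' + proxy}, timeout=10)
        print(proxy, response.status_code)
    except requests.RequestException as exc:
        print(proxy, 'failed:', exc)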

(2) For the second case, wait a random interval of several seconds after each request before making the next one. On some sites with logical vulnerabilities you can also bypass the restriction that the same account cannot repeat the same request within a short period: request a few times, log out, log back in, and continue requesting, as sketched below. [Comment: restrictions tied to accounts are generally hard to deal with, and a random delay of a few seconds may still get you blocked. If you have several accounts, switching between them works better.]
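A minimal sketch of the random-interval idea (the URL list is hypothetical):

import random
import time
import requests

# Hypothetical list of pages to crawl.
urls = ['http://example.com/page/{}'.format(i) for i in range(1, 6)]

for url in urls:
    response = requests.get(url, timeout=10)
    print(url, response.status_code)
    # Pause a random 2-5 seconds so the request pattern looks less mechanical.
    time.sleep(random.uniform(2, 5))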

3. Anti-crawling with dynamic pages

The cases above mostly concern static pages. On some sites, the data we need to crawl is fetched through Ajax requests or generated by JavaScript.

Solution: Selenium + PhantomJS

Selenium: an automated web testing tool that drives a complete, real browser environment and can simulate virtually every user action.

PhantomJS: a headless browser, i.e. a browser without a graphical interface.

Get the personal details of a Taobao model (Taobao MM):

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# __author__ = "Tyomcat"

from selenium import webdriver
import time
import re

# Point executable_path at the local PhantomJS binary.
drive = webdriver.PhantomJS(executable_path='phantomjs-2.1.1-linux-x86_64/bin/phantomjs')
drive.get('https://mm.taobao.com/self/model_info.htm?user_id=189942305&is_coment=false')

# Give the Ajax-loaded content time to render before reading the page source.
time.sleep(5)

pattern = re.compile(r'<div.*?mm-p-domain-info">.*?class="mm-p-info-cell clearfix">'
                     r'.*?<li>.*?<label>(.*?)</label><span>(.*?)</span>', re.S)
html = drive.page_source
items = re.findall(pattern, html)
for item in items:
    print(item[0], 'http:' + item[1])
drive.close()

Thank you for reading. I hope this helps, and thank you for supporting this site!
