This article describes how to use a proxy with the Python Scrapy crawler framework when collecting data.

Source: Internet
Author: User
Tags: python, scrapy


1. Create "middlewares.py" under the Scrapy project directory

# Import the base64 library; we only need it if the proxy requires authentication
import base64

# Start your middleware class
class ProxyMiddleware(object):
    # Overwrite process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up basic authentication for the proxy; strip() removes the
        # trailing newline that base64.encodestring appends
        encoded_user_pass = base64.encodestring(proxy_user_pass).strip()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
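Note that base64.encodestring only exists on Python 2; it was deprecated in Python 3 and removed entirely in Python 3.9. If you are on Python 3, a minimal equivalent sketch (using the same placeholder credentials) looks like this:

import base64

proxy_user_pass = "USERNAME:PASSWORD"
# b64encode operates on bytes, so encode the credentials first,
# then decode the result back into a str for the header value
encoded_user_pass = base64.b64encode(proxy_user_pass.encode('utf-8')).decode('ascii')
request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass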

2. Add the following to the project configuration file (./project_name/settings.py)

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}
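The numbers are middleware priorities: for process_request, middlewares with lower values run first, so ProxyMiddleware (100) sets request.meta['proxy'] before the built-in HttpProxyMiddleware (110) applies it. The scrapy.contrib path above is from pre-1.0 Scrapy releases; on Scrapy 1.0 or later the built-in middleware lives at scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware, so the equivalent setting would be:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}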

In only two steps, every request now passes through the proxy. Test it ^_^:

from scrapy.contrib.spiders import CrawlSpider

class TestSpider(CrawlSpider):
    name = "test"
    domain_name = "whatismyip.com"
    # The following url is subject to change, you can get the last updated one from here:
    # http://www.whatismyip.com/faq/automation.asp
    start_urls = ["http://xujian.info"]

    def parse(self, response):
        # Save the page so you can inspect which IP address the target site saw
        with open('test.html', 'wb') as f:
            f.write(response.body)
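Run the spider from the project root:

scrapy crawl test

If the proxy is working, the page saved to test.html should report the proxy's IP address rather than your own.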

3. Use a random user-agent

By default, Scrapy sends every request with the same user-agent, which makes the crawler easy for websites to detect and block. The following code randomly picks a user-agent from a predefined list for each page it collects.

Add the following code to settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'Crawler.comm.rotate_useragent.RotateUserAgentMiddleware': 400,
}

Note: Crawler is the name of your project, i.e. the top-level directory that contains the spider code; the path above assumes the middleware below is saved as Crawler/comm/rotate_useragent.py. Setting the built-in UserAgentMiddleware to None disables it, so only the rotating middleware sets the header.

#!/usr/bin/python
# -*- coding: utf-8 -*-

import random

from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Randomly select a user-agent from the list below
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)

    # The default user_agent_list contains Chrome user-agent strings for
    # Windows, Linux, Chrome OS and Mac OS X.
    # For more user-agent strings, see:
    # http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
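To check that the rotation actually works, one option is a throwaway spider against httpbin.org, which echoes back the User-Agent header it receives. This is a minimal sketch (the spider name and class are made up for illustration):

from scrapy.spider import BaseSpider

class UACheckSpider(BaseSpider):
    name = "uacheck"
    # httpbin.org/user-agent responds with the User-Agent header it received
    start_urls = ["http://httpbin.org/user-agent"]

    def parse(self, response):
        # Run the spider several times; the printed value should change
        print(response.body)

Run it a few times with scrapy crawl uacheck and the reported user-agent should vary across runs.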
