Python crawler practice (iii): Sogou articles (setting up an IP proxy pool and User-Agent pool in scrapy)
When learning the scrapy crawler framework, you will inevitably need to set up an IP proxy pool and a User-Agent pool to avoid a website's anti-crawling measures.
In the previous article, I described how to make scrapy support HTTP proxies. By default, however, scrapy does not support SOCKS proxies, and a plain HTTP proxy is sometimes easily intercepted by the GFW, so a SOCKS proxy may be required to collect some websites.
One, Manually update the IP pool
Method One:
1. Add the IP pool to the settings.py file:
IPPOOL = [
    {"ipaddr": "61.129.70.131:8080"},
    {"ipaddr": "61.152.81.193:9100"},
    {"ipaddr": "120.204.85.29:3128"},
    {"ipaddr": "219.228.126.86:8123"},
    {"ipaddr": "61.152.81.193:9100"},
    {"ipaddr": "218.82.33.225:53853"},
    {"ipaddr": "223.167.190.17:42789"}
]
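For the pool to take effect, a downloader middleware has to read it and attach a proxy to each request. Below is a minimal sketch of such a middleware (my own illustration; the class name is an assumption, not from the original article):

# middlewares.py -- pick a random proxy from IPPOOL for every request
import random

class RandomIPPoolMiddleware(object):  # hypothetical class name
    def __init__(self, ippool):
        self.ippool = ippool

    @classmethod
    def from_crawler(cls, crawler):
        # Read the IPPOOL list defined in settings.py
        return cls(crawler.settings.getlist('IPPOOL'))

    def process_request(self, request, spider):
        # Attach a randomly chosen proxy to this request
        entry = random.choice(self.ippool)
        request.meta['proxy'] = "http://" + entry["ipaddr"]

Remember to enable the middleware in DOWNLOADER_MIDDLEWARES, as shown at the end of this article.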
These IPs can be obtained from several free proxy websites: Kuaidaili (快代理), 66ip (代理66), Youdaili (有代理), and Xici Daili (西刺代理).
1. http://www.xicidaili.com/wt is a domestic free proxy website.
2. Use Scrapy to crawl that site's IP addresses and ports and write them to a txt document.
3. Write a script to test whether each IP address and port in the txt document is usable (a sketch of such a script follows this list).
4. Write the usable IP addresses and ports to a txt document.
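A minimal sketch of such an availability test (my own illustration; the file names and test URL are assumptions):

# check_proxies.py -- step 3: keep only the ip:port entries that actually work
import requests

def is_alive(ipport, timeout=5):
    try:
        # Fetch a known page through the proxy; a 200 response means the proxy works
        r = requests.get("http://httpbin.org/ip",
                         proxies={"http": "http://" + ipport},
                         timeout=timeout)
        return r.status_code == 200
    except requests.RequestException:
        return False

with open("proxies.txt") as f, open("alive.txt", "w") as out:
    for line in f:
        ipport = line.strip()
        if ipport and is_alive(ipport):
            out.write(ipport + "\n")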
1. Write the Item class. Because we only need the IP address and port, a single attribute is enough.
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
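A minimal sketch of that Item class (the class name is my own; the ipaddr field name follows the IPPOOL entries above):

# items.py -- a single field is enough: the IP address and port as "ip:port"
import scrapy

class IpItem(scrapy.Item):  # hypothetical class name
    ipaddr = scrapy.Field()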
First of all, sorry to keep you waiting. I originally intended to post this update on May 20th (520), but on second thought, only a single dog like me would still be doing research that day, and you probably wouldn't be in the mood to read an update, so it slipped to today. Over the day and a half of the 21st and 22nd, though, I added database support and fixed some bugs (now someone really will say I'm a single dog). Enough nonsense; let's get into today's topic. The previous two articles covered crawling with scrapy.
Start by creating a scrapy project with the following directory structure. Note: there are 3 extra files in the spiders directory: db.py, default.init, and items.json. db.py is my simple wrapper for database access, default.init is my configuration file for the database and proxy settings, and items.json is the final output file. There are two ways to add proxies to a request: the first is to override the start_requests method of your spider; the second is to use a downloader middleware, as described later.
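As a sketch of the first way (my own illustration; the spider name, URL, and proxy address are placeholders), override start_requests so every initial request carries a proxy:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"  # placeholder spider name
    start_urls = ["http://example.com/"]

    def start_requests(self):
        for url in self.start_urls:
            # scrapy's built-in HttpProxyMiddleware honors request.meta['proxy']
            yield scrapy.Request(url, meta={"proxy": "http://YOUR_PROXY_IP:PORT"})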
This article mainly introduces how to use a proxy server when collecting data with scrapy. It covers the technique of using a proxy server from Python and has some reference value; the details follow.
When crawling site content, the most common problem is that the site limits IPs and has anti-scraping measures in place; the best approach is to rotate IPs while crawling (i.e., use proxies). Here's how to configure a proxy for crawling in scrapy:
1. Create a new middlewares.py under the Scrapy project:
# Importing base64 library because we'll need it ONLY in case if the proxy we are going to use requires authentication
import base64
Scrapy disguising proxies and the use of fake_useragent
Disguising the crawler as a browser is enough for servers whose request filtering is not very strict: you do not need to disguise your IP address; just send fake browser information (headers) with the request.
Method 1:
1. Add the following content to the settings.py file; this is the browser header information:
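As a sketch of the idea (my own illustration; the class name is an assumption), the fake_useragent library mentioned in the heading above can also rotate the header from a downloader middleware instead of a fixed list in settings.py:

# middlewares.py -- rotate the User-Agent header with fake_useragent
from fake_useragent import UserAgent

class RandomUserAgentMiddleware(object):  # hypothetical class name
    def __init__(self):
        # UserAgent() loads a cached database of real browser UA strings
        self.ua = UserAgent()

    def process_request(self, request, spider):
        # Send a different realistic browser identity with each request
        request.headers['User-Agent'] = self.ua.random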
This site is relatively simple, so the first crawler code example is as follows:
# -*- coding: utf-8 -*-
'''
Created on June 12, 2017
Get dynamic IP information from the domestic high-anonymity proxy IP website
@see: http://www.xicidaili.com/nn/1
@author: Dzm
'''
import sys
reload(sys)
sys.setdefaultencoding('utf8')
import scrapy
from pyquery import PyQuery as pq
from eie.middlewares import udf_config
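The code above is cut off; a minimal sketch of how such a spider might continue (my own illustration, assuming the class name and that xicidaili lists proxies in a table with id ip_list), using the PyQuery import from the header:

class XiciSpider(scrapy.Spider):  # hypothetical class name
    name = "xici"
    start_urls = ["http://www.xicidaili.com/nn/1"]

    def parse(self, response):
        # Each proxy is a table row; skip the header row, then read the IP and port cells
        doc = pq(response.text)
        for row in doc("#ip_list tr:gt(0)").items():
            tds = row("td")
            ip, port = tds.eq(1).text(), tds.eq(2).text()
            if ip and port:
                yield {"ipaddr": "%s:%s" % (ip, port)}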
Reprinted from: http://www.python_tab.com/html/2014/pythonweb_0326/724.html
Python's method of using a proxy server when collecting data with scrapy
This example describes how Python uses a proxy server to collect data with scrapy, shared for your reference. The details are as follows:
1. Create a new middlewares.py under the Scrapy project:
# Importing base64 library because we'll need it only in case if the proxy we are going to use requires authentication
import base64

# Start your middleware class
class ProxyMiddleware(object):
    # overwrite process request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"
Python scrapy IP proxy settings
In the scrapy project, create a python directory at the same level as the spiders directory and add a py file with the following content:

# encoding: utf-8
import base64

proxyServer = "proxy server address"  # mine is 'http://proxy.abuyun.com:9010'

# Proxy tunnel authentication information, obtained when you apply on that website
proxyUser = "user name"
proxyPass = "password"
proxyAuth = "Basic " + base64.b64encode(proxyUser + ":" + proxyPass)
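The snippet ends at proxyAuth; a sketch of how these values are typically used next (my assumption, following the usual downloader-middleware pattern rather than the original's missing code):

class AbuyunProxyMiddleware(object):  # hypothetical class name
    def process_request(self, request, spider):
        # Route the request through the proxy tunnel...
        request.meta["proxy"] = proxyServer
        # ...and present the tunnel's Basic-auth credentials
        request.headers["Proxy-Authorization"] = proxyAuth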
# To authenticate the proxy,
# you must set the Proxy-Authorization header.
# You *cannot* use the form http://user:pass@proxy:port
# in request.meta['proxy']
Create a middlewares.py file in the same directory as settings.py.
class ProxyMiddleware(object):
    # overwrite process request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"
Then add it to settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'yourproject.middlewares.ProxyMiddleware': 100,  # 'yourproject' stands in for your project's package name
}