Source for this article: http://blog.privatenode.in/torifying-scrapy-project-on-ubuntu/
When crawling with Scrapy, a high-frequency crawl easily gets your IP address blocked. To avoid this, you can crawl anonymously through Tor, with the Polipo proxy server installed in front of it so that Scrapy can talk to Tor over HTTP.
Note: Some of the sites used below may be blocked, so you may need to get around the firewall to reach them.
Install Tor
Download: https://www.torproject.org/download/download.html.en
Download the Expert Bundle and extract it to a directory, for example D:\Tor. This version has no graphical interface, which makes changing the configuration cumbersome, so you can download Vidalia to control Tor. Vidalia: https://people.torproject.org/~erinn/vidalia-standalone-bundles/. Download vidalia-standalone-0.2.21-win32-1_zh-CN.exe from the bottom of that page. After installation completes, run Start Vidalia.exe with administrator privileges and make the following settings.
Click Start Tor
After a short while, Vidalia shows that the connection succeeded.
Download and install Polipo
Download: http://www.pps.univ-paris-diderot.fr/~jch/software/files/polipo/
Select polipo-1.1.0-win32.zip, download and unzip it, then edit the extracted file config.sample and add the following configuration at the beginning of the file:
socksParentProxy = "localhost:9050"
diskCacheRoot = ""
Open cmd in this directory and run the program: polipo.exe -c config.sample
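Before moving on, it can help to confirm that both local proxies are actually listening. The snippet below is a minimal sketch, assuming Tor's SOCKS port is 9050 (the port named in the Polipo configuration) and Polipo's HTTP port is 8123 (the address used later in settings.py). The helper name is_port_open is made up for illustration.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


def proxy_status(ports):
    """Map each service name to True/False depending on whether its port accepts connections."""
    return {name: is_port_open("127.0.0.1", port) for name, port in ports.items()}


# Assumed ports: Tor SOCKS proxy on 9050, Polipo HTTP proxy on 8123.
LOCAL_PROXIES = {"Tor": 9050, "Polipo": 8123}
```

Calling proxy_status(LOCAL_PROXIES) returns a dictionary with one entry per service. If either value is False, the corresponding program is probably not running or is listening on a different port.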
Open the Edge browser and set its proxy to Polipo's address, 127.0.0.1:8123
Then open https://check.torproject.org/ in the browser
If the page confirms that you are using Tor, the configuration is successful
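The same check can be made without a browser. The following is a minimal sketch, assuming Polipo listens on http://127.0.0.1:8123 and that check.torproject.org exposes a JSON endpoint at /api/ip with an "IsTor" field. Both the endpoint and the function names are assumptions here, not something the original article uses.

```python
import json
import urllib.request

POLIPO_PROXY = "http://127.0.0.1:8123"  # assumed Polipo address, same as HTTP_PROXY in settings.py
CHECK_URL = "https://check.torproject.org/api/ip"  # assumed JSON endpoint


def parse_is_tor(body):
    """Return True if the JSON body reports that the request came through Tor."""
    data = json.loads(body)
    return bool(data.get("IsTor", False))


def check_tor(proxy=POLIPO_PROXY, url=CHECK_URL, timeout=30):
    """Fetch the check URL through the proxy and report whether Tor is in use."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return parse_is_tor(response.read().decode("utf-8"))
```

With Tor and Polipo running, check_tor() is expected to return True. An exception usually means the proxy is not reachable.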
Configure Scrapy
Add the following content to the settings.py file
# A more comprehensive list can be found at
# http://techpatterns.com/forums/about304.html
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7',
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0) Gecko/16.0 Firefox/16.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10',
]

HTTP_PROXY = 'http://127.0.0.1:8123'

DOWNLOADER_MIDDLEWARES = {
    'myspider.middlewares.RandomUserAgentMiddleware': 400,  # replace myspider with your project name
    'myspider.middlewares.ProxyMiddleware': 410,  # same as above
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,  # disable the built-in user-agent middleware
}
Create a new middlewares.py file in the Scrapy project's package directory (next to settings.py, so that it can be imported as myspider.middlewares) and enter the following
import random
from scrapy.conf import settings
from scrapy import log


class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # Pick one of the configured user agents at random
        ua = random.choice(settings.get('USER_AGENT_LIST'))
        if ua:
            request.headers.setdefault('User-Agent', ua)
        # This is just to check which user agent is being used for the request
        spider.log(
            u'User-Agent: {} {}'.format(request.headers.get('User-Agent'), request),
            level=log.DEBUG
        )


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Send the request through Polipo, which forwards it to Tor
        request.meta['proxy'] = settings.get('HTTP_PROXY')
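The code above imports scrapy.conf and scrapy.log, modules that newer Scrapy releases no longer provide. The following is a minimal sketch of the same two middlewares for such releases, assuming the setting names USER_AGENT_LIST and HTTP_PROXY in settings.py. It reads the settings through the from_crawler hook and logs with the standard logging module. It is an adaptation for illustration, not the original author's code.

```python
import logging
import random

logger = logging.getLogger(__name__)


class RandomUserAgentMiddleware:
    """Pick a random User-Agent from USER_AGENT_LIST for each request."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook with the crawler, which exposes the project settings.
        return cls(crawler.settings.get("USER_AGENT_LIST", []))

    def process_request(self, request, spider):
        if self.user_agents:
            # setdefault keeps a User-Agent that the spider set explicitly.
            request.headers.setdefault("User-Agent", random.choice(self.user_agents))
        logger.debug("User-Agent: %s %s", request.headers.get("User-Agent"), request)


class ProxyMiddleware:
    """Route every request through the proxy given by HTTP_PROXY."""

    def __init__(self, proxy):
        self.proxy = proxy

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.get("HTTP_PROXY"))

    def process_request(self, request, spider):
        if self.proxy:
            request.meta["proxy"] = self.proxy
```

With newer releases, the built-in middleware to disable in DOWNLOADER_MIDDLEWARES is 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware' rather than the scrapy.contrib path.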
At this point, the integration of Scrapy and Tor is complete. The author accepts no responsibility for the consequences of anyone using this method.
Configure Tor Browser
The following is unrelated to the setup above; it is just a note on using Tor Browser. The Tor download page offers several download options, the first of which is Tor Browser, a browser for accessing web pages anonymously. Tor Browser automatically launches Tor's background process and connects through the Tor network. Privacy-sensitive data, such as HTTP cookies and browsing history, is deleted automatically when the program is closed, which helps protect against eavesdropping and preserves privacy on the Internet.
After downloading and installing Tor Browser (the first download option), configure it as follows
Because direct connections to Tor are blocked by the firewall, configure a bridge
Get bridges: https://bridges.torproject.org/options
Copy the bridge lines and paste them into Tor Browser
Sometimes the connection fails; in that case, request a new bridge and try again
Using Scrapy with Tor for anonymous crawling on Windows 10