Anonymous crawling with Scrapy and Tor on Windows 10

Source: Internet
Author: User

This article is based on: http://blog.privatenode.in/torifying-scrapy-project-on-ubuntu/

When Scrapy crawls at high frequency, the target site can easily block your IP address. To avoid this, you can crawl anonymously through Tor. Tor exposes a SOCKS proxy, while Scrapy expects an HTTP proxy, so the Polipo proxy server is installed as well to bridge the two.

Note: The steps below may require getting around the firewall first.

Install Tor

Download: https://www.torproject.org/download/download.html.en

Download the Expert Bundle and extract it to a directory, for example D:\Tor. This version has no graphical interface, so changing its configuration is tedious. To make this easier, you can control Tor through Vidalia, available at https://people.torproject.org/~erinn/vidalia-standalone-bundles/. Download vidalia-standalone-0.2.21-win32-1_zh-CN.exe from the bottom of that page. After the installation completes, run Start Vidalia.exe with administrator privileges and configure it.

Click Start Tor.

After a while, Vidalia shows that the connection succeeded.

Download and install Polipo

Download: http://www.pps.univ-paris-diderot.fr/~jch/software/files/polipo/

Select polipo-1.1.0-win32.zip, download and unzip it, then edit the extracted file config.sample and add the following configuration at the beginning of the file. It tells Polipo to forward requests to Tor's SOCKS port and disables the disk cache:

    socksParentProxy = "localhost:9050"
    socksProxyType = socks5
    diskCacheRoot = ""

From a cmd prompt in that directory, start Polipo with: polipo.exe -c config.sample
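Before moving on to the browser, it can help to confirm that both proxies are accepting connections. The following is a minimal sketch, assuming Tor listens on its SOCKS port 9050 and Polipo on its default HTTP port 8123; the function names `port_is_open` and `check_proxy_chain` are made up for this illustration.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


def check_proxy_chain(host="127.0.0.1", ports=None):
    """Check each named port and return a {name: is_open} mapping.

    The default ports are assumptions: 9050 for Tor's SOCKS listener
    and 8123 for Polipo's HTTP listener.
    """
    if ports is None:
        ports = {"Tor SOCKS": 9050, "Polipo HTTP": 8123}
    return {name: port_is_open(host, port) for name, port in ports.items()}
```

Calling `check_proxy_chain()` returns a dictionary with one entry per port. A `False` entry suggests that the corresponding program is not running or is listening on a different port.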

Open the Edge browser and set its HTTP proxy to 127.0.0.1:8123 (Polipo's default port).

Then visit https://check.torproject.org/ in the browser.

If the page reports that the browser is configured to use Tor, the configuration succeeded.
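The same check can be made from Python instead of a browser. This is a rough sketch, not part of the original setup: it assumes Polipo is listening on http://127.0.0.1:8123 and that the Tor check service's JSON endpoint at https://check.torproject.org/api/ip returns the fields `IsTor` and `IP`. The helper names are hypothetical.

```python
import json
import urllib.request

POLIPO_PROXY = "http://127.0.0.1:8123"  # assumed Polipo address
CHECK_URL = "https://check.torproject.org/api/ip"  # assumed JSON endpoint


def make_proxy_opener(proxy_url=POLIPO_PROXY):
    """Build a urllib opener that sends HTTP and HTTPS through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def parse_check_response(body):
    """Extract (is_tor, ip) from the JSON text returned by the check service."""
    data = json.loads(body)
    return bool(data.get("IsTor")), data.get("IP")


def is_using_tor(proxy_url=POLIPO_PROXY, timeout=30):
    """Fetch the check URL through the proxy and report whether Tor is in use."""
    opener = make_proxy_opener(proxy_url)
    with opener.open(CHECK_URL, timeout=timeout) as resp:
        return parse_check_response(resp.read().decode("utf-8"))
```

With Tor and Polipo running, `is_using_tor()` should return a tuple whose first element is `True` and whose second element is the exit node's IP address.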

Configure Scrapy

Add the following content to the settings.py file

# A more comprehensive list can be found at
# http://techpatterns.com/forums/about304.html
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7',
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0) Gecko/16.0 Firefox/16.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10',
]

# Polipo's HTTP proxy address
HTTP_PROXY = 'http://127.0.0.1:8123'

DOWNLOADER_MIDDLEWARES = {
    'myspider.middlewares.RandomUserAgentMiddleware': 400,  # replace myspider with your project name
    'myspider.middlewares.ProxyMiddleware': 410,  # same as above
    # Disable the built-in user agent middleware. On Scrapy 1.0 and later the path is
    # 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware'.
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}

Create a new middlewares.py file in the Scrapy project's package directory (the same directory as settings.py) and enter the following

import random

# scrapy.conf and scrapy.log are legacy modules; later Scrapy releases removed them
from scrapy.conf import settings
from scrapy import log


class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # Pick a random user agent from the list defined in settings.py
        ua = random.choice(settings.get('USER_AGENT_LIST'))
        if ua:
            request.headers.setdefault('User-Agent', ua)
            # This is just to check which user agent is being used for the request
            spider.log(
                u'User-Agent: {} {}'.format(request.headers.get('User-Agent'), request),
                level=log.DEBUG
            )


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Scrapy reads the proxy address from the lowercase 'proxy' meta key
        request.meta['proxy'] = settings.get('HTTP_PROXY')
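The `scrapy.conf` and `scrapy.log` modules used above are not available in recent Scrapy releases. As a hedged sketch of how the same two middlewares might look there, the version below reads settings through Scrapy's `from_crawler` hook and logs with the standard `logging` module. It assumes the setting names `USER_AGENT_LIST` and `HTTP_PROXY` from settings.py, and it is an adaptation rather than the original author's code.

```python
import logging
import random

logger = logging.getLogger(__name__)


class RandomUserAgentMiddleware:
    """Set a random User-Agent taken from the USER_AGENT_LIST setting."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook with the running crawler; its settings
        # object exposes the values defined in settings.py.
        return cls(crawler.settings.getlist("USER_AGENT_LIST"))

    def process_request(self, request, spider):
        if self.user_agents:
            request.headers.setdefault("User-Agent", random.choice(self.user_agents))
            # Log which user agent is used for this request.
            logger.debug("User-Agent: %s %s", request.headers.get("User-Agent"), request)
        return None  # let Scrapy continue processing the request


class ProxyMiddleware:
    """Route every request through the HTTP proxy named in HTTP_PROXY."""

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.get("HTTP_PROXY"))

    def process_request(self, request, spider):
        if self.proxy_url:
            # The built-in HttpProxyMiddleware reads the lowercase 'proxy' key.
            request.meta["proxy"] = self.proxy_url
        return None
```

With this variant, the entry that disables the built-in user agent middleware in DOWNLOADER_MIDDLEWARES would use the path `scrapy.downloadermiddlewares.useragent.UserAgentMiddleware`.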

At this point, the integration of Scrapy and Tor is complete. The author accepts no responsibility for the consequences of anyone using this method.

Configure Tor Browser

The following is unrelated to the setup above; it is just a note on how to use Tor Browser. The Tor download page offers several download options, and the first one is Tor Browser, which lets you browse the web anonymously. Tor Browser automatically starts the Tor background process and connects through the Tor network. Privacy-sensitive data, such as HTTP cookies and browsing history, is deleted automatically as soon as the program is closed, which helps prevent eavesdropping and protects your privacy online.

After downloading and installing Tor Browser, configure it as follows.

Because the firewall blocks direct connections to Tor, configure a bridge.

Get bridges: https://bridges.torproject.org/options

Copy the bridge lines and paste them into Tor Browser.

Sometimes the connection fails; in that case, request a new bridge and try again.
