Python: Crawling Available Proxy IPs

Source: Internet
Author: User
Problem Description

When crawling data, you often run into sites that limit how frequently a single IP can access them. There are generally only two ways around this:

  • Reduce the crawl frequency. This works when the data changes infrequently and the volume is small, but when the data changes often or the volume is huge, this method obviously cannot meet the demand.
  • Use proxy IPs. By switching proxy IPs frequently during the crawl, you can effectively get around the per-IP frequency limit. The difficulty with this approach is obtaining a large number of working proxy IPs.

Obtaining Proxy IPs

There are basically two ways to get proxy IPs:

  • Buy paid proxy IPs. These are generally charged by usage period and number of IPs; the advantage is high reliability.
  • Use free proxies. These can be obtained from free proxy websites, but their stability is poor and most of them expire quickly.

There is not much to say about paid proxies: after paying you generally get a data interface that your program can call directly.

This article therefore focuses on obtaining and filtering free proxy IPs. The tedious work involved should, of course, be left to a program to complete automatically.

Here we take the Xici (West Thorn) proxy site as an example, walk through fetching and parsing its HTTPS proxy list, and give a sample program.

By analyzing the page requests, you can find the actual request address that returns the HTTPS proxy list: http://www.xicidaili.com/wn/{page}, where page=1 for the first page, page=2 for the second page, and so on. Xici updates its proxy IPs every few minutes, so grabbing just the first few pages each time is basically enough.
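As a quick sanity check of that URL scheme (a minimal sketch; it assumes the site still serves pages as described above), you can fetch the first listing page and confirm the response contains the ip_list table that the sample program below relies on:

import requests

# Minimal check of the paging URL described above; a browser-like
# User-Agent is sent because the site may reject bare requests.
url = 'http://www.xicidaili.com/wn/{}'.format(1)
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=30)
print(r.status_code)
print('ip_list table present:', 'ip_list' in r.text)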

The network requests use Python's requests library, and page parsing uses pyquery. You could also use urllib and BeautifulSoup, but I personally find them a little more troublesome.
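For comparison, here is a minimal sketch of the same extraction done with urllib and BeautifulSoup instead (assuming the page layout the sample program below relies on: a table with id ip_list, with the IP in the second column and the port in the third):

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# Sketch of the urllib + BeautifulSoup alternative mentioned above
req = Request('http://www.xicidaili.com/wn/1',
              headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read().decode('utf-8')
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', id='ip_list')
for row in table.find_all('tr')[1:]:  # skip the header row
    cells = row.find_all('td')
    if len(cells) >= 3:
        print(cells[1].get_text() + ':' + cells[2].get_text())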

Enough talk; here is the program. The code is written for Python 3; running it under Python 2 requires slight modifications.

"" "This program is used to obtain the available IP usage 1 from the proxy Web site: Running the file directly will generate the Ips.txt file in the same directory, containing the available agents in the file using Method 2: Other programs import the file, and then use the global variable ' proxies ' that is defined directly within the file. Import random import threading import time from concurrent import Futures import requests to pyquery import Pyquery He Aders = {' user-agent ': ' mozilla/5.0 (Windows NT 6.1; WOW64) applewebkit/537.36 (khtml, like Gecko) \ chrome/53.0.2785.104 safari/537.36 core/1.53.2306.400 QQ browser/9.5.10530.400 '} # Detect proxy IP validity site Check_url = ' https://ip.cn ' # Crawl address (Western thorn agent) Fetch_url = ' Http://www.xicidaili.com/wn /{} ' # crawl pages, 100 pages per page = 3 # proxy type (HTTP/HTTPS) proxy_type = ' HTTPS ' # valid proxy IP list proxies = [] # thread pool, for simultaneous authentication of multiple proxy IP pool = fut Ures.
    Threadpoolexecutor (MAX_WORKERS=50) def add_proxy (PROXY:STR): "" "Add Proxy:p Aram Proxy: Proxy ip+ port number: return: "" "Try:r = Requests.get (Check_url, proxies={proxy_type:proxy}, timeout=30) print (Pyquery (r.cont Ent.decode ()). Find (' #result '). Text (), ' \ n ') if R.status_code = =Proxies:proxies.append (proxy) except Exception as E:if proxy in Proxies:proxies.rem Ove (proxy) print (proxy, E) def fetch_proxy (): "" "" "Grab agent Ip:return:" "for page in range (1, P AGES + 1): R = Requests.get (Fetch_url.format (page), headers=headers) doc = Pyquery (R.content.decode (' Utf-8 ") # Get the list of tables = doc (' #ip_list ') # get all rows = table except the header in the table (' tr:nth-of-
            Type (n+2). Items () # Extract the IP and port number in each row for row in Rows:ip = Row (' Td:nth-of-type (2) '). Text ()
            Port = row (' Td:nth-of-type (3) '). Text () proxy = IP + ': ' + port # in thread pool detect whether the agent is available
            Pool.submit (Add_proxy, proxy) # 10 seconds to crawl next page Time.sleep def run (): while True:try: Fetch_proxy () print (' Active agent: ', proxies) # writes a valid agent to the file with open (' Ips.txt ', ' W '
  coding= ' Utf-8 ') as F:              F.write (' \ n '. Join (proxies)) except Exception as E:print (e) # Grab a break after one time to prevent being blocked Time.sleep (Random.randint (100, 600)) # Starts the crawl thread threading.
 Thread (Target=run). Start ()
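Per the docstring, the second way to use this is to import it from another program. A minimal usage sketch, assuming the file above has been saved as proxy_pool.py (a hypothetical name):

import random
import time

import requests

import proxy_pool  # hypothetical module name for the file above

# Importing the module starts its crawl thread; give it some time to
# collect working proxies before using them.
time.sleep(60)

if proxy_pool.proxies:
    proxy = random.choice(proxy_pool.proxies)
    r = requests.get('https://example.com',
                     proxies={'https': proxy}, timeout=30)
    print(proxy, r.status_code)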

After the program has run for a while, open the ips.txt file to see the working proxy IPs it collected.
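If another script only needs the file output (the first usage from the docstring), loading ips.txt is straightforward; a minimal sketch:

import random

# Load the working proxies the crawler wrote to ips.txt
with open('ips.txt', encoding='utf-8') as f:
    proxies = [line.strip() for line in f if line.strip()]

print(len(proxies), 'proxies loaded')
print('sample:', random.choice(proxies))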


