Python: Crawling Available Proxy IPs

Source: Internet
Author: User
Problem Description

When crawling data, you often run into sites that limit how frequently a single IP can access them. There are generally only two ways around this:

  • Reduce the crawl frequency. This works when the data changes infrequently and the volume is small, but when the data changes often or the volume is huge, this method obviously cannot meet the demand.
  • Use proxy IPs. By switching proxy IPs frequently during the crawl, you can effectively get around the per-IP frequency limit. The difficulty with this approach is obtaining a large number of working proxy IPs.

Obtaining Proxy IPs

There are basically two ways to get proxy IPs:

  • Buy paid proxy IPs. These are generally charged by usage period and number of IPs; the advantage is high reliability.
  • Use free proxies. These can be obtained from free proxy websites, but their stability is poor and most of them expire quickly.

There is not much to say about paid proxies: after paying you generally get a data interface that your program can call directly.

This article therefore focuses on obtaining and filtering free proxy IPs. The tedious work involved should, of course, be left to a program to complete automatically.

Here we take the Xici (West Thorn) proxy site as an example, walk through fetching and parsing its HTTPS proxy list, and give a sample program.

By analyzing the page requests, you can find the actual request address that returns the HTTPS proxy list: http://www.xicidaili.com/wn/{page}, where page=1 for the first page, page=2 for the second page, and so on. Xici updates its proxy IPs every few minutes, so grabbing just the first few pages each time is basically enough.
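As a quick sanity check of that URL scheme (a minimal sketch; it assumes the site still serves pages as described above), you can fetch the first listing page and confirm the response contains the ip_list table that the sample program below relies on:

import requests

# Minimal check of the paging URL described above; a browser-like
# User-Agent is sent because the site may reject bare requests.
url = 'http://www.xicidaili.com/wn/{}'.format(1)
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=30)
print(r.status_code)
print('ip_list table present:', 'ip_list' in r.text)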

The network requests use Python's requests library, and page parsing uses pyquery. You could also use urllib and BeautifulSoup, but I personally find them a little more troublesome.
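For comparison, here is a minimal sketch of the same extraction done with urllib and BeautifulSoup instead (assuming the page layout the sample program below relies on: a table with id ip_list, with the IP in the second column and the port in the third):

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# Sketch of the urllib + BeautifulSoup alternative mentioned above
req = Request('http://www.xicidaili.com/wn/1',
              headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read().decode('utf-8')
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', id='ip_list')
for row in table.find_all('tr')[1:]:  # skip the header row
    cells = row.find_all('td')
    if len(cells) >= 3:
        print(cells[1].get_text() + ':' + cells[2].get_text())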

Enough talk; here is the program. The code is written for Python 3; running it under Python 2 requires slight modifications.

"" "This program is used to obtain the available IP usage 1 from the proxy Web site: Running the file directly will generate the Ips.txt file in the same directory, containing the available agents in the file using Method 2: Other programs import the file, and then use the global variable ' proxies ' that is defined directly within the file. Import random import threading import time from concurrent import Futures import requests to pyquery import Pyquery He Aders = {' user-agent ': ' mozilla/5.0 (Windows NT 6.1; WOW64) applewebkit/537.36 (khtml, like Gecko) \ chrome/53.0.2785.104 safari/537.36 core/1.53.2306.400 QQ browser/9.5.10530.400 '} # Detect proxy IP validity site Check_url = ' https://ip.cn ' # Crawl address (Western thorn agent) Fetch_url = ' Http://www.xicidaili.com/wn /{} ' # crawl pages, 100 pages per page = 3 # proxy type (HTTP/HTTPS) proxy_type = ' HTTPS ' # valid proxy IP list proxies = [] # thread pool, for simultaneous authentication of multiple proxy IP pool = fut Ures.
    Threadpoolexecutor (MAX_WORKERS=50) def add_proxy (PROXY:STR): "" "Add Proxy:p Aram Proxy: Proxy ip+ port number: return: "" "Try:r = Requests.get (Check_url, proxies={proxy_type:proxy}, timeout=30) print (Pyquery (r.cont Ent.decode ()). Find (' #result '). Text (), ' \ n ') if R.status_code = =Proxies:proxies.append (proxy) except Exception as E:if proxy in Proxies:proxies.rem Ove (proxy) print (proxy, E) def fetch_proxy (): "" "" "Grab agent Ip:return:" "for page in range (1, P AGES + 1): R = Requests.get (Fetch_url.format (page), headers=headers) doc = Pyquery (R.content.decode (' Utf-8 ") # Get the list of tables = doc (' #ip_list ') # get all rows = table except the header in the table (' tr:nth-of-
            Type (n+2). Items () # Extract the IP and port number in each row for row in Rows:ip = Row (' Td:nth-of-type (2) '). Text ()
            Port = row (' Td:nth-of-type (3) '). Text () proxy = IP + ': ' + port # in thread pool detect whether the agent is available
            Pool.submit (Add_proxy, proxy) # 10 seconds to crawl next page Time.sleep def run (): while True:try: Fetch_proxy () print (' Active agent: ', proxies) # writes a valid agent to the file with open (' Ips.txt ', ' W '
  coding= ' Utf-8 ') as F:              F.write (' \ n '. Join (proxies)) except Exception as E:print (e) # Grab a break after one time to prevent being blocked Time.sleep (Random.randint (100, 600)) # Starts the crawl thread threading.
 Thread (Target=run). Start ()
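Per the docstring, the second way to use this is to import it from another program. A minimal usage sketch, assuming the file above has been saved as proxy_pool.py (a hypothetical name):

import random
import time

import requests

import proxy_pool  # hypothetical module name for the file above

# Importing the module starts its crawl thread; give it some time to
# collect working proxies before using them.
time.sleep(60)

if proxy_pool.proxies:
    proxy = random.choice(proxy_pool.proxies)
    r = requests.get('https://example.com',
                     proxies={'https': proxy}, timeout=30)
    print(proxy, r.status_code)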

After the program has run for a while, open the ips.txt file to see the working proxy IPs it collected.
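If another script only needs the file output (the first usage from the docstring), loading ips.txt is straightforward; a minimal sketch:

import random

# Load the working proxies the crawler wrote to ips.txt
with open('ips.txt', encoding='utf-8') as f:
    proxies = [line.strip() for line in f if line.strip()]

print(len(proxies), 'proxies loaded')
print('sample:', random.choice(proxies))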


