Writing a web crawler in Python


I. Preparations

To build a small web crawler program, you need to prepare the following (a short warm-up example follows the list):

1. An understanding of the basic HTTP protocol

2. Familiarity with the urllib2 library interface

3. Familiarity with Python regular expressions
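As a warm-up for the three prerequisites, here is a minimal Python 2 sketch that issues an HTTP request with urllib2 and extracts fields from the HTML with a regular expression. The URL and the pattern are placeholders chosen for illustration, not values from this article:

import re
import urllib2

url = "http://www.example.com/"                  # placeholder URL
html = urllib2.urlopen(url, timeout=5).read()    # plain HTTP GET via urllib2

link_pattern = re.compile(r'href="(.+?)"')       # non-greedy capture of each link target
links = link_pattern.findall(html)
print "found %s links" % len(links)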

II. Programming approach

This is only a basic web crawler program. Its overall flow is as follows (a minimal skeleton of the flow appears after the list):

1. Find the webpage to be crawled, view its source code, and work out the HTML patterns of the page.

2. Use the urllib2 library to download the required webpage.

3. Use regular expressions to extract the required information from the page.

4. Check the validity of the extracted data, that is, filter the information.

5. Store the valid data, for example in a file or a database.
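Expressed as code, the five steps reduce to a skeleton like the one below. The URL, the pattern, the is_valid() check, and the output file name are all hypothetical and only illustrate the shape of the flow:

import re
import urllib2

def is_valid(record):
    # placeholder validity check (step 4); the article's real check
    # validates proxies by actually connecting through them
    return record is not None and len(record) > 0

def crawl(url, pattern, out_fname):
    html = urllib2.urlopen(url, timeout=5).read()    # step 2: download the page
    records = re.findall(pattern, html)              # step 3: extract fields with a regex
    valid = [r for r in records if is_valid(r)]      # step 4: filter the data
    f = open(out_fname, 'w')                         # step 5: store the valid data
    for r in valid:
        f.write("%s\n" % str(r))
    f.close()

# crawl("http://www.example.com/", r'href="(.+?)"', "output.txt")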

III. Example: a web proxy server crawler

Find a proxy server website; the example below uses www.cnproxy.com. The crawler mainly relies on regular expressions and the urllib2 library. The code is as follows:

import re
import urllib2
import Queue
import threading
import time
# (these imports are shared by all of the code fragments in this article)

proxylist1 = Queue.Queue()   # holds every proxy IP/port record scraped from the pages

# cnproxy.com obfuscates the port digits with JavaScript; this table maps the
# letter codes used in the script back to the real digits.
portdicts = {'z': "3", 'm': "4", 'a': "2", 'l': "9", 'f': "0", 'b': "5",
             'i': "7", 'w': "6", 'x': "8", 'c': "1", 'r': "8", 'd': "0"}

def get_proxy_from_cnproxy():
    global proxylist1
    p = re.compile(r'''<tr><TD>(.+?)<SCRIPT type=text/javascript>document.write\(":"\+(.+?)\)</SCRIPT></TD><TD>(.+?)</TD><TD>.+?</TD><TD>(.+?)</TD></tr>''')

    for i in range(1, 11):                   # pages 1-10
        target = r"http://www.cnproxy.com/proxy%d.html" % i
        print target
        req = urllib2.urlopen(target)
        result = req.read()
        matchs = p.findall(result)
        # print matchs

        for row in matchs:
            ip = row[0]
            port = row[1]
            if port is None:
                continue
            tmp = port.split('+')

            # Some letter codes in the obfuscated port have no entry in
            # portdicts; filter those records out.
            flag = 0
            for x in tmp:
                if x not in portdicts:
                    flag = 1
                    break
            if flag == 1:
                continue

            port = map(lambda x: portdicts[x], port.split('+'))
            port = ''.join(port)
            agent = row[2]
            addr = row[3].decode("cp936").encode("utf-8")

            l = [ip, port, agent, addr]
            print l
            proxylist1.put(l)

    print "page 1-10 size: %s nums proxy info" % proxylist1.qsize()

The code above extracts the required fields from each page and stores them in the queue proxylist1. Next, every record in proxylist1 must be verified to determine whether the proxy is actually usable; records that pass the check are placed in a second queue, proxycheckedlist, and the valid records are then sorted by response time and saved to a file. The code is as follows:

proxycheckedlist = Queue.Queue()   # holds the proxies that passed the validity check

class proxycheck(threading.Thread):
    def __init__(self, fname):
        threading.Thread.__init__(self)
        self.timeout = 5
        # self.test_url = "http://www.baidu.com/"
        # self.test_str = "030173"
        # self.test_url = "http://www.so.com/"
        # self.test_str = '000000'
        self.test_url = "http://www.renren.com"
        self.test_str = "110000000009"
        self.fname = fname
        self.checkedproxylist = []

    def checkproxy(self):
        threadpool = []
        for i in range(10):   # create 10 checker threads and put them in the pool
            threadpool.append(ck_process(self.test_url, self.test_str, self.timeout, i))

        # start the 10 threads so they verify proxies at the same time
        map(lambda x: x.start(), threadpool)
        # wait for every thread to exit
        map(lambda x: x.join(), threadpool)

        while proxycheckedlist.empty() == False:
            try:
                content = proxycheckedlist.get_nowait()
            except Exception, e:
                print e
            else:
                self.checkedproxylist.append(content)

        print "the checked proxylist contains: %s nums records" % len(self.checkedproxylist)
        for info in self.checkedproxylist:
            print info

    def sort(self):
        # sort the checked proxies by response time (in place)
        self.checkedproxylist.sort(cmp=lambda x, y: cmp(x[4], y[4]))

    def save(self):
        f = open(self.fname, 'w+')
        for proxy in self.checkedproxylist:
            f.write("%s:%s\t%s\t%s\t%s\n" % (proxy[0], proxy[1], proxy[2], proxy[3], proxy[4]))
        f.close()

    def run(self):
        self.checkproxy()
        self.sort()
        self.save()
        print 'Done'

This class inherits from threading.Thread, and its flow is driven by run(), which follows the approach analyzed above. Briefly, checkproxy() creates 10 threads, starts all of them at once with map(), and then waits for them to exit; afterwards it drains the data that has accumulated in the queue proxycheckedlist. The 10 threads all do the same job: each takes up to 10 proxy records (IP address, port, and so on) from the queue proxylist1 and then checks the validity of those records one by one. The validity check works as follows:
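In isolation, the create/start/join pattern used by checkproxy() looks like the sketch below; the worker class here is a dummy stand-in, not code from the article:

import threading

class worker(threading.Thread):
    def __init__(self, count):
        threading.Thread.__init__(self)
        self.count = count

    def run(self):
        print "worker %s running" % self.count

# create 10 threads, start them all at once, then wait for all of them to exit
threadpool = [worker(i) for i in range(10)]
map(lambda x: x.start(), threadpool)
map(lambda x: x.join(), threadpool)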

1. Use urllib2.HTTPCookieProcessor() to create a cookie handler: cookies = urllib2.HTTPCookieProcessor()

2. Create a proxy handler object: proxy_handler = urllib2.ProxyHandler({"http": r'http://%s:%s' % (proxy[0], proxy[1])})

3. Bind the proxy handler and the cookie handler into an opener: opener = urllib2.build_opener(cookies, proxy_handler)

4. Install the opener so that subsequent requests go through it: urllib2.install_opener(opener)

Finally, the proxy is used to access a test website. If data containing the expected string is returned within the specified timeout, the proxy is considered valid and is placed in the queue proxycheckedlist; otherwise the loop moves on to the next proxy. The code is as follows:

lock_que = threading.Lock()        # protects reads from proxylist1
lock_que_cked = threading.Lock()   # protects writes to proxycheckedlist

class ck_process(threading.Thread):
    '''Thread class used to verify the validity of proxy IPs in parallel.'''

    def __init__(self, test_url, test_str, timeout, count):
        threading.Thread.__init__(self)
        self.proxy_contain = []
        self.test_url = test_url
        self.test_str = test_str
        self.checkedproxylist = []
        self.timeout = timeout
        self.count = count

    def run(self):
        cookies = urllib2.HTTPCookieProcessor()   # construct a cookie handler
        # print "I'm thread process No.%s" % self.count

        while proxylist1.empty() == False:
            if lock_que.acquire():                # lock obtained successfully
                if proxylist1.qsize() >= 10:
                    number = 10
                else:
                    number = proxylist1.qsize()

                for i in range(number):           # take up to 10 proxies from the source queue
                    proxy = proxylist1.get_nowait()
                    self.proxy_contain.append(proxy)
                    # print "%s thread process: %s" % (self.count, self.proxy_contain)
                lock_que.release()

            # each thread processes up to 10 proxies per batch
            for proxy in self.proxy_contain:
                # construct a proxy handler object
                proxy_handler = urllib2.ProxyHandler({"http": r'http://%s:%s' % (proxy[0], proxy[1])})
                # bind the proxy handler to the cookie handler
                opener = urllib2.build_opener(cookies, proxy_handler)
                # simulate a browser by adding an HTTP User-Agent header;
                # the headers are given as a list of tuples
                opener.addheaders = [('User-Agent',
                                      'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31')]
                urllib2.install_opener(opener)    # install the opener for later requests

                t1 = time.time()                  # record the current time
                try:
                    # some proxies cannot open the test site at all
                    req = urllib2.urlopen(self.test_url, timeout=self.timeout)
                    result = req.read()
                    timeused = time.time() - t1
                    pos = result.find(self.test_str)
                    if pos > 1:
                        self.checkedproxylist.append((proxy[0], proxy[1], proxy[2], proxy[3], timeused))
                    else:
                        continue
                except Exception, e:
                    # print e.message
                    continue

        if len(self.checkedproxylist) != 0:
            if lock_que_cked.acquire():
                for proxy in self.checkedproxylist:
                    proxycheckedlist.put(proxy)
                lock_que_cked.release()

            print "%s thread process: out: %s nums" % (self.count, len(self.checkedproxylist))
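To tie the two phases together, a small driver like the following could be used. The article itself does not show a main routine, so the file name proxies.txt and the way the phases are chained here are illustrative assumptions:

if __name__ == "__main__":
    # phase 1: scrape pages 1-10 of cnproxy.com and fill proxylist1
    get_proxy_from_cnproxy()

    # phase 2: verify the proxies, sort them by response time, and save them
    # ("proxies.txt" is a placeholder output file name)
    checker = proxycheck("proxies.txt")
    checker.start()
    checker.join()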

(End)
