Writing a web crawler in Python


I. Preparations

To build a small web crawler program, you need to prepare the following (a short warm-up example follows the list):

1. An understanding of the basic HTTP protocol

2. Familiarity with the urllib2 library interface

3. Familiarity with Python regular expressions
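As a warm-up for the three prerequisites, here is a minimal Python 2 sketch that issues an HTTP request with urllib2 and extracts fields from the HTML with a regular expression. The URL and the pattern are placeholders chosen for illustration, not values from this article:

import re
import urllib2

url = "http://www.example.com/"                  # placeholder URL
html = urllib2.urlopen(url, timeout=5).read()    # plain HTTP GET via urllib2

link_pattern = re.compile(r'href="(.+?)"')       # non-greedy capture of each link target
links = link_pattern.findall(html)
print "found %s links" % len(links)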

II. Programming approach

This is only a basic web crawler program. Its overall flow is as follows (a minimal skeleton of the flow appears after the list):

1. Find the webpage to be crawled, view its source code, and work out the HTML patterns of the page.

2. Use the urllib2 library to download the required webpage.

3. Use regular expressions to extract the required information from the page.

4. Check the validity of the extracted data, that is, filter the information.

5. Store the valid data, for example in a file or a database.
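Expressed as code, the five steps reduce to a skeleton like the one below. The URL, the pattern, the is_valid() check, and the output file name are all hypothetical and only illustrate the shape of the flow:

import re
import urllib2

def is_valid(record):
    # placeholder validity check (step 4); the article's real check
    # validates proxies by actually connecting through them
    return record is not None and len(record) > 0

def crawl(url, pattern, out_fname):
    html = urllib2.urlopen(url, timeout=5).read()    # step 2: download the page
    records = re.findall(pattern, html)              # step 3: extract fields with a regex
    valid = [r for r in records if is_valid(r)]      # step 4: filter the data
    f = open(out_fname, 'w')                         # step 5: store the valid data
    for r in valid:
        f.write("%s\n" % str(r))
    f.close()

# crawl("http://www.example.com/", r'href="(.+?)"', "output.txt")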

III. Example: a web proxy server crawler

Find a proxy server website; the example below uses www.cnproxy.com. The crawler mainly relies on regular expressions and the urllib2 library. The code is as follows:

import re
import urllib2
import Queue
import threading
import time
# (these imports are shared by all of the code fragments in this article)

proxylist1 = Queue.Queue()   # holds every proxy IP/port record scraped from the pages

# cnproxy.com obfuscates the port digits with JavaScript; this table maps the
# letter codes used in the script back to the real digits.
portdicts = {'z': "3", 'm': "4", 'a': "2", 'l': "9", 'f': "0", 'b': "5",
             'i': "7", 'w': "6", 'x': "8", 'c': "1", 'r': "8", 'd': "0"}

def get_proxy_from_cnproxy():
    global proxylist1
    p = re.compile(r'''<tr><TD>(.+?)<SCRIPT type=text/javascript>document.write\(":"\+(.+?)\)</SCRIPT></TD><TD>(.+?)</TD><TD>.+?</TD><TD>(.+?)</TD></tr>''')

    for i in range(1, 11):                   # pages 1-10
        target = r"http://www.cnproxy.com/proxy%d.html" % i
        print target
        req = urllib2.urlopen(target)
        result = req.read()
        matchs = p.findall(result)
        # print matchs

        for row in matchs:
            ip = row[0]
            port = row[1]
            if port is None:
                continue
            tmp = port.split('+')

            # Some letter codes in the obfuscated port have no entry in
            # portdicts; filter those records out.
            flag = 0
            for x in tmp:
                if x not in portdicts:
                    flag = 1
                    break
            if flag == 1:
                continue

            port = map(lambda x: portdicts[x], port.split('+'))
            port = ''.join(port)
            agent = row[2]
            addr = row[3].decode("cp936").encode("utf-8")

            l = [ip, port, agent, addr]
            print l
            proxylist1.put(l)

    print "page 1-10 size: %s nums proxy info" % proxylist1.qsize()

The code above extracts the required fields from each page and stores them in the queue proxylist1. Next, every record in proxylist1 must be verified to determine whether the proxy is actually usable; records that pass the check are placed in a second queue, proxycheckedlist, and the valid records are then sorted by response time and saved to a file. The code is as follows:

proxycheckedlist = Queue.Queue()   # holds the proxies that passed the validity check

class proxycheck(threading.Thread):
    def __init__(self, fname):
        threading.Thread.__init__(self)
        self.timeout = 5
        # self.test_url = "http://www.baidu.com/"
        # self.test_str = "030173"
        # self.test_url = "http://www.so.com/"
        # self.test_str = '000000'
        self.test_url = "http://www.renren.com"
        self.test_str = "110000000009"
        self.fname = fname
        self.checkedproxylist = []

    def checkproxy(self):
        threadpool = []
        for i in range(10):   # create 10 checker threads and put them in the pool
            threadpool.append(ck_process(self.test_url, self.test_str, self.timeout, i))

        # start the 10 threads so they verify proxies at the same time
        map(lambda x: x.start(), threadpool)
        # wait for every thread to exit
        map(lambda x: x.join(), threadpool)

        while proxycheckedlist.empty() == False:
            try:
                content = proxycheckedlist.get_nowait()
            except Exception, e:
                print e
            else:
                self.checkedproxylist.append(content)

        print "the checked proxylist contains: %s nums records" % len(self.checkedproxylist)
        for info in self.checkedproxylist:
            print info

    def sort(self):
        # sort the checked proxies by response time (in place)
        self.checkedproxylist.sort(cmp=lambda x, y: cmp(x[4], y[4]))

    def save(self):
        f = open(self.fname, 'w+')
        for proxy in self.checkedproxylist:
            f.write("%s:%s\t%s\t%s\t%s\n" % (proxy[0], proxy[1], proxy[2], proxy[3], proxy[4]))
        f.close()

    def run(self):
        self.checkproxy()
        self.sort()
        self.save()
        print 'Done'

This class inherits from threading.Thread, and its flow is driven by run(), which follows the approach analyzed above. Briefly, checkproxy() creates 10 threads, starts all of them at once with map(), and then waits for them to exit; afterwards it drains the data that has accumulated in the queue proxycheckedlist. The 10 threads all do the same job: each takes up to 10 proxy records (IP address, port, and so on) from the queue proxylist1 and then checks the validity of those records one by one. The validity check works as follows:
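In isolation, the create/start/join pattern used by checkproxy() looks like the sketch below; the worker class here is a dummy stand-in, not code from the article:

import threading

class worker(threading.Thread):
    def __init__(self, count):
        threading.Thread.__init__(self)
        self.count = count

    def run(self):
        print "worker %s running" % self.count

# create 10 threads, start them all at once, then wait for all of them to exit
threadpool = [worker(i) for i in range(10)]
map(lambda x: x.start(), threadpool)
map(lambda x: x.join(), threadpool)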

1. Use urllib2.HTTPCookieProcessor() to create a cookie handler: cookies = urllib2.HTTPCookieProcessor()

2. Create a proxy handler object: proxy_handler = urllib2.ProxyHandler({"http": r'http://%s:%s' % (proxy[0], proxy[1])})

3. Bind the proxy handler and the cookie handler into an opener: opener = urllib2.build_opener(cookies, proxy_handler)

4. Install the opener so that subsequent requests go through it: urllib2.install_opener(opener)

Finally, the proxy is used to access a test website. If data containing the expected string is returned within the specified timeout, the proxy is considered valid and is placed in the queue proxycheckedlist; otherwise the loop moves on to the next proxy. The code is as follows:

lock_que = threading.Lock()        # protects reads from proxylist1
lock_que_cked = threading.Lock()   # protects writes to proxycheckedlist

class ck_process(threading.Thread):
    '''Thread class used to verify the validity of proxy IPs in parallel.'''

    def __init__(self, test_url, test_str, timeout, count):
        threading.Thread.__init__(self)
        self.proxy_contain = []
        self.test_url = test_url
        self.test_str = test_str
        self.checkedproxylist = []
        self.timeout = timeout
        self.count = count

    def run(self):
        cookies = urllib2.HTTPCookieProcessor()   # construct a cookie handler
        # print "I'm thread process No.%s" % self.count

        while proxylist1.empty() == False:
            if lock_que.acquire():                # lock obtained successfully
                if proxylist1.qsize() >= 10:
                    number = 10
                else:
                    number = proxylist1.qsize()

                for i in range(number):           # take up to 10 proxies from the source queue
                    proxy = proxylist1.get_nowait()
                    self.proxy_contain.append(proxy)
                    # print "%s thread process: %s" % (self.count, self.proxy_contain)
                lock_que.release()

            # each thread processes up to 10 proxies per batch
            for proxy in self.proxy_contain:
                # construct a proxy handler object
                proxy_handler = urllib2.ProxyHandler({"http": r'http://%s:%s' % (proxy[0], proxy[1])})
                # bind the proxy handler to the cookie handler
                opener = urllib2.build_opener(cookies, proxy_handler)
                # simulate a browser by adding an HTTP User-Agent header;
                # the headers are given as a list of tuples
                opener.addheaders = [('User-Agent',
                                      'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31')]
                urllib2.install_opener(opener)    # install the opener for later requests

                t1 = time.time()                  # record the current time
                try:
                    # some proxies cannot open the test site at all
                    req = urllib2.urlopen(self.test_url, timeout=self.timeout)
                    result = req.read()
                    timeused = time.time() - t1
                    pos = result.find(self.test_str)
                    if pos > 1:
                        self.checkedproxylist.append((proxy[0], proxy[1], proxy[2], proxy[3], timeused))
                    else:
                        continue
                except Exception, e:
                    # print e.message
                    continue

        if len(self.checkedproxylist) != 0:
            if lock_que_cked.acquire():
                for proxy in self.checkedproxylist:
                    proxycheckedlist.put(proxy)
                lock_que_cked.release()

            print "%s thread process: out: %s nums" % (self.count, len(self.checkedproxylist))
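To tie the two phases together, a small driver like the following could be used. The article itself does not show a main routine, so the file name proxies.txt and the way the phases are chained here are illustrative assumptions:

if __name__ == "__main__":
    # phase 1: scrape pages 1-10 of cnproxy.com and fill proxylist1
    get_proxy_from_cnproxy()

    # phase 2: verify the proxies, sort them by response time, and save them
    # ("proxies.txt" is a placeholder output file name)
    checker = proxycheck("proxies.txt")
    checker.start()
    checker.join()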

(End)
