My first small crawler program-python

Last Update:2018-07-28 Source: Internet

Author: User

Tags server port

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Climb anything. Server port proxy for the Crawling proxy Web site agent type location Update Date today score the overall rating available speed assessment information, such a Web page has seven or eight, fortunately, the site Ming name is very regular

In particular, the information in this HTML code that crawls a lot of

<span class= "Tbbottomline" style= "width:140px;" >
     58.222.254.11
 </span>
 <span class= "Tbbottomline" style= "width:50px;" >
         3128
 </span>
 <span class= "Tbbottomline" style= "width:70px;" >
     High Hide
 </span>
 <span class= "Tbbottomline" style= "width:70px;" >
     China
 </span>
 <span class= "Tbbottomline" style= "width:80px;" >
     September 05
 </span>
 <span class= "Tbbottomline" style= "width:80px;" >
     3.44 (70 votes)
 </span>
 <span class= "Tbbottomline" style= "width:60px;" >
     3.44
 </span>
 <span class= "Tbbottomline" style= "width:30px;" >
     14 days
 </span>

How to climb. Use this library in Python urllib2, of course, also use regular, here not to say that my is very complex (it is really unexpected how to simplify)

Import re import urllib2 orgi=r ' http://www.proxy360.cn/Region/' proxywebs=[', ' Brazil ', ' America ', ' Taiwan ', ' Japan ', ' Thailand ', ' Vietnam ', ' Bahrein '] def get_proxy_info (): F=open (' proxy.txt ', ' w+ ') p=re.compile ("<span[^&gt ;] *>[\s]* (\s+?) [\s]*</span>[\s]*<span[^>]*>[\s]* (\s+?) [\s]*</span>[\s]*<span[^>]*>[\s]* (\s+?) [\s]*</span>[\s]*<span[^>]*>[\s]* (\s+?) [\s]*</span>[\s]*<span[^>]*>[\s]* (\s+?) [\s]*</span>[\s]*<span[^>]*>[\s]* (\s+?) [\s]*</span>[\s]*<span[^>]*>[\s]* (\s+?) [\s]*</span>[\s]*<span[^>]*>[\s]* (\s+?) [\s]*</span> "") ' ' Server port agent kind of area update date today's score the overall rating available speed assessment ' for s ' in proxywebs:target=orgi+
        S Req=urllib2.urlopen (target) Mystr=req.read () Tmp=p.findall (mystr) for T in Tmp:f.writ E (' \ t '. Join (t)) f.write (' \ n ') F.flush () f.close () if __name__== "__main__": GET_PROxy_info ()

Principle, or process
Super simple, get the source code of the specified page, the return type is a string, we use regular expression to match to the part we want to grab, use the group to extract further, then save or after the data analysis, obviously this step I did not do

Attention matters

The encoding of the Web page is generally UTF8, in case it is not. When I write, save the time garbled, and then know to pay attention to the encoding format

Regular writing, this Test foundation, in fact, or else, you can be very dead to write, although I do not recommend, but I will only write this in the beginning

<span class= "Tbbottomline" style= "width:140px;" >\r\n (. +?) \r\n</span>
<span class= "Tbbottomline" style= "width:50px"; >\r\n (. +?) \r\n</span>
<span class= "Tbbottomline" style= "width:70px"; >\r\n (. +?) \r\n</span>
<span class= "Tbbottomline" style= "width:70px"; >\r\n (. +?) \r\n</span>
<span class= "Tbbottomline" style= "width:80px"; >\r\n (. +?) \r\n</span>
<span class= "Tbbottomline" style= "width:80px"; >\r\n (. +?) \r\n</span>
<span class= "Tbbottomline" style= "width:60px"; >\r\n (. +?) \r\n</span>
<span class= "Tbbottomline" style= "width:30px"; >\r\n (. +?) \r\n</span>

It's bad, but I can climb down and-_-# results.

Note: Environment: python2.7 wingide

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More