Python Learning (iii)

Last Update:2016-05-23 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

In accordance with the previous Python Learning (ii) approach, the first step is to automatically get the information from the specified list from a page on a Web site. Toss a few days, get a piece of code that can run normally, as follows:

1 #web2.py2 3 ImportRe4 Importurllib.request5 6 defGet_msg_for_url (s):7     8     ifs = ="':9         Print("Not url!\n")Ten exit () One      AAh_whdeps_url = {"AHSWHT":"http://www.ahwh.gov.cn/"} -      -msg_from={"AHSWHT":"http://www.ahwh.gov.cn/zz/shwhc/gzdt5/"} themsg_re={"ahswht1": R'<div class= "title" ><a href= "(. *)" title= "(. *)" target= "_blank" >', -             "ahswht2": R'<div class= "Time" >\[(. *) \]</div>'    } -      -Gettotalpagere =r'The current (. *) page of the total (. *) Section </div>' +      -Res_url ='aspx/doview.aspx?siteid=52&contentid=0&channelid=432&pchannelid=399&templatetype=2& Page=' +Response=Urllib.request.urlopen (Msg_from[s]) Ahtml = Response.read (). Decode ("GBK")#decoding a webpage with Chinese characters at  -GTPR = Re.findall (gettotalpagere, HTML)#get the total number of information and pages -msg_totle = Int (Gtpr[0][0].strip ())#get the numbers of message -page_totle = Int (Gtpr[0][1].split ('/') [1].strip ())#get the numbers of page -      -m = 1 in      -filename = s+'. txt' tof = open (filename,"W") +      -      forNinchRange (1, page_totle+1): theResponse = Urllib.request.urlopen (ah_whdeps_url["AHSWHT"]+res_url+str (n)) *html = Response.read (). Decode ('Utf-8','Ignore') $         Panax NotoginsengGT = Re.findall (msg_re[s+'1'], HTML)#extract the title and URL of the news in the URL page -          theGD = Re.findall (msg_re[s+'2'], HTML)#Extract the Publish time from the URL page +      A          forIinchRange (0,len (GT), 1): the             Try: +F.write ('%d\n%s\n%s\n%s\n'% (m,ah_whdeps_url[s]+gt[i][0][1:], gt[i][1], Gd[i])) -M + = 1 $             exceptUnicodeencodeerror as E: $                 Pass         -      -     Print("There is%d messages to be saved!"% (m-1)) the f.close () -     Wuyi  the defMain (): -Get_msg_for_url ("AHSWHT") Wu      - if __name__=="__main__": AboutMain ()

Code function Description: line 12th, define the specified website; line 14th defines the specified Web page. The 15th row defines a regular expression for extracting information, which is the title and URL for each message, such as. Line 16th, which defines the regular expression that extracts the time of each message publication.

     msg_re={"ahswht1": R ' <div class= "title" ><a href= "(. *)" title= "(. *)" target= "_blank" > ", 16             "ahswht2": R ' <div class= "Time" >\[(. *) \]</div> '    }

第18-26: Gets the number of bars and pages for all information in this list. You do this by parsing a specific string on the page: "The total section is currently on page". The 25th line gets the total number of messages, and the 26th line gets the number of pages.

第30-40: Gets all the information under the list and saves it to a text file (such as) in the form of [line number \ n \ n [] \ n].

Personal difficulty: The most time to write this code is the 35th line, that is, the characters on the page encoding and decoding problems. Now is the expedient, direct disregard, that is, once the string of a header row is not decoded, jump to the next heading up, rather than let the program break. html = Response.read (). Decode (' utf-8 ', ' ignore ') Our slogan is: "First let the program move, then let it run faster!"

——————————————————————



In the first step, you can separate the regular expression for the specified page, not hardcoded in the program, but save it in a separate text file. In this case, if you want to get another site to specify the information below the column, just need to add the corresponding rules in the text file, no longer important to write code.

As a result, the whole project is clearly divided into three parts: one is the input file; the other is the processing module; The third is the output file. wherein, the input file definition obtains the information the rule, the processing module is responsible for reads the information extraction rule from the input file, obtains the related information according to the rule, then the obtained information in the fixed format, the output file saves the information.

Python Learning (iii)

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

Python Learning (iii)

Contact Us

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support