Following the approach laid out in the previous post, Python Learning (ii), the first step is to automatically fetch the entries of a specified list from a page on a website. After a few days of tinkering, I have a piece of code that runs normally, as follows:
 1  # web2.py
 2
 3  import re
 4  import urllib.request
 5
 6  def get_msg_for_url(s):
 7
 8      if s == '':
 9          print("Not url!\n")
10          exit()
11
12      ah_whdeps_url = {"ahswht": "http://www.ahwh.gov.cn/"}
13
14      msg_from = {"ahswht": "http://www.ahwh.gov.cn/zz/shwhc/gzdt5/"}
15      msg_re = {"ahswht1": r'<div class="title"><a href="(.*)" title="(.*)" target="_blank">',
16                "ahswht2": r'<div class="time">\[(.*)\]</div>'}
17      # pagination footer pattern: group 1 = entry count, group 2 = "current/total" pages
18      gettotalpagere = r'The current (.*) page of the total (.*) Section </div>'
19      # (literal shown as it appears in this post; it has to match the page's own footer text)
20      res_url = 'aspx/doview.aspx?siteid=52&contentid=0&channelid=432&pchannelid=399&templatetype=2&page='
21      response = urllib.request.urlopen(msg_from[s])
22      html = response.read().decode("gbk")  # decode a web page with Chinese characters
23
24      gtpr = re.findall(gettotalpagere, html)  # get the total number of entries and pages
25      msg_totle = int(gtpr[0][0].strip())  # get the number of messages
26      page_totle = int(gtpr[0][1].split('/')[1].strip())  # get the number of pages
27
28      m = 1
29
30      filename = s + '.txt'
31      f = open(filename, "w")
32
33      for n in range(1, page_totle + 1):
34          response = urllib.request.urlopen(ah_whdeps_url["ahswht"] + res_url + str(n))
35          html = response.read().decode('utf-8', 'ignore')
36
37          gt = re.findall(msg_re[s + '1'], html)  # extract the title and URL of each entry on the page
38
39          gd = re.findall(msg_re[s + '2'], html)  # extract the publish time of each entry on the page
40
41          for i in range(0, len(gt), 1):
42              try:
43                  f.write('%d\n%s\n%s\n%s\n' % (m, ah_whdeps_url[s] + gt[i][0][1:], gt[i][1], gd[i]))
44                  m += 1
45              except UnicodeEncodeError as e:
46                  pass  # skip an entry that cannot be written instead of crashing
47
48      print("There are %d messages saved!" % (m - 1))
49      f.close()
50
51
52  def main():
53      get_msg_for_url("ahswht")
54
55  if __name__ == "__main__":
56      main()
What the code does: line 12 defines the target website; line 14 defines the target web page. Line 15 defines the regular expression that extracts the title and URL of each entry. Line 16 defines the regular expression that extracts the publication time of each entry.
15      msg_re = {"ahswht1": r'<div class="title"><a href="(.*)" title="(.*)" target="_blank">',
16                "ahswht2": r'<div class="time">\[(.*)\]</div>'}
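As a rough illustration of what these two patterns capture, the sketch below runs them against a small hand-written HTML fragment. The fragment, its links, titles and dates are made up for the example and are not taken from the real site. Each entry sits on its own line on purpose: the greedy (.*) would otherwise swallow several entries at once.

```python
import re

# The two patterns described above: (href, title) pairs and publish times.
TITLE_RE = r'<div class="title"><a href="(.*)" title="(.*)" target="_blank">'
TIME_RE = r'<div class="time">\[(.*)\]</div>'

# Hypothetical sample markup, one entry per line (an assumption for this demo).
SAMPLE_HTML = '''
<div class="title"><a href="/zz/a1.html" title="First notice" target="_blank">First notice</a></div>
<div class="time">[2015-03-01]</div>
<div class="title"><a href="/zz/a2.html" title="Second notice" target="_blank">Second notice</a></div>
<div class="time">[2015-03-02]</div>
'''

titles = re.findall(TITLE_RE, SAMPLE_HTML)  # list of (href, title) tuples
times = re.findall(TIME_RE, SAMPLE_HTML)    # list of date strings

for (href, title), day in zip(titles, times):
    print(href, title, day)
```

With two capture groups, re.findall returns a list of tuples, which is why the listing indexes gt[i][0] for the link and gt[i][1] for the title.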
Lines 18-26: get the total number of entries and the number of pages in this list. This is done by parsing a specific string on the page: the pagination footer, which reports the total entry count and the current page out of the total number of pages. Line 25 gets the total number of entries, and line 26 gets the number of pages.
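The parsing step can be sketched in isolation as follows. The footer text and FOOTER_RE below are hypothetical stand-ins (the real page uses its own wording), and parse_totals is a helper name introduced only for this example. The part that mirrors the listing is the split on '/' to turn a "current/total" string into the page count.

```python
import re

# Hypothetical footer wording and pattern; the real page's text differs.
FOOTER_RE = r'Total (.*) entries, page (.*)</div>'
sample = '<div class="page">Total 57 entries, page 1/3</div>'

def parse_totals(html):
    """Return (message_total, page_total), or None if no footer is found."""
    found = re.findall(FOOTER_RE, html)
    if not found:
        return None  # avoids the IndexError that found[0] would raise
    count_text, page_text = found[0]
    msg_total = int(count_text.strip())
    page_total = int(page_text.split('/')[1].strip())  # "1/3" -> 3
    return msg_total, page_total

print(parse_totals(sample))
```

Checking for an empty result first is a small design choice: if the page layout changes, the caller gets None instead of a crash.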
Lines 30-49: fetch every entry in the list and save it to a text file, one record per entry, in the form [serial number\nURL\ntitle\npublication time\n].
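A minimal sketch of that record format, written to an in-memory buffer so it can be tried without touching the disk. The write_records helper and the two sample records are assumptions made for the example.

```python
import io

def write_records(out, records, start=1):
    """Write (url, title, published) tuples as four-line records; return the count."""
    n = start
    for url, title, published in records:
        out.write('%d\n%s\n%s\n%s\n' % (n, url, title, published))
        n += 1
    return n - start

# Hypothetical records, for illustration only.
records = [
    ('http://www.ahwh.gov.cn/zz/a1.html', 'First notice', '2015-03-01'),
    ('http://www.ahwh.gov.cn/zz/a2.html', 'Second notice', '2015-03-02'),
]

buf = io.StringIO()
count = write_records(buf, records)
print(buf.getvalue(), end='')
```

For a real file, passing open(filename, 'w', encoding='utf-8') as out should work the same way, and naming the encoding explicitly makes the output independent of the platform default.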
Personal difficulty: the part of this code that took the most time was line 35, that is, the encoding and decoding of the characters on the page. For now I use a stopgap and simply ignore the problem: whenever the string of a title cannot be handled, skip to the next title rather than let the program crash.
    html = response.read().decode('utf-8', 'ignore')
Our slogan is: "First let the program move, then let it run faster!"
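To make the effect of the 'ignore' argument concrete, here is a small sketch that corrupts a UTF-8 byte string on purpose and decodes it three ways. The sample text and the injected byte are chosen for the demo and have nothing to do with the real page.

```python
# Two Chinese characters, three UTF-8 bytes each.
raw = '中文'.encode('utf-8')
# Inject a byte that is never valid in UTF-8 between the two characters.
broken = raw[:3] + b'\xff' + raw[3:]

try:
    broken.decode('utf-8')          # strict mode is the default
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False               # strict decoding refuses the bad byte

ignored = broken.decode('utf-8', 'ignore')    # bad byte silently dropped
replaced = broken.decode('utf-8', 'replace')  # bad byte becomes U+FFFD

print(strict_ok, ignored, replaced)
```

'ignore' keeps the program running but loses data without any trace; 'replace' leaves a visible marker, which may make damaged titles easier to spot later.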
——————————————————————
Building on this first step, the regular expressions for the specified page can be separated out: instead of being hardcoded in the program, they are saved in a separate text file. That way, to fetch the entries under a given column of another site, you only need to add the corresponding rules to the text file; no new code has to be written.
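One possible shape for such a rule file is JSON, sketched below. The file name rules.json, the key names (base_url, list_url, title_re, time_re) and the load_rules helper are all assumptions for illustration; the post does not fix a format. The demo writes the file into a temporary directory first so that it is self-contained.

```python
import json
import os
import tempfile

def load_rules(path):
    """Read the extraction rules for all sites from a JSON text file."""
    with open(path, encoding='utf-8') as fh:
        return json.load(fh)

# Hypothetical rule set: one entry per site, keyed by a short site name.
demo_rules = {
    'ahswht': {
        'base_url': 'http://www.ahwh.gov.cn/',
        'list_url': 'http://www.ahwh.gov.cn/zz/shwhc/gzdt5/',
        'title_re': r'<div class="title"><a href="(.*)" title="(.*)" target="_blank">',
        'time_re': r'<div class="time">\[(.*)\]</div>',
    },
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'rules.json')
    with open(path, 'w', encoding='utf-8') as fh:
        json.dump(demo_rules, fh, indent=2)  # json handles the regex escaping
    rules = load_rules(path)

print(sorted(rules))
```

Supporting another site would then mean adding one more key to the file, with no change to the code.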
As a result, the whole project divides cleanly into three parts: the input file, the processing module, and the output file. The input file defines the rules for extracting information; the processing module reads those rules from the input file, fetches the relevant information according to them, and writes it out in a fixed format; the output file stores the information.
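The three-part split might look roughly like the sketch below. Everything here is an assumption layered on the description: the function names extract and run, the rule keys, and the fetch parameter, which stands in for the network call so that the example runs offline against a fake page.

```python
import io
import re

def extract(rule, html):
    """Processing step: apply one site's rules to one page of HTML."""
    titles = re.findall(rule['title_re'], html)
    times = re.findall(rule['time_re'], html)
    # Strip leading slashes from the href before joining it to the base URL.
    return [(rule['base_url'] + href.lstrip('/'), title, day)
            for (href, title), day in zip(titles, times)]

def run(rules, fetch, out):
    """Input rules -> processing -> output; fetch maps a URL to page text."""
    saved = 0
    for rule in rules.values():
        html = fetch(rule['list_url'])
        for url, title, day in extract(rule, html):
            saved += 1
            out.write('%d\n%s\n%s\n%s\n' % (saved, url, title, day))
    return saved

# Hypothetical rules and a fake page, so no network access is needed.
rules = {'ahswht': {
    'base_url': 'http://www.ahwh.gov.cn/',
    'list_url': 'http://www.ahwh.gov.cn/zz/shwhc/gzdt5/',
    'title_re': r'<div class="title"><a href="(.*)" title="(.*)" target="_blank">',
    'time_re': r'<div class="time">\[(.*)\]</div>',
}}
fake_page = ('<div class="title"><a href="/zz/a1.html" title="First notice" '
             'target="_blank">First notice</a></div>\n'
             '<div class="time">[2015-03-01]</div>\n')

out = io.StringIO()
saved = run(rules, lambda url: fake_page, out)
print(saved)
print(out.getvalue(), end='')
```

Passing fetch in as a parameter keeps the processing module testable; in real use it could wrap urllib.request.urlopen and the decoding step.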
Python Learning (iii)