# -*- coding: utf-8 -*-
# Read drug business enterprise data from the Beijing FDA (BJDA)
# 20161125 zhangshaohua
import re
import urllib.request


def getcontent(url, pat, charset):
    # Fetch the page at url, decode it with charset, and return every match of the regular expression pat
    with urllib.request.urlopen(url) as page:
        content = page.read().decode(charset)
    return re.findall(pat, content)


# Read the home page
url = 'http://www.bjda.gov.cn/eportal/ui?pageId=331148'
# Get the total number of records; the site shows 20 per page.
# The label text in these patterns is shown in translation; it must match the text on the live page.
zjls = getcontent(url, r'Total number of records:(\d{1,5}),', 'UTF-8')
vdzjls = int(round(int(zjls[0]) / 20, 0))

# The original run starts at page 51
for i in range(51, vdzjls):
    url = 'http://www.bjda.gov.cn/eportal/ui?pageId=331148&currentpage=' + str(i)
    # Each "View" link on a list page carries the artileid of one record
    pattern = r'artileid=(.*)">View'
    page_id = getcontent(url, pattern, 'UTF-8')
    for url_id in page_id:
        suburl = 'http://www.bjda.gov.cn/eportal/ui?pageId=331631&artileid=' + url_id
        qymc = getcontent(suburl, r'Enterprise name:</th>\r\n.*?<td>(.*?)</td>', 'UTF-8')
        # \s{0,3} tolerates a line break just before the closing </td>
        zcdz = getcontent(suburl, r'registered address:</th>\r\n.*?<td>(.*?)\s{0,3}</td>', 'UTF-8')
        xkzh = getcontent(suburl, r'License number:</th>\r\n.*?<td>(.*?)</td>', 'UTF-8')
        print(qymc, zcdz, xkzh)
        # Append one comma-separated line per enterprise; the with block closes the file even if a write fails
        with open('Bjda.txt', 'a', encoding='utf-8') as file_object:
            file_object.write(qymc[0] + ',' + zcdz[0] + ',' + xkzh[0] + '\n')

print('Drug retail enterprise Read done!')
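A side note on the page arithmetic: the script derives the number of pages by rounding the record count divided by 20. Rounding can drop a final, partly filled page, so a ceiling division may be safer. The sketch below only illustrates the arithmetic; the function name `page_count` and the record count 1005 are made up for the example, and whether the site numbers its pages from 1 is an assumption that has not been checked.

```python
import math


def page_count(total_records, page_size=20):
    """Pages needed to list total_records at page_size records per page."""
    return math.ceil(total_records / page_size)


total = 1005  # hypothetical record count, not taken from the site

print(round(total / 20))   # -> 50, the last 5 records would be skipped
print(page_count(total))   # -> 51
```

If pages are numbered from 1, looping over `range(1, page_count(total) + 1)` would visit every page.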
After the practice of reading the HDA data, reading the BJDA data went more smoothly at first. An error occurred at record 996, once again caused by a line break;
after repeated trial and error, adding '\s{0,3}' to the pattern resolved it.
Regular expressions are something I need to keep studying, so that I keep improving and can get through smoothly the next time I run into a "pit".
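The line-break problem can be reproduced offline. The sketch below is only an illustration: the two HTML fragments, the "Address:" label and the helper name `extract` are invented for the example and are not copied from the BJDA pages. It shows why a pattern ending in `(.*?)</td>` finds nothing when a line break sits before `</td>` (by default `.` does not match a newline), and how `\s{0,3}` absorbs the break.

```python
import re

# Hypothetical fragments modelled on a label cell followed by a value cell.
CLEAN = "<th>Address:</th>\r\n    <td>No. 1 Example Road</td>"
BROKEN = "<th>Address:</th>\r\n    <td>No. 2 Example Road\r\n</td>"

# Strict pattern: the value must be followed directly by </td>.
STRICT = r"Address:</th>\r\n.*?<td>(.*?)</td>"
# Tolerant pattern: up to three whitespace characters may precede </td>.
TOLERANT = r"Address:</th>\r\n.*?<td>(.*?)\s{0,3}</td>"


def extract(pattern, html):
    """Return every captured value that the pattern finds in html."""
    return re.findall(pattern, html)


print(extract(STRICT, CLEAN))     # -> ['No. 1 Example Road']
print(extract(STRICT, BROKEN))    # -> []
print(extract(TOLERANT, CLEAN))   # -> ['No. 1 Example Road']
print(extract(TOLERANT, BROKEN))  # -> ['No. 2 Example Road']
```

An empty result is what makes an expression such as `zcdz[0]` raise an IndexError, which is consistent with the script stopping at a single record.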
Python3: reading BJDA drug business enterprise data