import json
import re
from urllib.request import urlopen  # urllib usage: https://www.jb51.net/article/65279.htm

# Idea: fetch the page content by URL -> match the needed content with a regex -> write it to a file


def get_page(url):
    """
    Get the page source as a string that can be worked with.
    :param url: the URL to fetch
    :return: a UTF-8 decoded string
    """
    # The response object's read() method returns bytes, which must be decoded.
    respond = urlopen(url)
    return respond.read().decode('utf-8')
    # Without read().decode(), this would return a response object instead of a string.


def parse_page(s_strfile, pattern):
    """
    Match the incoming string against a regular expression to extract the desired content.
    To save time, the regex is compiled once into a pattern object (since the same rule is
    used for every page) and values are extracted by calling methods on that object.
    To save memory, the function is written as a generator that yields one result at a time.
    :param s_strfile: the page source string
    :param pattern: a compiled regex pattern
    :return: a generator of dicts
    """
    # findall() was tried first, but it builds the whole result list in memory at once;
    # finditer() returns an iterator, so results are produced one at a time instead.
    ret = pattern.finditer(s_strfile)
    for i in ret:
        yield {'name': i.group('x_name'),
               'title': i.group('x_title'),
               'time': i.group('x_time')}


def main(page_num, pattern):
    """
    Receive the page number and regex pattern, then fetch, parse, and write to file.
    :param page_num: the page number to crawl
    :param pattern: the compiled regex pattern
    :return: None
    """
    url = 'http://booksky.99lb.net/sodupaihang/page%s' % page_num
    response_html_code = get_page(url)
    ret = parse_page(response_html_code, pattern)
    with open('xiaoshuo_info.txt', 'a', encoding='utf-8') as f:
        for data in ret:
            write_line_str = json.dumps(data, ensure_ascii=False)  # json.dumps returns a string
            f.write(''.join([write_line_str, '\n']))


# Compile the regex rule into a pattern object held in a global variable,
# so it is compiled only once, saving time.
pattern = re.compile(
    '<td class="s">.*?<a href=.*?>(?P<x_name>.*?)</a>.*?<a href=.*?>(?P<x_title>.*?)'
    '</a>.*?<td class="t">(?P<x_time>.*?)</td>',
    re.S)

if __name__ == '__main__':
    for num in range(1, 11):
        main(num, pattern)
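The parsing step above (a compiled pattern with named groups, consumed lazily through `finditer` inside a generator) can be exercised offline with a minimal sketch. The `sample_html` string and its links are invented for illustration; only the named-group/`finditer`/`yield` structure mirrors the script:

```python
import json
import re

# Same shape of regex as the script: named groups, re.S so .*? can cross newlines.
sample_pattern = re.compile(
    r'<td class="s">.*?<a href=.*?>(?P<x_name>.*?)</a>'
    r'.*?<a href=.*?>(?P<x_title>.*?)</a>'
    r'.*?<td class="t">(?P<x_time>.*?)</td>',
    re.S)

# Hypothetical markup, made up for this demo.
sample_html = (
    '<tr><td class="s"><a href="/a/1">Author One</a> '
    '<a href="/b/1">Book One</a></td>'
    '<td class="t">2023-01-01</td></tr>'
    '<tr><td class="s"><a href="/a/2">Author Two</a> '
    '<a href="/b/2">Book Two</a></td>'
    '<td class="t">2023-01-02</td></tr>')

def parse(s, pattern):
    # finditer is lazy: it hands back one match object at a time
    # instead of materializing a full list like findall would.
    for m in pattern.finditer(s):
        yield {'name': m.group('x_name'),
               'title': m.group('x_title'),
               'time': m.group('x_time')}

rows = list(parse(sample_html, sample_pattern))
for row in rows:
    print(json.dumps(row, ensure_ascii=False))
```

Because `parse` is a generator, nothing is matched until the caller iterates; on a large page this keeps memory flat instead of proportional to the number of matches.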
Python crawler learning: first crawl (Quick Glance book rankings)