9.3.2 Web crawler
A web crawler is commonly used to collect pages or files of interest from the Internet; combined with data processing and analysis techniques, it can extract deeper information from them. The following code implements a web crawler that collects all links on a specified web page, and allows a set of keywords and a crawl depth to be specified.
import sys
import multiprocessing
import re
import os
import urllib.request as lib


def craw_links(url, depth, keywords, processed):
    """
    :param url: the URL to crawl
    :param depth: remaining crawl depth
    :param keywords: a tuple of keywords to search for
    :param processed: list of URLs that have already been crawled
    :return: None
    """
    if url.startswith(('http://', 'https://')):
        if url not in processed:
            # mark this URL as processed
            processed.append(url)
        else:
            # avoid processing the same URL again
            return

        print('crawling ' + url + '...')
        fp = lib.urlopen(url)  # make a request to the URL
        # Python 3 returns bytes, so the content needs to be decoded
        contents_decoded = fp.read().decode('utf-8')
        fp.close()  # the page text has been fully read at this point

        pattern = '|'.join(keywords)
        # if this page contains any of the keywords, save it to a file
        flag = False
        searched = None
        if pattern:
            # use a regular expression to match the keywords in the page text
            searched = re.search(pattern, contents_decoded)
        else:
            # if no keywords are given, save the current page unconditionally
            flag = True

        if flag or searched:
            fname = url.replace(':', '_').replace('/', '_')
            with open(os.path.join('craw', fname), 'w', encoding='utf-8') as f:
                f.write(contents_decoded)

        # find all links on the current page
        links = re.findall('href="(.*?)"', contents_decoded)

        # crawl all links found on the current page
        for link in links:
            # turn a relative path into an absolute URL
            if not link.startswith(('http://', 'https://')):
                try:
                    index = url.rindex('/')
                    link = url[0:index + 1] + link
                except ValueError:
                    pass
            if depth > 0 and link.endswith(('.htm', '.html')):
                craw_links(link, depth - 1, keywords, processed)


if __name__ == '__main__':
    processed = []
    keywords = ('datetime', 'KeyWord2')
    if not os.path.exists('craw') or not os.path.isdir('craw'):
        os.mkdir('craw')
    craw_links(r'https://docs.python.org/3/library/index.html', 1, keywords, processed)
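The listing imports multiprocessing but never uses it. As a minimal sketch of how a process pool could speed up the crawl (not part of the original listing; the helper name craw_links_parallel, the workers parameter, and the restriction to absolute links are assumptions made here for illustration), the links found on the start page can each be handed to a separate worker process, with a Manager list standing in for the shared processed list:

import multiprocessing
import re
import urllib.request as lib

def craw_links_parallel(url, depth, keywords, workers=4):
    # fetch the start page and collect its absolute .htm/.html links
    # (relative links are skipped here to keep the sketch short)
    with lib.urlopen(url) as fp:
        text = fp.read().decode('utf-8')
    links = [link for link in re.findall('href="(.*?)"', text)
             if link.startswith(('http://', 'https://'))
             and link.endswith(('.htm', '.html'))]
    # a Manager list can be shared across processes, unlike a plain list
    manager = multiprocessing.Manager()
    processed = manager.list()
    # crawl each subtree in its own worker process
    with multiprocessing.Pool(workers) as pool:
        pool.starmap(craw_links,
                     [(link, depth - 1, keywords, processed)
                      for link in links])

Calling craw_links_parallel(r'https://docs.python.org/3/library/index.html', 1, keywords) from the main block instead of craw_links would then crawl the first level of links concurrently, while the shared Manager list still prevents any page from being fetched twice.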