9.3.2 Web crawler


Web crawlers are commonly used to fetch pages or files of interest from the Internet; combined with data processing and analysis techniques, they can extract deeper information from the collected content. The following code implements a web crawler that collects all links found in a specified web page, and lets you specify keywords to filter pages by as well as a crawl depth.

import re
import os
import urllib.request as lib


def craw_links(url, depth, keywords, processed):
    """
    :param url: the URL to crawl
    :param depth: crawl depth
    :param keywords: a tuple of keywords to filter pages by
    :param processed: list of URLs that have already been processed
    :return: None
    """
    if not url.startswith(('http://', 'https://')):
        return

    if url not in processed:
        # mark this URL as processed
        processed.append(url)
    else:
        # avoid processing the same URL again
        return

    print('crawling ' + url + '...')
    fp = lib.urlopen(url)  # make a request to the URL
    # Python 3 returns bytes, so the response needs to be decoded
    contents_decoded = fp.read().decode('utf-8')
    fp.close()  # the crawled page text has been read at this point

    pattern = '|'.join(keywords)

    # if this page contains one of the keywords, save it to a file
    flag = False
    searched = None
    if pattern:
        # use a regular expression to match the keywords against the page text
        searched = re.search(pattern, contents_decoded)
    else:
        # if no keywords are given, save the current page unconditionally
        flag = True

    if flag or searched:
        # build a file name from the URL and save the decoded page text
        filename = url.replace(':', '_').replace('/', '_')
        with open(os.path.join('craw', filename), 'w', encoding='utf-8') as out:
            out.write(contents_decoded)

    # find all links on the current page
    links = re.findall('href="(.*?)"', contents_decoded)

    # crawl all links on the current page
    for link in links:
        # consider relative paths: prepend everything up to the last '/' of the base URL
        if not link.startswith(('http://', 'https://')):
            try:
                index = url.rindex('/')
                link = url[0:index + 1] + link
            except ValueError:
                pass
        if depth > 0 and link.endswith(('.htm', '.html')):
            craw_links(link, depth - 1, keywords, processed)


if __name__ == '__main__':
    processed = []
    keywords = ('datetime', 'KeyWord2')
    if not os.path.exists('craw') or not os.path.isdir('craw'):
        os.mkdir('craw')
    craw_links(r'https://docs.python.org/3/library/index.html', 1, keywords, processed)
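Note that the relative-path handling above simply cuts the base URL at its last '/' and prepends it, which does not cover root-relative links such as '/about.html' or paths containing '../'. The following is a minimal sketch, not part of the original code, showing how urljoin from the standard library module urllib.parse could resolve such links more robustly; the example URLs are only illustrative.

from urllib.parse import urljoin

base = 'https://docs.python.org/3/library/index.html'
# urljoin resolves a relative link against the base URL of the page it was found on
print(urljoin(base, 'datetime.html'))            # https://docs.python.org/3/library/datetime.html
print(urljoin(base, '../tutorial/index.html'))   # https://docs.python.org/3/tutorial/index.html
print(urljoin(base, '/3/whatsnew/index.html'))   # https://docs.python.org/3/whatsnew/index.html

Inside craw_links, the manual rindex('/') branch could then be replaced by link = urljoin(url, link), which handles all three cases above in one call.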
