Python3 web crawler
1. Using Python 3 directly
A simple piece of pseudo-code
The following pseudo-code uses two classic data structures, a set and a queue. The set records the pages that have already been visited; the queue drives the breadth-first search.
Queue Q
Set S
StartPoint = "http://jecvay.com"
Q.push(StartPoint)            # classic BFS opening
S.insert(StartPoint)          # before visiting a page, mark it as visited
while (Q.empty() == false)    # BFS loop body
    T = Q.top() and pop
    for point in PageUrl(T)   # PageUrl(T) is the set of all URLs in page T; point is one element of it
        if (point not in S)
            Q.push(point)
            S.insert(point)
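The pseudo-code above translates almost line for line into Python. Here is a minimal runnable sketch using `collections.deque` as the queue; the `page_urls` argument is a hypothetical stand-in for fetching and parsing a page (a toy link graph is used below instead of the real web):

```python
from collections import deque

def bfs_crawl(start, page_urls, max_pages=100):
    """BFS crawl skeleton matching the pseudo-code above.
    page_urls(page) returns the URLs found on that page."""
    visited = {start}       # the set S: pages already seen
    queue = deque([start])  # the queue Q driving breadth-first order
    order = []
    while queue and len(order) < max_pages:
        page = queue.popleft()   # T = Q.top() and pop
        order.append(page)
        for url in page_urls(page):
            if url not in visited:  # mark before enqueueing, so each
                visited.add(url)    # page is enqueued at most once
                queue.append(url)
    return order

# Toy link graph standing in for the web:
graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": []}
print(bfs_crawl("a", lambda p: graph[p]))  # ['a', 'b', 'c', 'd']
```

Note that a URL is inserted into the visited set as soon as it is enqueued, not when it is dequeued; otherwise the same page could sit in the queue several times.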
Internally, the set here is backed by a hash table. For a large-scale crawler, a traditional hash set takes too much memory, so a data structure called a Bloom filter is a better fit to replace the hash-based set.
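A Bloom filter stores only a bit array of fixed size: adding an item sets k bit positions, and membership means all k bits are set, so it can report rare false positives but never false negatives. Below is a minimal sketch; the choice of salted MD5 digests as the k hash functions, and the sizes m and k, are illustrative assumptions, not a tuned implementation:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k hash positions per item,
    derived here from salted MD5 digests (an arbitrary illustrative choice)."""
    def __init__(self, m=1 << 20, k=4):
        self.m = m
        self.k = k
        self.bits = bytearray(m // 8)  # m bits of storage total

    def _positions(self, item):
        # k pseudo-independent positions from k salted digests
        for i in range(self.k):
            h = hashlib.md5((str(i) + item).encode("utf-8")).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("http://jecvay.com")
print("http://jecvay.com" in bf)   # True: added items are always found
print("http://example.com" in bf)  # almost certainly False (tiny false-positive chance)
```

For a crawler, the occasional false positive only means a page is wrongly skipped as "already visited", which is usually an acceptable trade for the large memory saving over storing full URLs.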
A simple Webspider implementation
from html.parser import HTMLParser
from urllib.request import urlopen
from urllib import parse

class LinkParser(HTMLParser):
    # Collect the absolute URL of every <a href=...> seen while parsing.
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (key, value) in attrs:
                if key == 'href':
                    newUrl = parse.urljoin(self.baseUrl, value)
                    self.links = self.links + [newUrl]

    def getLinks(self, url):
        self.links = []
        self.baseUrl = url
        response = urlopen(url)
        # Substring test: servers often send "text/html; charset=utf-8"
        if 'text/html' in (response.getheader('Content-Type') or ''):
            htmlBytes = response.read()
            htmlString = htmlBytes.decode("utf-8")
            self.feed(htmlString)
            return htmlString, self.links
        else:
            return "", []

def spider(url, word, maxPages):
    pagesToVisit = [url]
    numberVisited = 0
    foundWord = False
    # Visit pages breadth-first until the word is found or maxPages is reached.
    while numberVisited < maxPages and pagesToVisit != [] and not foundWord:
        numberVisited = numberVisited + 1
        url = pagesToVisit[0]
        pagesToVisit = pagesToVisit[1:]
        try:
            print(numberVisited, "Visiting:", url)
            parser = LinkParser()
            data, links = parser.getLinks(url)
            if data.find(word) > -1:
                foundWord = True
            pagesToVisit = pagesToVisit + links
            print("**Success!**")
        except Exception:
            print("**Failed!**")
    if foundWord:
        print("The word", word, "was found at", url)
    else:
        print("Word never found")
Appendix: Python assignment and module use
# Assign values directly
a, b = 0, 1
assert a == 0
assert b == 1

# Assign values from a list
(r, g, b) = ["Red", "Green", "Blue"]
assert r == "Red"
assert g == "Green"
assert b == "Blue"

# Assign values from a tuple
(x, y) = (1, 2)
assert x == 1
assert y == 2
Open the Python interpreter in the same directory as webspider.py and execute the following statements:
>>> import webspider
>>> webspider.spider("http://baike.baidu.com", 'Yangcheng', 1000)
2. Using the Scrapy framework
Installation
Environment dependencies: OpenSSL, libxml2
Installation: pip install pyopenssl lxml
$ pip install scrapy
$ cat > myspider.py <<EOF
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://blog.scrapinghub.com']

    def parse(self, response):
        for url in response.css('ul li a::attr("href")').re(r'.*/\d\d\d\d/\d\d/$'):
            yield scrapy.Request(response.urljoin(url), self.parse_titles)

    def parse_titles(self, response):
        for post_title in response.css('div.entries > ul > li a::text').extract():
            yield {'title': post_title}
EOF
$ scrapy runspider myspider.py
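The `.re(r'.*/\d\d\d\d/\d\d/$')` call above keeps only extracted hrefs that end in a /YYYY/MM/ archive path. The same filter can be demonstrated with the plain `re` module; the sample hrefs below are hypothetical stand-ins for what the CSS selector might extract:

```python
import re

# Same pattern the spider passes to .re(): keep only monthly-archive URLs.
pattern = re.compile(r'.*/\d\d\d\d/\d\d/$')

hrefs = [
    "http://blog.scrapinghub.com/2014/09/",         # ends in /YYYY/MM/: kept
    "http://blog.scrapinghub.com/about/",           # no date path: dropped
    "http://blog.scrapinghub.com/2014/09/01/post",  # extends past /MM/: dropped
]
matches = [h for h in hrefs if pattern.match(h)]
print(matches)  # ['http://blog.scrapinghub.com/2014/09/']
```

Each matching URL is then fed back as a `scrapy.Request` with `parse_titles` as its callback, so the spider crawls archive pages first and yields one `{'title': ...}` item per post found there.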
Resources:
https://jecvay.com/2014/09/python3-web-bug-series1.html
http://www.netinstructions.com/how-to-make-a-web-crawler-in-under-50-lines-of-python-code/
http://www.jb51.net/article/65260.htm
http://scrapy.org/
Https://docs.python.org/3/tutorial/modules.html