Python3 Web Crawler


1. Direct use of Python3

A simple pseudo-code

The following simple pseudo-code uses two classic data structures: a set and a queue. The set records the pages that have already been visited; the queue drives the breadth-first search (BFS).

Queue Q
Set S
StartPoint = "http://jecvay.com"
Q.push(StartPoint)           # classic BFS opening
S.insert(StartPoint)         # before visiting a page, mark it as visited
while (Q.empty() == false)   # BFS loop body
    T = Q.pop()              # take the front page off the queue
    for point in PageUrl(T)  # PageUrl(T) is the set of all URLs contained in page T
        if (point not in S)
            Q.push(point)
            S.insert(point)
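A runnable Python sketch of the same skeleton follows, assuming a hypothetical page_urls(url) helper that returns the URLs found on a page (building such a helper is the subject of the next section):

from collections import deque

def bfs_crawl(start_point, page_urls, max_pages=100):
    visited = {start_point}        # URLs already seen; each is enqueued at most once
    queue = deque([start_point])   # FIFO frontier of pages awaiting a visit
    pages_visited = 0
    while queue and pages_visited < max_pages:
        url = queue.popleft()      # take the front page off the queue
        pages_visited += 1
        for link in page_urls(url):   # page_urls(url): URLs found in that page (assumed helper)
            if link not in visited:
                visited.add(link)
                queue.append(link)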

Internally, the set is implemented with a hash table. For a large crawl, a traditional hash table takes up too much space, so a data structure called a Bloom filter is a better fit here as a replacement for the hash-based set.
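As a minimal sketch of the idea (the bit-array size, the number of hash functions, and the salted-SHA-256 hashing scheme below are illustrative assumptions, not from the original article):

import hashlib

class BloomFilter:
    """Space-efficient visited-set with a small false-positive rate."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits                    # total number of bits
        self.num_hashes = num_hashes             # k hash functions
        self.bits = bytearray(size_bits // 8)    # the bit array itself

    def _positions(self, item):
        # Derive k hash values by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Usage: a drop-in replacement for the set S in the pseudo-code above.
visited = BloomFilter()
visited.add("http://jecvay.com")
assert "http://jecvay.com" in visited

The trade-off is that a Bloom filter can return false positives (an unvisited URL may occasionally be skipped as if already seen) but never false negatives, in exchange for a fixed, small memory footprint.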

A simple web spider implementation

from html.parser import HTMLParser
from urllib.request import urlopen
from urllib import parse

class LinkParser(HTMLParser):
    # Collect the href of every <a> tag, resolved against the page's base URL.
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for (key, value) in attrs:
                if key == 'href':
                    newUrl = parse.urljoin(self.baseUrl, value)
                    self.links = self.links + [newUrl]

    def getLinks(self, url):
        self.links = []
        self.baseUrl = url
        response = urlopen(url)
        # Only parse HTML responses; the header may carry a charset suffix,
        # e.g. "text/html; charset=utf-8", so test for the substring.
        if 'text/html' in (response.getheader('Content-Type') or ''):
            htmlBytes = response.read()
            htmlString = htmlBytes.decode("utf-8")
            self.feed(htmlString)
            return htmlString, self.links
        else:
            return "", []

def spider(url, word, maxPages):
    pagesToVisit = [url]
    numberVisited = 0
    foundWord = False
    # Visit pages breadth-first until the word is found,
    # the page budget is spent, or the frontier runs dry.
    while numberVisited < maxPages and pagesToVisit != [] and not foundWord:
        numberVisited = numberVisited + 1
        url = pagesToVisit[0]
        pagesToVisit = pagesToVisit[1:]
        try:
            print(numberVisited, "Visiting:", url)
            parser = LinkParser()
            data, links = parser.getLinks(url)
            if data.find(word) > -1:
                foundWord = True
            # Append the links found on this page to the frontier.
            pagesToVisit = pagesToVisit + links
            print("**Success!**")
        except Exception:
            print("**Failed!**")
    if foundWord:
        print("The word", word, "was found at", url)
    else:
        print("Word never found")

Appendix: Python assignment and module usage

    • Assign values

# Assign values directly
a, b = 0, 1
assert a == 0
assert b == 1

# Assign values from a list
(r, g, b) = ["Red", "Green", "Blue"]
assert r == "Red"
assert g == "Green"
assert b == "Blue"

# Assign values from a tuple
(x, y) = (1, 2)
assert x == 1
assert y == 2

  

    • Use the module

With the code above saved as webspider.py, open a Python interpreter in the same directory and run the following statements:

>>> import webspider
>>> webspider.spider("http://baike.baidu.com", 'Yangcheng', 1000)

  

2. Using the Scrapy framework

Installation

Environment dependencies:

OpenSSL, libxml2

Installation: pip install pyopenssl lxml

$ pip install scrapy
$ cat > myspider.py <<EOF
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://blog.scrapinghub.com']

    def parse(self, response):
        # Follow links that look like monthly archive URLs (.../2015/04/).
        for url in response.css('ul li a::attr("href")').re(r'.*/\d\d\d\d/\d\d/$'):
            yield scrapy.Request(response.urljoin(url), self.parse_titles)

    def parse_titles(self, response):
        # Extract the text of each post title on the archive page.
        for post_title in response.css('div.entries > ul > li a::text').extract():
            yield {'title': post_title}
EOF
$ scrapy runspider myspider.py
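By default, runspider only logs the scraped items; Scrapy's -o flag writes them to a file whose format is inferred from the extension, for example:

$ scrapy runspider myspider.py -o titles.json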

  

Resources:

https://jecvay.com/2014/09/python3-web-bug-series1.html

http://www.netinstructions.com/how-to-make-a-web-crawler-in-under-50-lines-of-python-code/

http://www.jb51.net/article/65260.htm

http://scrapy.org/

https://docs.python.org/3/tutorial/modules.html
