1. Traverse a single domain name
Wikipedia links that point to entry pages (as opposed to other internal pages) have three things in common:
- They are all inside the div tag with the id bodyContent.
- The URLs do not contain colons.
- The URLs all start with /wiki/.
# -*- coding: utf-8 -*-
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html, "lxml")
for link in bsObj.find("div", {"id": "bodyContent"}).findAll(
        "a", href=re.compile("^(/wiki/)((?!:).)*$")):
    if 'href' in link.attrs:
        print(link.attrs['href'])
Run the code above and you will see links to all the other entries referenced by the Kevin Bacon article on Wikipedia.
Building a crawler that goes from one page to another is then straightforward:
# -*- coding: utf-8 -*-
import re
import datetime
import random
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Seed the random number generator with the current system time
random.seed(datetime.datetime.now())

def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "lxml")
    return bsObj.find("div", {"id": "bodyContent"}).findAll(
        "a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
while len(links) > 0:
    newArticle = links[random.randint(0, len(links) - 1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)
The program first stores the entry links found on the starting page in a list. Then, in a loop, it picks a random entry link tag from the page, extracts its href attribute, prints the page link, and passes that link to the getLinks function to retrieve a new list of links.
2. Collect the entire website
The first thing to do is deduplicate the links, so that no page is collected more than once.
We can then print out the page title, the first paragraph in the body, and the link to the edit page (if any).
# -*- coding: utf-8 -*-
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "lxml")
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing some attributes, but don't worry!")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print("----------------\n" + newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")
The program starts by calling getLinks with an empty URL (which resolves to Wikipedia's homepage). It prints the information described above, then iterates over every link on the page and checks whether it is already in the global set pages (the set of pages collected so far). If not, it prints the link, adds it to pages, and processes it recursively with getLinks.
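One caveat: Python's default recursion limit is 1,000, so on a site as large as Wikipedia this recursion will eventually overflow the stack. A minimal sketch of a depth-limited variant (the depth parameter is my own addition, not from the book):

import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

pages = set()

def getLinks(pageUrl, depth=10):
    global pages
    # Stop recursing once we are depth levels deep, well before
    # hitting Python's recursion limit (sys.getrecursionlimit())
    if depth <= 0:
        return
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "lxml")
    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs and link.attrs['href'] not in pages:
            newPage = link.attrs['href']
            pages.add(newPage)
            getLinks(newPage, depth - 1)

getLinks("")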
3. Collect across the Internet
# -*- coding: utf-8 -*-
import re
import datetime
import random
from urllib.request import urlopen
from bs4 import BeautifulSoup

pages = set()
random.seed(datetime.datetime.now())

# Get a list of all the internal links found on a page
def getInternalLinks(bsObj, includeUrl):
    internalLinks = []
    # Find all links that begin with "/" or contain the current URL
    for link in bsObj.findAll("a", href=re.compile("^(/|.*" + includeUrl + ")")):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internalLinks:
                internalLinks.append(link.attrs['href'])
    return internalLinks

# Get a list of all the external links found on a page
def getExternalLinks(bsObj, excludeUrl):
    externalLinks = []
    # Find all links that start with "http" or "www" and do not
    # contain the current URL
    for link in bsObj.findAll("a",
            href=re.compile("^(http|www)((?!" + excludeUrl + ").)*$")):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks:
                externalLinks.append(link.attrs['href'])
    return externalLinks

def splitAddress(address):
    addressParts = address.replace("http://", "").split("/")
    return addressParts

def getRandomExternalLink(startingPage):
    html = urlopen(startingPage)
    bsObj = BeautifulSoup(html, "lxml")
    externalLinks = getExternalLinks(bsObj, splitAddress(startingPage)[0])
    if len(externalLinks) == 0:
        internalLinks = getInternalLinks(bsObj, splitAddress(startingPage)[0])
        return getRandomExternalLink(
            internalLinks[random.randint(0, len(internalLinks) - 1)])
    else:
        return externalLinks[random.randint(0, len(externalLinks) - 1)]

def followExternalOnly(startingSite):
    externalLink = getRandomExternalLink(startingSite)
    print("Random external link is: " + externalLink)
    followExternalOnly(externalLink)

followExternalOnly("http://oreilly.com")
The program above starts at http://oreilly.com and then randomly hops from one external link to the next.
There is no guarantee that an external link will be found on a site's homepage. To find one, the program recursively drills down into the site until it encounters an external link. If the crawler wanders into a site that has no external links at all, it will keep crawling within that site until the recursion reaches Python's limit.
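One way around that limit is to make the site-to-site hop iterative rather than recursive; a minimal sketch that reuses getRandomExternalLink from the program above (the maxHops cap is my own assumption, not from the book):

def followExternalOnly(startingSite, maxHops=100):
    # Iterative variant: hop from external link to external link
    # without growing the call stack, so the walk itself can never
    # exceed Python's recursion limit
    site = startingSite
    for _ in range(maxHops):
        site = getRandomExternalLink(site)
        print("Random external link is: " + site)

followExternalOnly("http://oreilly.com")

(getRandomExternalLink can still recurse internally while it searches a single site for an external link, but the walk across sites no longer deepens the stack.)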
If the goal is to collect all the external links of a website and record each one, you can add the following function:
allExtLinks = set()
allIntLinks = set()

def getAllExternalLinks(siteUrl):
    html = urlopen(siteUrl)
    bsObj = BeautifulSoup(html, "lxml")
    internalLinks = getInternalLinks(bsObj, splitAddress(siteUrl)[0])
    externalLinks = getExternalLinks(bsObj, splitAddress(siteUrl)[0])
    for link in externalLinks:
        if link not in allExtLinks:
            allExtLinks.add(link)
            print(link)
    for link in internalLinks:
        if link not in allIntLinks:
            print("The URL of the link about to be collected is: " + link)
            allIntLinks.add(link)
            getAllExternalLinks(link)

getAllExternalLinks("http://oreilly.com")
4. Collect with Scrapy
Create a Scrapy project; this creates a new project folder named wikiSpider in the current directory:
scrapy startproject wikiSpider
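For reference, the generated project layout should look roughly like this (the exact files may vary with the Scrapy version):

wikiSpider/
    scrapy.cfg
    wikiSpider/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py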
In the items.py file, define an Article class.
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:

from scrapy import Item, Field

class Article(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()
Add an articleSpider.py file to the wikiSpider/wikiSpider/spiders/ folder.
from scrapy.selector import Selector
from scrapy import Spider
from wikiSpider.items import Article

class ArticleSpider(Spider):
    name = "article"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Main_Page",
                  "http://en.wikipedia.org/wiki/Python_%28programming_language%29"]

    def parse(self, response):
        item = Article()
        title = response.xpath('//h1/text()')[0].extract()
        print("Title is: " + title)
        item['title'] = title
        return item
Run ArticleSpider from the wikiSpider home directory with the following command:
scrapy crawl article
Among the debug messages, these two lines of output should appear:
Title is: Main Page
Title is: Python (programming language)
* The log display level can be set in the settings.py file of the Scrapy project:
LOG_LEVEL = 'ERROR'
Scrapy logs have five levels; in descending order of severity (each successive level captures more messages) they are: CRITICAL, ERROR, WARNING, INFO, DEBUG. With the level set to ERROR, only CRITICAL and ERROR messages are shown.
You can also write the log output to a separate file:
scrapy crawl article -s LOG_FILE=wiki.log
Title is: Main Page
Title is: Python (programming language)
Scrapy supports saving the scraped information in different output formats, as shown in the following commands:
scrapy crawl article -o articles.csv -t csv
scrapy crawl article -o articles.json -t json
scrapy crawl article -o articles.xml -t xml
You can also handle the Item objects yourself and write the results to whatever file or database you need, simply by adding the appropriate code to the parse method of the spider.
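As a minimal sketch of that idea (the wiki_titles.txt file name is a hypothetical choice, not from the book), the parse method of ArticleSpider above could append each title to a local file in addition to returning the item:

    def parse(self, response):
        item = Article()
        title = response.xpath('//h1/text()')[0].extract()
        item['title'] = title
        # Hypothetical example: append each scraped title to a local file
        with open("wiki_titles.txt", "a", encoding="utf-8") as f:
            f.write(title + "\n")
        return item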
"Python Network data Acquisition" Reading notes (iv)