"Python Network data Acquisition" Reading notes (iv)


1. Traverse a single domain name

Wikipedia links that point to article pages (as opposed to other internal pages) have three things in common:

• They all sit inside the div whose id is "bodyContent".

• The URLs do not contain colons.

• The URLs all begin with /wiki/.

# -*- coding: utf-8 -*-
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")
bsObj = BeautifulSoup(html, "lxml")
for link in bsObj.find("div", {"id": "bodyContent"}).findAll(
        "a", href=re.compile("^(/wiki/)((?!:).)*$")):
    if 'href' in link.attrs:
        print(link.attrs['href'])

Run the code above and you will see links to all of the other articles that the Kevin Bacon entry on Wikipedia points to.


Next, a simple crawler that moves from one page to another:

# -*- coding: utf-8 -*-
import re
import datetime
import random
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Seed the random-number generator with the current system time
random.seed(datetime.datetime.now())

def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "lxml")
    return bsObj.find("div", {"id": "bodyContent"}).findAll(
        "a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
while len(links) > 0:
    newArticle = links[random.randint(0, len(links) - 1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)

The program first sets the list of article links found on the start page as the links list. Then, in a loop, it picks a random link tag from that list, extracts its href attribute, prints the link, and passes it to getLinks to fetch a new list of links.


2. Collect the entire website

The first thing to do is deduplicate the links, so that the same page is not collected more than once.

We can then print out the page title, the first paragraph in the body, and the link to the edit page (if any).

# -*- coding: utf-8 -*-
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    bsObj = BeautifulSoup(html, "lxml")
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        print("The page is missing some attributes, but don't worry!")

    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print("----------------\n" + newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")

The program above starts by calling getLinks with an empty URL (which is actually Wikipedia's home page). It prints the information that needs to be output, then iterates over every link on the page and checks whether it is already in the global set pages (the set of pages that have already been collected). If not, it prints the link, adds it to pages, and processes it recursively with getLinks.


3. Crawling across the Internet

# -*- coding: utf-8 -*-
import re
import datetime
import random
from urllib.request import urlopen
from bs4 import BeautifulSoup

pages = set()
random.seed(datetime.datetime.now())

# Get a list of all the internal links found on a page
def getInternalLinks(bsObj, includeUrl):
    internalLinks = []
    # Find all links that begin with a "/" or contain the current URL
    for link in bsObj.findAll("a", href=re.compile("^(/|.*" + includeUrl + ")")):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internalLinks:
                internalLinks.append(link.attrs['href'])
    return internalLinks

# Get a list of all the external links found on a page
def getExternalLinks(bsObj, excludeUrl):
    externalLinks = []
    # Find all links that start with "http" or "www" and do not contain the current URL
    for link in bsObj.findAll("a",
            href=re.compile("^(http|www)((?!" + excludeUrl + ").)*$")):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks:
                externalLinks.append(link.attrs['href'])
    return externalLinks

def splitAddress(address):
    addressParts = address.replace("http://", "").split("/")
    return addressParts

def getRandomExternalLink(startingPage):
    html = urlopen(startingPage)
    bsObj = BeautifulSoup(html, "lxml")
    domain = splitAddress(startingPage)[0]
    externalLinks = getExternalLinks(bsObj, domain)
    if len(externalLinks) == 0:
        # No external links here: follow a random internal link and try again
        internalLinks = getInternalLinks(bsObj, domain)
        nextPage = internalLinks[random.randint(0, len(internalLinks) - 1)]
        if nextPage.startswith("/"):
            nextPage = "http://" + domain + nextPage
        return getRandomExternalLink(nextPage)
    else:
        return externalLinks[random.randint(0, len(externalLinks) - 1)]

def followExternalOnly(startingSite):
    externalLink = getRandomExternalLink(startingSite)
    print("Random external link is: " + externalLink)
    followExternalOnly(externalLink)

followExternalOnly("http://oreilly.com")

The program above starts at http://oreilly.com and then randomly hops from one external link to the next.

There is no guarantee that an external link will be found on a site's home page. To find one, the crawler may need to recurse deeper into the site until it hits an external link. And if it lands on a site with no external links at all, the program will keep wandering inside that site, unable to jump out, until it reaches Python's recursion limit.
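One simple safeguard (a minimal sketch; the depth and maxDepth parameters are my own addition, not part of the book's code) is to thread an explicit depth counter through followExternalOnly so the walk gives up well before Python's default recursion limit of roughly 1000 stack frames:

def followExternalOnly(startingSite, depth=0, maxDepth=50):
    # Stop well before Python's default recursion limit is reached
    if depth >= maxDepth:
        print("Reached maximum crawl depth, stopping.")
        return
    externalLink = getRandomExternalLink(startingSite)
    print("Random external link is: " + externalLink)
    followExternalOnly(externalLink, depth + 1, maxDepth)

Calling followExternalOnly("http://oreilly.com") then behaves as before, but stops cleanly after maxDepth hops.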


If the goal is to collect all of a site's external links and record each one, you can add the following function:

allExtLinks = set()
allIntLinks = set()

def getAllExternalLinks(siteUrl):
    html = urlopen(siteUrl)
    bsObj = BeautifulSoup(html, 'lxml')
    internalLinks = getInternalLinks(bsObj, splitAddress(siteUrl)[0])
    externalLinks = getExternalLinks(bsObj, splitAddress(siteUrl)[0])
    for link in externalLinks:
        if link not in allExtLinks:
            allExtLinks.add(link)
            print(link)
    for link in internalLinks:
        if link not in allIntLinks:
            allIntLinks.add(link)
            # Turn relative internal links into full URLs before recursing
            if link.startswith("/"):
                link = "http://" + splitAddress(siteUrl)[0] + link
            print("The URL about to be collected is: " + link)
            getAllExternalLinks(link)

getAllExternalLinks("http://oreilly.com")


4. Collect with Scrapy

Create a Scrapy project; this creates a new project folder named wikiSpider in the current directory:

scrapy startproject wikiSpider
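For reference, the generated project has roughly the following layout (newer Scrapy versions also add a middlewares.py file):

wikiSpider/
    scrapy.cfg
    wikiSpider/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py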


In the items.py file, define an Article class.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:

from scrapy import Item, Field

class Article(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()


Add an articleSpider.py file to the wikiSpider/wikiSpider/spiders/ folder.

from scrapy.selector import Selector
from scrapy import Spider
from wikiSpider.items import Article

class ArticleSpider(Spider):
    name = "article"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Main_Page",
                  "http://en.wikipedia.org/wiki/Python_%28programming_language%29"]

    def parse(self, response):
        item = Article()
        title = response.xpath('//h1/text()')[0].extract()
        print("Title is: " + title)
        item['title'] = title
        return item


Run ArticleSpider from the wikiSpider home directory with the following command:

scrapy crawl article

Mixed in with the debugging information, these two lines of output should appear:

Title is: Main Page
Title is: Python (programming language)


The logging level can be set in the settings.py file of the Scrapy project:

LOG_LEVEL = 'ERROR'

Scrapy has five logging levels, listed here in order of increasing scope (decreasing severity): CRITICAL, ERROR, WARNING, INFO, DEBUG. Setting LOG_LEVEL to a given level logs messages at that level and above.
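As an illustration (a sketch, not from the book): recent Scrapy versions expose a standard Python logger on every spider as self.logger, and LOG_LEVEL filters those messages. With LOG_LEVEL = 'ERROR', only the last call below would reach the log:

from scrapy import Spider

class LoggingDemoSpider(Spider):
    # Hypothetical spider used only to illustrate the logging levels
    name = "logging_demo"
    start_urls = ["http://en.wikipedia.org/wiki/Main_Page"]

    def parse(self, response):
        self.logger.debug("Fetched %s", response.url)           # hidden when LOG_LEVEL = 'ERROR'
        self.logger.info("Parsing %s", response.url)            # hidden as well
        self.logger.error("Example error for %s", response.url) # still written to the log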

You can also send (append) the log output to a separate file:

scrapy crawl article -s LOG_FILE=wiki.log
Title is: Main Page
Title is: Python (programming language)


Scrapy supports saving the scraped information in different output formats, as shown in the following commands:

scrapy crawl article -o articles.csv -t csv
scrapy crawl article -o articles.json -t json
scrapy crawl article -o articles.xml -t xml

You can also write the results to any file or database you need by customizing how the Item objects are handled, either by adding the appropriate code to the spider's parse method or by using an item pipeline.
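For example, one way to do the latter (a minimal sketch; the ArticleCsvPipeline class and the articles_out.csv filename are hypothetical, not from the book) is an item pipeline in pipelines.py that writes each scraped title to a CSV file:

# wikiSpider/pipelines.py (hypothetical sketch)
import csv

class ArticleCsvPipeline(object):
    def open_spider(self, spider):
        # Open the output file once when the spider starts
        self.file = open("articles_out.csv", "w", newline="")
        self.writer = csv.writer(self.file)
        self.writer.writerow(["title"])

    def process_item(self, item, spider):
        # Write one row per scraped Article item
        self.writer.writerow([item["title"]])
        return item

    def close_spider(self, spider):
        self.file.close()

To enable it, register the pipeline in settings.py, for example: ITEM_PIPELINES = {'wikiSpider.pipelines.ArticleCsvPipeline': 300}.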


"Python Network data Acquisition" Reading notes (iv)

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.