Python is a powerful, general-purpose, object-oriented programming language whose features make developers' work much easier. Here, let's take a look at how to write a simple web crawler in Python.
Today I came across a web page that was troublesome to read online, because I access the internet at home over a dial-up telephone line. So I wrote a simple program to download the pages for offline reading and save on the phone bill. :) The program has only one level of structure, because the pages linked from the home page all sit in the same directory, so a few hard-coded links are used in the parsing.
The Python web crawler code is as follows:
```python
#!/usr/bin/env python
# -*- coding: gbk -*-
import urllib
from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        # Collect the href attribute of every <a> tag
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)

url = r'http://www.sinc.sunysb.edu/Clubs/buddhism/JinGangJingShuoShenMo/'
sock = urllib.urlopen(url)
htmlSource = sock.read()
sock.close()
# print htmlSource

# Save the index page locally
f = file('jingangjing.html', 'w')
f.write(htmlSource)
f.close()

mypath = r'http://www.sinc.sunysb.edu/Clubs/buddhism/JinGangJingShuoShenMo/'
parser = URLLister()
parser.feed(htmlSource)
for url in parser.urls:
    myurl = mypath + url
    print "get: " + myurl
    sock2 = urllib.urlopen(myurl)
    html2 = sock2.read()
    sock2.close()
    # Save to file
    print "save as: " + url
    f2 = file(url, 'w')
    f2.write(html2)
    f2.close()
```
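The script above is Python 2 code: `urllib.urlopen`, `sgmllib`, `file()`, and the `print` statement no longer exist in Python 3. As a rough equivalent for Python 3, here is a minimal sketch of the same approach using `urllib.request` and the standard-library `html.parser` in place of `sgmllib`. The URL and output filenames follow the original script, and decoding the page as GBK (matching the original coding declaration) is an assumption.

```python
#!/usr/bin/env python3
import os
import urllib.request
from html.parser import HTMLParser

class URLLister(HTMLParser):
    """Collect the href attribute of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.urls.extend(v for k, v in attrs if k == 'href')

base = 'http://www.sinc.sunysb.edu/Clubs/buddhism/JinGangJingShuoShenMo/'

# Download the index page and save it locally
with urllib.request.urlopen(base) as sock:
    html_source = sock.read()
with open('jingangjing.html', 'wb') as f:
    f.write(html_source)

# Parse the index page and fetch every linked page in the same directory
parser = URLLister()
parser.feed(html_source.decode('gbk', errors='replace'))  # encoding is an assumption
for url in parser.urls:
    full_url = base + url
    print("get: " + full_url)
    with urllib.request.urlopen(full_url) as sock2:
        page = sock2.read()
    print("save as: " + url)
    # basename() keeps the output in the current directory even for relative links
    with open(os.path.basename(url), 'wb') as f2:
        f2.write(page)
```

The structure is the same as the original: one pass over the index page to collect links, then one request per link, with each response written straight to disk.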
The above describes how to implement a simple web crawler in Python.