My First Web Crawler in Python
I recently wanted to get started with Python, and my usual way to learn a new language is to write a small demo. For Python, the obvious demo is a web crawler. This first crawler is quite simple, so please go easy on it.
The crawler is divided into three parts: fetching a page for each URL taken from the queue, extracting content from the page, and saving the results.
The program uses Baidu's website directory (site.baidu.com) as the seed URL. The URLs found in each fetched page are appended to the queue in order, and the crawler keeps taking new URLs from the queue to crawl outward.
# -*- coding: utf-8 -*-
import urllib2
import re
import time

class HTML_Spider:
    def __init__(self):
        self.url = []

    # Fetch the page for a URL from the queue and collect new URLs from it
    def getPage(self, url):
        myPage = ''
        try:
            myResponse = urllib2.urlopen(url)
            myPage = myResponse.read()
            myUrl = re.findall('href="(.*?)"', myPage, re.S)
            self.url.extend(myUrl)
        except:
            print u'The current URL is invalid'
        return myPage

    # Save the page content as an HTML file
    def savePage(self, page):
        if page != '':
            # Name the file with a timestamp so each page gets a unique name
            f = open(time.strftime('%Y%m%d%H%M%S', time.localtime(time.time())) + '.html', 'w+')
            f.write(page)
            f.close()

    # Maintain the URL queue and crawl outward from the seed
    def startSpider(self):
        i = 0
        while 1:
            if i == 0:
                url = u'http://site.baidu.com/'
            else:
                if i > len(self.url):
                    break  # queue exhausted
                url = self.url[i - 1]
            i += 1
            print url
            page = self.getPage(url)
            self.savePage(page)

# Program entry point
print u'Press Enter to start crawling:'
raw_input("")
mySpider = HTML_Spider()
mySpider.startSpider()
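The code above targets Python 2 (`urllib2`, `print` statements, `raw_input`). Below is a minimal Python 3 sketch of the same idea using `urllib.request`; the class and method names are my own, and like the original it relies on a naive `href` regex rather than a real HTML parser, and does not deduplicate URLs. A `limit` parameter is added so the loop terminates.

```python
# -*- coding: utf-8 -*-
import re
import time
import urllib.request


class HtmlSpider:
    def __init__(self, seed_url):
        # URL queue, seeded with the start page
        self.urls = [seed_url]

    def extract_urls(self, page):
        # Pull every href attribute value out of the page source (naive regex)
        return re.findall(r'href="(.*?)"', page, re.S)

    def get_page(self, url):
        # Fetch the page; on any error, report it and return an empty string
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                return response.read().decode('utf-8', errors='replace')
        except Exception:
            print('The current URL is invalid:', url)
            return ''

    def save_page(self, page):
        if page:
            # Name the file with a timestamp so each page gets a unique name
            with open(str(time.time()) + '.html', 'w', encoding='utf-8') as f:
                f.write(page)

    def start(self, limit=10):
        # Crawl at most `limit` pages so the loop terminates
        i = 0
        while i < len(self.urls) and i < limit:
            url = self.urls[i]
            i += 1
            print(url)
            page = self.get_page(url)
            self.urls.extend(self.extract_urls(page))
            self.save_page(page)


if __name__ == '__main__':
    input('Press Enter to start crawling: ')
    HtmlSpider('http://site.baidu.com/').start(limit=5)
```

A production crawler would instead use an HTML parser (e.g. the standard library's `html.parser`) to resolve relative links, keep a visited set to avoid re-fetching pages, and respect robots.txt.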