A First Web Crawler in Python

Source: Internet
Author: User

I recently decided to start learning Python. A good way to get started with a language is to write a demo, and for Python the obvious demo is a web crawler. This first small crawler is quite simple, so please go easy on it.

The crawler has three parts: fetching a page based on a URL in the queue, obtaining its content, and saving the result.
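The three parts can be sketched as three small functions. This is only an illustration, written against the Python 3 standard library (urllib.request rather than urllib2); the names fetch_page, extract_links, and save_page are hypothetical, and the filtering of non-HTTP links is an added assumption.

```python
import http.client
import re
import time
import urllib.request
from pathlib import Path
from urllib.parse import urljoin

# Same idea as a simple href regex: capture whatever sits inside href="...".
HREF_RE = re.compile(r'href="(.*?)"', re.S)


def fetch_page(url, timeout=10):
    """Part 1: download a page and return its text, or '' if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError, http.client.HTTPException):
        # URLError and socket timeouts are OSError subclasses;
        # ValueError is raised for URLs without a usable scheme.
        return ""


def extract_links(base_url, html):
    """Part 2: pull href values out of the content as absolute http(s) URLs."""
    links = []
    for href in HREF_RE.findall(html):
        absolute = urljoin(base_url, href)  # resolves relative links
        if absolute.startswith(("http://", "https://")):
            links.append(absolute)
    return links


def save_page(html, directory="."):
    """Part 3: write non-empty HTML to a timestamp-named file; return its path."""
    if not html:
        return None
    path = Path(directory) / f"{time.time()}.html"
    path.write_text(html, encoding="utf-8")
    return path
```

Keeping the three parts separate means the link extraction and saving steps can be tried on a hand-written HTML string without touching the network.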

The program uses the Baidu site directory (http://site.baidu.com/) as the seed URL. URLs found in each fetched page are appended to the queue in order, and the crawler takes new URLs from the queue to keep crawling outward.
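The queue behaviour described above can be sketched on its own, separately from any networking. The version below is a minimal Python 3 illustration, not the article's code: the crawl function, its fetch parameter, the max_pages limit, the seen set that skips duplicates, and the FAKE_WEB pages under example.com are all assumptions introduced so the sketch runs offline.

```python
import re
from collections import deque
from urllib.parse import urljoin

HREF_RE = re.compile(r'href="(.*?)"', re.S)


def crawl(seed_url, fetch, max_pages=10):
    """Breadth-first crawl from seed_url; returns URLs in visiting order.

    `fetch` is any callable mapping a URL to its HTML text ('' on failure).
    """
    queue = deque([seed_url])   # URLs waiting to be fetched, oldest first
    seen = {seed_url}           # every URL ever queued, to avoid repeats
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        html = fetch(url)
        for href in HREF_RE.findall(html):
            link = urljoin(url, href)
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical in-memory "web" standing in for real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "http://example.com/b": '<a href="/">home</a>',
}

order = crawl("http://example.com/", lambda u: FAKE_WEB.get(u, ""))
print(order)
```

Passing the fetch step in as a callable is a design choice for the sketch: a real downloader can be swapped in later, while the page limit keeps an outward crawl from running forever.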

# -*- coding: UTF-8 -*-
# Python 2 code: uses urllib2, the print statement, and raw_input.
import urllib2
import re
import time


class HTML_Spider:
    def __init__(self):
        self.url = []  # URL queue

    # Fetch the page for a URL taken from the queue
    def GetPage(self, url):
        try:
            myResponce = urllib2.urlopen(url)
            myPage = myResponce.read()
            # Collect every href value found in the page
            myUrl = re.findall('href="(.*?)"', myPage, re.S)
            self.url.extend(myUrl)
        except Exception:
            print u'The current URL is invalid'
            myPage = ''
        return myPage

    # Save the page in HTML format
    def SavePage(self, page):
        if page != '':
            # Name the file after the current timestamp
            f = open(str(time.time()) + '.html', 'w+')
            f.write(page)
            f.close()

    # Keep working through the URL queue
    def StartSpider(self):
        url = u'http://site.baidu.com/'  # seed URL
        i = 0
        while True:
            print url
            page = self.GetPage(url)
            self.SavePage(page)
            if i >= len(self.url):
                break  # queue exhausted
            url = self.url[i]
            i += 1


# Program entry point
print u'Press Enter to start crawling:'
raw_input(" ")
mySpider = HTML_Spider()
mySpider.StartSpider()
