Write one of the simplest web crawlers in Python

Source: Internet
Author: User

What is a web crawler? Baidu Encyclopedia explains it as follows:

A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Other, less common names include ant, automatic indexer, emulator, and worm.

What can a crawler do? A crawler can pull the specific data we need out of the vast Internet, and that can be any data we want to get.

Crawlers are a hot topic because writing one makes you feel you are doing something really impressive. Each time you write a crawler you try to make it more impressive than the last, and a more impressive crawler lets you do more impressive things, which pushes you to keep going. That is the charm of crawlers.

After I learned Python, the first program I wrote in it was a crawler: one of the simplest possible, which downloads all the images on a web page.

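# Python 2 code: in Python 3, urlopen and urlretrieve live in urllib.request.
import urllib
import re

# Download the page source
html = urllib.urlopen('http://tieba.baidu.com/p/2460150866').read()
# Capture the address of every .jpg image that carries a pic_ext attribute
reg = r'src="(.+?\.jpg)" pic_ext'
img_re = re.compile(reg)
img_list = re.findall(img_re, html)
x = 0
for img_url in img_list:
    # Save each image as 0.jpg, 1.jpg, ... (the target directory must already exist)
    urllib.urlretrieve(img_url, "d:\\python\\picture\\%s.jpg" % x)
    x += 1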

This crawler is only 10 lines of code, yet it quickly downloads all the images on http://tieba.baidu.com/p/2460150866. It can only download the images from a single page, though. What I want is a crawler that downloads the images from an entire site: one where I supply a single variable, the URL, and it fetches every image on that site. Having the idea is the hard part; carrying it out is just work, so the crawlers I write next will keep improving on this one.

If you look into web pages, you will find there are many kinds. Some are static and some are dynamic; crawling static pages is easy, while crawling dynamic pages is somewhat harder. Different pages also have different structures and so need different crawling methods. To overcome these difficulties, you have to keep making your crawler more powerful.
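As a rough illustration of the "give it one URL, get the whole site's pictures" idea, here is a minimal sketch in Python 3 (the article's own code is Python 2). It is not the author's implementation: the names crawl_images, fetch and max_pages are made up for this example, the image pattern is a looser variant of the article's regular expression (it does not require pic_ext), and a regular expression is only a crude way to find links. The sketch collects image addresses with a breadth-first walk over pages on the same host; saving each file could then be done with urllib.request.urlretrieve.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

# Assumed patterns for this sketch, not the article's exact regular expression.
IMG_RE = re.compile(r'src="([^"]+\.jpg)"')
LINK_RE = re.compile(r'href="([^"]+)"')


def fetch(url):
    """Download a page and decode it to text (needs network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl_images(start_url, fetch_page=fetch, max_pages=20):
    """Walk one site breadth-first and return the image URLs found."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}          # pages already queued, to avoid loops
    images = []
    visited = 0
    while queue and visited < max_pages:
        page_url = queue.popleft()
        html = fetch_page(page_url)
        visited += 1
        for src in IMG_RE.findall(html):
            img = urljoin(page_url, src)   # resolve relative addresses
            if img not in images:
                images.append(img)
        for href in LINK_RE.findall(html):
            link = urljoin(page_url, href)
            # Stay on the starting host and skip pages already queued.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return images


# Offline demonstration with a made-up two-page site instead of real downloads.
fake_site = {
    "http://example.com/": '<a href="/a">A</a><img src="/img/1.jpg">',
    "http://example.com/a": '<a href="/">home</a>'
                            '<a href="http://other.example/x">x</a>'
                            '<img src="2.jpg">',
}
found = crawl_images("http://example.com/", fetch_page=lambda u: fake_site[u])
print(found)
# ['http://example.com/img/1.jpg', 'http://example.com/2.jpg']
```

Passing the page-fetching function in as a parameter keeps the crawl logic testable without touching the network, and max_pages puts a ceiling on how far the walk can go. A crawler like this only sees links and images present in the downloaded HTML, which is why dynamic pages are harder.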

If I wrote the same program today, I would structure it differently, around functions: apart from global variables, all the code lives inside functions or classes. This keeps the code tidy and less error-prone, and when something goes wrong, the faulty spot is quick to find.

# Python 2 code: in Python 3, urlopen and urlretrieve live in urllib.request.
import urllib
import re

url = 'http://tieba.baidu.com/p/2460150866'

def gethtml(url):
    # Download the page source
    html = urllib.urlopen(url).read()
    return html

def getImage(html):
    # Return the address of every .jpg image that carries a pic_ext attribute
    reg = r'src="(.+?\.jpg)" pic_ext'
    img_re = re.compile(reg)
    img_list = re.findall(img_re, html)
    return img_list

def download(img_list):
    # Save the images as 0.jpg, 1.jpg, ...
    x = 0
    for img_url in img_list:
        urllib.urlretrieve(img_url, "D:\\python\\picture\\%s.jpg" % x)
        x += 1

def main():
    html = gethtml(url)
    img_list = getImage(html)
    download(img_list)

if __name__ == '__main__':
    main()

The line if __name__ == '__main__': means that main() is called only when this file is run directly as a program; when the file is imported by another program, main() is not called automatically.
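The effect of that check can be seen with a small Python 3 experiment. This is only a sketch: the file name crawler_demo.py and the calls list are invented for the demonstration, and running the file under a different name with runpy is used here as a stand-in for a real import.

```python
import os
import runpy
import tempfile

# A tiny stand-in for the crawler file: main() just records that it ran.
MODULE_SOURCE = '''
calls = []

def main():
    calls.append("main ran")

if __name__ == '__main__':
    main()
'''

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "crawler_demo.py")
    with open(path, "w") as f:
        f.write(MODULE_SOURCE)
    # Run as a program: __name__ is '__main__', so main() is called.
    as_script = runpy.run_path(path, run_name="__main__")
    # Run under its module name, as on import: the check fails, main() is skipped.
    as_import = runpy.run_path(path, run_name="crawler_demo")

print(as_script["calls"])  # ['main ran']
print(as_import["calls"])  # []
```

In both runs the functions are defined, so another program that imports the file can still call gethtml, getImage or download itself; only the automatic call to main() is skipped.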
