What is a web crawler? Here is the explanation from Baidu Encyclopedia:
A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically crawls World Wide Web information according to certain rules. Other, less frequently used names include ant, auto-indexer, emulator, and worm.
What can a crawler do? Crawlers can help us fetch the specific data we need from the vast Internet, and that can be any data we want to get.
Crawlers are a hot topic. When you write a crawler, you feel you are doing something impressive, and every time you write one you keep trying to build something more impressive on top of it. With a more capable crawler you can do more capable things, and so you keep moving forward; that is the charm of crawlers.
After learning Python, the first program I wrote in Python was a crawler, one of the simplest possible: it crawls all the images in a web page.
import urllib
import re

html = urllib.urlopen('http://tieba.baidu.com/p/2460150866').read()
reg = r'src="(.+?\.jpg)" pic_ext'   # match .jpg URLs followed by pic_ext
img_re = re.compile(reg)
img_list = re.findall(img_re, html)
x = 0
for img_url in img_list:
    urllib.urlretrieve(img_url, "d:\\python\\picture\\%s.jpg" % x)
    x += 1   # without this, every image would overwrite 0.jpg
Results after the run:
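A side note: the code above is Python 2 (urllib.urlopen no longer exists in Python 3). A rough Python 3 equivalent of the fetch-and-match steps might look like the sketch below; the UTF-8 decoding choice is an assumption, since the original post does not address encodings.

```python
import re
import urllib.request

def fetch_html(url):
    # Python 3 moved urlopen into urllib.request; read() returns bytes
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode('utf-8', errors='ignore')

def find_jpg_urls(html):
    # same regex idea as above: capture .jpg URLs out of src attributes
    return re.findall(r'src="(.+?\.jpg)" pic_ext', html)
```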
This crawler is simple, only about ten lines of code, yet it can quickly download all the images in http://tieba.baidu.com/p/2460150866. Still, it can only download the images of a single web page. I want to write a crawler that can download the images of an entire site, or better, one where I enter just one variable, the URL, and it downloads every image on that site. Don't worry that it can't be done, only that you can't think of it; writing crawlers means continually improving on this basis. If you have studied web pages, you will know there are many kinds: static pages are easy to crawl, dynamic pages are harder, and pages with different structures require different crawling methods. To overcome these difficulties, you have to make your crawler more powerful.
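To make the jump from one page to a whole site, one common approach is a breadth-first crawl: keep a queue of pages to visit, pull image URLs and same-site links out of each page, and skip pages you have already seen. Below is a minimal sketch of that idea; the link pattern, the page limit, and the injected fetch function are all assumptions on my part, not from the original post.

```python
import re
from collections import deque

IMG_RE = re.compile(r'src="(.+?\.jpg)"')                           # image URLs, as above
LINK_RE = re.compile(r'href="(http://tieba\.baidu\.com/p/\d+)"')   # assumed same-site link pattern

def crawl_site(start_url, fetch, max_pages=50):
    """Breadth-first crawl; fetch(url) must return a page's HTML as a str."""
    seen = {start_url}
    queue = deque([start_url])
    images = []
    fetched = 0
    while queue and fetched < max_pages:
        html = fetch(queue.popleft())
        fetched += 1
        images.extend(IMG_RE.findall(html))
        for link in LINK_RE.findall(html):   # follow unseen same-site links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return images
```

Passing fetch in as a parameter keeps the traversal testable without touching the network; in practice it would wrap urllib.urlopen(url).read().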
If I were writing the same program now, I would write it in another form, structured into functions: all the code except the global variables goes into functions or classes. That keeps the code tidy and less error-prone, and when the program does fail, you can quickly find where it went wrong.
import urllib
import re

url = 'http://tieba.baidu.com/p/2460150866'

def gethtml(url):
    html = urllib.urlopen(url).read()
    return html

def getImage(html):
    reg = r'src="(.+?\.jpg)" pic_ext'
    img_re = re.compile(reg)
    img_list = re.findall(img_re, html)
    return img_list

def download(img_list):
    x = 0
    for img_url in img_list:
        urllib.urlretrieve(img_url, "d:\\python\\picture\\%s.jpg" % x)
        x += 1

def main():
    html = gethtml(url)
    img_list = getImage(html)
    download(img_list)

if __name__ == '__main__':
    main()
if __name__ == '__main__': means that main() runs only when this file is executed directly as a program; when the file is imported into another program, main() will not be executed.
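A tiny illustration of that guard (the file name demo.py is hypothetical):

```python
# demo.py (hypothetical): the guard separates "run directly" from "imported"
def main():
    return "running as a script"

if __name__ == '__main__':
    # __name__ is '__main__' only when this file is executed directly;
    # after `import demo`, __name__ is 'demo', so nothing runs on import.
    print(main())
```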