Python crawler toy story (1): urllib

Created Monday, 17 March 2014

What is a crawler?


Wikipedia's explanation: http://en.wikipedia.org/wiki/Web_spider
Going by the Wikipedia entry, a crawler is, first of all, a program (obvious, I know...). What can this program do? It can capture web pages and save their data. Capturing a page off the Internet has no great technical depth in itself; the key difficulty is analyzing the page and extracting the data you want, like the idiom about taking a single ladle from three thousand rivers of weak water: of everything on the page, you keep only the part you need. One of the core technologies behind search engines such as Baidu and Google is a very good crawler.
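To make that fetch-versus-extract distinction concrete, here is a minimal sketch of my own (not from the original post; the URL and the regular expression are just examples) that fetches a page and pulls out only its title:

# A minimal fetch-and-extract sketch (illustrative only).
import re
import urllib

html = urllib.urlopen("http://www.csdn.net").read()    # fetching is the easy part
match = re.search(r"<title>(.*?)</title>", html, re.S) # extracting is where the real work is
if match:
    print match.group(1)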

How do crawlers work?

For theory like this I can only lean on the blogs of the experts. Here is a recommended read:
Basic Principles of Web Crawlers

Let's get started.

The following code uses Python to implement a simple web crawler. The all-powerful Python Standard Library can help us here: it provides the urllib module, which we will use to build the program.

Version 1.0: capture and print the web page of any website:

import urllib

url = "http://www.csdn.net"
print urllib.urlopen(url).read()



No, that is not a mistake: you have already written a simple crawler. You have successfully captured the CSDN homepage and printed it to the terminal. Of course, the output scrolls by as a wall of raw HTML, so you can hardly read any of it, right?
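A side note before we move on, since this post dates from 2014: the code above is Python 2. If you are on Python 3, urlopen has moved into urllib.request and print is a function, so the equivalent would be:

# Python 3 equivalent of V1.0 (urllib was split into submodules)
import urllib.request

url = "http://www.csdn.net"
print(urllib.request.urlopen(url).read().decode("utf-8", "ignore"))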
Well then, we need to update the program. Haha, let's upgrade V1.0 to V1.1. The new feature: write the captured web page to a file, save it as HTML, and then open it in a browser.

Version 1.1: Save to a file
# -*- coding: utf-8 -*-
import urllib  # the Python standard library provides urllib

url = "http://www.hao123.com"  # the address we want to crawl; be sure not to forget the http:// prefix
tmp_file = open("/home/yg/Code/Python/tmp.html", "w")  # open the file; the argument is your path plus file name
tmp_file.write(urllib.urlopen(url).read())  # urllib.urlopen() takes the url and returns a file-like object (see the docs)
tmp_file.close()  # close the file
# now open the saved file in a browser to take a look


Through V1.1 we can capture web pages and store them in files. But do we really have to open and close the file ourselves every time? Is there a quicker way? There is. This time we update to V1.2.

Version 1.2: save to a file in another way.

import urllib

url = "http://www.hao123.com"
filename = "./test.html"
urllib.urlretrieve(url, filename)  # the first parameter is the url, the second is the file path plus file name
# this stores the web page directly into a local file
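A detail worth knowing before the next version: urllib.urlretrieve() also returns a value, a tuple of the local file name and the response headers. We will print it in V1.3, so here is a quick look:

# urlretrieve returns a tuple: (local file name, HTTP response headers)
import urllib

saved_name, headers = urllib.urlretrieve("http://www.hao123.com", "./test.html")
print saved_name  # the path the page was saved to
print headers     # a message object holding the server's headers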

Through V1.2 we already know two of urllib.urlretrieve()'s parameters. Now let's use its third parameter to update our version.
Version 1.3: this time, we will add a download progress display.

# -*- coding: utf-8 -*-
import urllib

url = "http://www.hao123.com"
filename = "./test.html"

def reporthook(count, block_size, total_size):
    """The download-progress callback.
    @count: how many data blocks have been downloaded so far
    @block_size: the size of one data block, in bytes
    @total_size: the total size of the file
    From these three values you can compute the progress: once
    count * block_size reaches total_size, the download is done."""
    per = 100.0 * count * block_size / total_size  # per is the percentage completed
    print "Download Percent: %.2f%%" % per

down_log = urllib.urlretrieve(url, filename, reporthook)
# reporthook is a callback function: you pass in the function name (really a
# pointer to the function, an address), and urlretrieve calls it back with
# the three parameters above
print down_log
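One small wrinkle the code above glosses over: on the final callback, count * block_size usually overshoots total_size, so the printed percentage can exceed 100. A minimal tweak of my own (not from the original post) is to clamp it:

def reporthook(count, block_size, total_size):
    # clamp at 100% because the final block usually overshoots total_size
    # (this also assumes the server sent a Content-Length, i.e. total_size > 0)
    per = min(100.0 * count * block_size / total_size, 100.0)
    print "Download Percent: %.2f%%" % per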

Version 1.4: can we only download web pages? What about downloading a file? Let's give it a try.
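The original post stops here before showing the V1.4 code, but the idea follows directly from V1.2: urllib.urlretrieve() does not care whether the URL points at an HTML page or a binary file. A minimal sketch, with a placeholder image URL of my own (not from the original):

# -*- coding: utf-8 -*-
# V1.4 sketch: download a file instead of a web page
# (the URL below is a placeholder, not from the original post)
import urllib

url = "http://www.example.com/logo.png"  # any direct link to a file works
filename = "./logo.png"                  # save it under a local name
urllib.urlretrieve(url, filename)        # same call as V1.2, different content type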


