Python crawler toy story (1): urllib

Created Monday, 17 March 2014

What is a crawler?


Wikipedia's explanation: http://en.wikipedia.org/wiki/Web_spider
Going by the Wikipedia entry, a crawler is, first of all, a program (obvious, I know...). What can this program do? It can capture web pages and save their data. Capturing a page off the Internet has no great technical depth in itself; the key difficulty is analyzing the page and extracting the data you want, like the idiom about taking a single ladle from three thousand rivers of weak water: of everything on the page, you keep only the part you need. One of the core technologies behind search engines such as Baidu and Google is a very good crawler.
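To make that fetch-versus-extract distinction concrete, here is a minimal sketch of my own (not from the original post; the URL and the regular expression are just examples) that fetches a page and pulls out only its title:

# A minimal fetch-and-extract sketch (illustrative only).
import re
import urllib

html = urllib.urlopen("http://www.csdn.net").read()    # fetching is the easy part
match = re.search(r"<title>(.*?)</title>", html, re.S) # extracting is where the real work is
if match:
    print match.group(1)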

How do crawlers work?

For theory like this I can only lean on the blogs of the experts. Here is a recommended read:
Basic Principles of Web Crawlers

Let's get started.

The following code uses Python to implement a simple web crawler. The all-powerful Python Standard Library can help us here: it provides the urllib module, which we will use to build the program.

Version 1.0: capture and print the web page of any website:

import urllib

url = "http://www.csdn.net"
print urllib.urlopen(url).read()



No, that is not a mistake: you have already written a simple crawler. You have successfully captured the CSDN homepage and printed it to the terminal. Of course, the output scrolls by as a wall of raw HTML, so you can hardly read any of it, right?
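A side note before we move on, since this post dates from 2014: the code above is Python 2. If you are on Python 3, urlopen has moved into urllib.request and print is a function, so the equivalent would be:

# Python 3 equivalent of V1.0 (urllib was split into submodules)
import urllib.request

url = "http://www.csdn.net"
print(urllib.request.urlopen(url).read().decode("utf-8", "ignore"))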
Well then, we need to update the program. Haha, let's upgrade V1.0 to V1.1. The new feature: write the captured web page to a file, save it as HTML, and then open it in a browser.

Version 1.1: Save to a file
# -*- coding: utf-8 -*-
import urllib  # the Python standard library provides urllib

url = "http://www.hao123.com"  # the address we want to crawl; be sure not to forget the http:// prefix
tmp_file = open("/home/yg/Code/Python/tmp.html", "w")  # open the file; the argument is your path plus file name
tmp_file.write(urllib.urlopen(url).read())  # urllib.urlopen() takes the url and returns a file-like object (see the docs)
tmp_file.close()  # close the file
# now open the saved file in a browser to take a look


Through V1.1 we can capture web pages and store them in files. But do we really have to open and close the file ourselves every time? Is there a quicker way? There is. This time we update to V1.2.

Version 1.2: save to a file in another way.

import urllib

url = "http://www.hao123.com"
filename = "./test.html"
urllib.urlretrieve(url, filename)  # the first parameter is the url, the second is the file path plus file name
# this stores the web page directly into a local file
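A detail worth knowing before the next version: urllib.urlretrieve() also returns a value, a tuple of the local file name and the response headers. We will print it in V1.3, so here is a quick look:

# urlretrieve returns a tuple: (local file name, HTTP response headers)
import urllib

saved_name, headers = urllib.urlretrieve("http://www.hao123.com", "./test.html")
print saved_name  # the path the page was saved to
print headers     # a message object holding the server's headers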

Through V1.2 we already know two of urllib.urlretrieve()'s parameters. Now let's use its third parameter to update our version.
Version 1.3: this time, we will add a download progress display.

# -*- coding: utf-8 -*-
import urllib

url = "http://www.hao123.com"
filename = "./test.html"

def reporthook(count, block_size, total_size):
    """The download-progress callback.
    @count: how many data blocks have been downloaded so far
    @block_size: the size of one data block, in bytes
    @total_size: the total size of the file
    From these three values you can compute the progress: once
    count * block_size reaches total_size, the download is done."""
    per = 100.0 * count * block_size / total_size  # per is the percentage completed
    print "Download Percent: %.2f%%" % per

down_log = urllib.urlretrieve(url, filename, reporthook)
# reporthook is a callback function: you pass in the function name (really a
# pointer to the function, an address), and urlretrieve calls it back with
# the three parameters above
print down_log
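One small wrinkle the code above glosses over: on the final callback, count * block_size usually overshoots total_size, so the printed percentage can exceed 100. A minimal tweak of my own (not from the original post) is to clamp it:

def reporthook(count, block_size, total_size):
    # clamp at 100% because the final block usually overshoots total_size
    # (this also assumes the server sent a Content-Length, i.e. total_size > 0)
    per = min(100.0 * count * block_size / total_size, 100.0)
    print "Download Percent: %.2f%%" % per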

Version 1.4: can we only download web pages? What about downloading a file? Let's give it a try.
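The original post stops here before showing the V1.4 code, but the idea follows directly from V1.2: urllib.urlretrieve() does not care whether the URL points at an HTML page or a binary file. A minimal sketch, with a placeholder image URL of my own (not from the original):

# -*- coding: utf-8 -*-
# V1.4 sketch: download a file instead of a web page
# (the URL below is a placeholder, not from the original post)
import urllib

url = "http://www.example.com/logo.png"  # any direct link to a file works
filename = "./logo.png"                  # save it under a local name
urllib.urlretrieve(url, filename)        # same call as V1.2, different content type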


