Python web crawler implementation code
First, let's look at the Python libraries for fetching web pages: urllib and urllib2.
What is the difference between urllib and urllib2?
urllib2 can be viewed as an extension of urllib. Its obvious advantage is that urllib2.urlopen() accepts a Request object as its parameter, which lets you control the headers of the HTTP request.
urllib2 should be preferred for HTTP requests whenever possible, but urllib.urlretrieve() and the quote/unquote family of functions (urllib.quote(), urllib.unquote(), and so on) were never carried over to urllib2, so urllib is still needed as well.
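As a minimal sketch of how the two modules complement each other (Python 2; the URL and the User-Agent value are just placeholders), the combination might look like this:

import urllib
import urllib2

# urllib2: a Request object lets us set the HTTP headers explicitly
request = urllib2.Request('http://www.baidu.com',
                          headers={'User-Agent': 'Mozilla/5.0'})
response = urllib2.urlopen(request)
print response.read()

# urllib: quote/unquote have no urllib2 equivalent
print urllib.quote('hello world')      # hello%20world
print urllib.unquote('hello%20world')  # hello world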
The URL passed to urllib.urlopen() must follow one of the supported protocol schemes, such as http, ftp, or file. For example:

urllib.urlopen('http://www.baidu.com')
urllib.urlopen('file:D:\Python\Hello.py')
Here is an example that downloads all the gif images from a web page. The Python code is as follows:
import re
import urllib

def getHtml(url):
    # Fetch the page and return its HTML source
    page = urllib.urlopen(url)
    html = page.read()
    return html

def getImg(html):
    # Match every src="....gif" attribute in the page
    reg = r'src="(.*?\.gif)"'
    imgre = re.compile(reg)
    imgList = re.findall(imgre, html)
    print imgList
    cnt = 1
    for imgurl in imgList:
        # Save each image locally as 1.gif, 2.gif, ...
        urllib.urlretrieve(imgurl, '%s.gif' % cnt)
        cnt += 1

if __name__ == '__main__':
    html = getHtml('http://www.baidu.com')
    getImg(html)
Based on this approach, we can fetch a web page and then extract the data we need.
In practice, crawling with the urllib module is quite inefficient. Next, we introduce the Tornado web server.
Tornado is a lightweight, highly scalable, non-blocking-I/O web server written in Python; the well-known FriendFeed website was built on it. Unlike most mainstream web server frameworks (mainly Python frameworks), Tornado uses epoll for non-blocking I/O, responds quickly, and can handle thousands of concurrent connections, which makes it especially suitable for real-time web services.
Fetching web pages with Tornado's HTTP client is accordingly more efficient.
According to Tornado's official documentation, we also need to install backports.ssl_match_hostname. The official site is:
http://www.tornadoweb.org/en/stable/
import tornado.httpclient

def Fetch(url):
    # Build the request with a custom User-Agent header and generous timeouts
    http_header = {'User-Agent': 'Chrome'}
    http_request = tornado.httpclient.HTTPRequest(
        url=url, method='GET', headers=http_header,
        connect_timeout=200, request_timeout=600)
    # Synchronous (blocking) HTTP client
    http_client = tornado.httpclient.HTTPClient()
    print 'Start downloading data...'
    http_response = http_client.fetch(http_request)
    print 'Finish downloading data...'
    print http_response.code
    # Dump all response headers, then the body
    all_fields = http_response.headers.get_all()
    for field in all_fields:
        print field
    print http_response.body

if __name__ == '__main__':
    Fetch('http://www.baidu.com')
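Note that Fetch() above uses the blocking HTTPClient. Tornado's non-blocking I/O, the feature that makes it fast, is exposed through AsyncHTTPClient driven by the IOLoop. A rough sketch, assuming the callback-style API of the Tornado versions contemporary with this article (the URL is again just a placeholder):

import tornado.httpclient
import tornado.ioloop

def handle_response(response):
    # Invoked by the IOLoop once the download completes
    if response.error:
        print 'Error:', response.error
    else:
        print response.code, len(response.body), 'bytes downloaded'
    # Stop the loop so the script can exit
    tornado.ioloop.IOLoop.instance().stop()

if __name__ == '__main__':
    http_client = tornado.httpclient.AsyncHTTPClient()
    http_client.fetch('http://www.baidu.com', callback=handle_response)
    tornado.ioloop.IOLoop.instance().start()

Because fetch() returns immediately, many such requests can be issued before starting the IOLoop, and they will then be downloaded concurrently.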
Common methods on the response object returned by urllib2.urlopen() (see the sketch after this list):
(1) info() returns the header information of the page.
(2) getcode() returns the HTTP status code of the response.
(3) geturl() returns the URL that was actually retrieved, which is useful after redirects.
(4) read() reads the content of the response body.
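A quick sketch tying these four methods together (Python 2; baidu.com is just the example URL used throughout this article):

import urllib2

response = urllib2.urlopen('http://www.baidu.com')
print response.info()     # (1) the response headers
print response.getcode()  # (2) the status code, e.g. 200
print response.geturl()   # (3) the URL actually fetched
html = response.read()    # (4) the body as a string
print '%d bytes read' % len(html)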