Python Exercises: Web Crawlers (Beginner)



Recently, while reading the Python version of the RCNN code, I have also been practicing Python programming by writing a small web crawler.

In fact, fetching a webpage is the same process as browsing it in a browser such as IE. For example, when you enter www.baidu.com in the browser's address bar, the browser acts as a "client": it sends a request to the server, "grabs" the file the server returns, and then interprets and presents it. HTML is a markup language that uses tags to label content so that it can be parsed and distinguished. The browser's job is to parse the HTML code it receives and turn that raw code into the webpage we actually see.

The Uniform Resource Identifier (URI) and the Uniform Resource Locator (URL) are closely related: a URI identifies a resource, while a URL also tells you where to find it, so URLs are a subset of URIs.

In general, the principle of a web crawler is very simple: starting from a URL you provide in advance, it crawls outward from that URL and downloads the HTML code of each page. Based on the content you want to extract, you observe the patterns in the HTML code, write a corresponding regular expression, pull out the HTML fragments you need, save them in a list, and process the extracted code as required. That is a web crawler: in effect, a program that processes the HTML code of a number of regular webpages. (Of course, this is just a simple little crawler; a large crawler can spawn many threads to process each fetched URL separately, as sketched below.) To use regular expressions you must import the re package, and to open and read a URL you must import the urllib2 package.
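
As a rough sketch of that multi-threaded variant (not part of the original program; the worker count and seed URL below are made-up examples), Python 2's threading and Queue modules could be combined like this:

import threading
import urllib2
from Queue import Queue

def worker(queue):
    # Each thread repeatedly takes a URL off the shared queue and fetches it.
    while True:
        url = queue.get()
        try:
            html = urllib2.urlopen(url).read()
            print '%s: fetched %d bytes' % (url, len(html))
        except urllib2.URLError as e:
            print '%s: failed (%s)' % (url, e.reason)
        finally:
            queue.task_done()

queue = Queue()
for _ in range(4):  # number of worker threads, chosen arbitrarily for this sketch
    t = threading.Thread(target=worker, args=(queue,))
    t.daemon = True  # daemon threads let the program exit once the queue drains
    t.start()

for url in ['http://www.baidu.com/']:  # seed URLs; replace with your own list
    queue.put(url)
queue.join()  # block until every queued URL has been processed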

Display webpage code:

import urllib2

# Request the page and read the raw HTML back as a string.
response = urllib2.urlopen('http://www.baidu.com/')
html = response.read()
print html

Of course, an exception can occur when you request the server's services: urllib2 raises a URLError when there is no network connection (no route to the specified server) or when the server does not exist.
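
A minimal sketch of how such a request might be guarded, following the standard urllib2 pattern:

import urllib2

try:
    response = urllib2.urlopen('http://www.baidu.com/')
    html = response.read()
except urllib2.URLError as e:
    # e.reason describes why the request failed, e.g. no route to host.
    print 'Failed to reach the server:', e.reason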

Process the HTML code of a webpage:

import urllib2
import re

def getimg(html):
    # Match the src attribute of every .jpg image tagged with pic_ext.
    reg = r'src="(.+?\.jpg)" pic_ext'
    imgre = re.compile(reg)
    # findall returns all captured image URLs as a list of strings.
    imglist = re.findall(imgre, html)
    return imglist

The code above finds the URLs of all the images on the given HTML page, saves them in a list, and returns that list.
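
Putting the two pieces together, a hypothetical end-to-end run might fetch a page, extract the image URLs with the getimg function above, and save each image with urllib.urlretrieve; the page URL and output filenames here are only illustrative:

import urllib
import urllib2

# Fetch the page, then download every matched image as 0.jpg, 1.jpg, ...
html = urllib2.urlopen('http://www.baidu.com/').read()
for i, imgurl in enumerate(getimg(html)):
    urllib.urlretrieve(imgurl, '%d.jpg' % i)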

 

This article is fairly basic, and I welcome comments and corrections. Beyond the basic modules used in the program, there is also a more powerful Python crawler framework, Scrapy.
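
For comparison, a minimal Scrapy spider along the same lines might look like the sketch below; the spider name and start URL are illustrative, and a CSS selector stands in for the hand-written regular expression:

import scrapy

class ImageSpider(scrapy.Spider):
    name = 'images'
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        # Select the src attribute of every <img> tag on the page.
        for src in response.css('img::attr(src)').extract():
            yield {'image_url': src}

Such a spider can be run with a command like scrapy runspider image_spider.py -o images.json, which writes the yielded items to a JSON file.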
