Python practice, web crawler (beginner)

Source: Internet
Author: User

I have recently been reading the Python version of the RCNN code, so as a piece of Python programming practice I also wrote a small web crawler.

Crawling a web page works the same way as browsing the web with an ordinary browser such as Internet Explorer. For example, when you type www.baidu.com into the browser's address bar and open the page, the browser acts as the client: it sends a request to the server, the server returns the page's files, and the browser then interprets and displays them. HTML is a markup language that tags and structures the content. The browser's job is to parse the HTML code it receives and turn that raw code into the web page we actually see.

A Uniform Resource Identifier (URI) identifies a resource; a Uniform Resource Locator (URL) additionally tells you where to find it. URLs are a subset of URIs.
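
As a quick aside (not from the original article), the standard library's urlparse module shows which parts make up a URL; the image path used here is just an illustrative example:

# Python 2: split a URL into its components (scheme, host, path)
from urlparse import urlparse

parts = urlparse('http://www.baidu.com/img/bd_logo1.png')
print parts.scheme   # 'http'
print parts.netloc   # 'www.baidu.com'
print parts.path     # '/img/bd_logo1.png'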

The principle of a web crawler is very simple: starting from a URL you give it, the crawler works outward, downloading the HTML code of each URL it reaches. Based on what you want to extract, you observe the regular structure of that HTML, write the corresponding regular expressions, pull the required content out of the code, save it in a list, and then process the extracted data according to your specific needs. That is all a web crawler is: a program that processes the HTML code of a set of web pages. (Of course, this describes only a small crawler; large crawlers typically start many threads to handle the URLs obtained in each round.) To implement the regular-expression part you need to import the re module, and to load and read URLs you need the urllib2 module. A minimal sketch of this loop is given right after this paragraph.
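
To make that flow concrete, here is a minimal sketch of such a crawl loop using only urllib2 and re. It is an illustration under assumptions, not code from the original article: the starting URL, the page limit, and the naive link pattern are placeholders.

# Minimal crawl-loop sketch (Python 2): fetch a page, pull out links with a
# regular expression, and queue them for further crawling.
import re
import urllib2

def crawl(start_url, max_pages=5):
    # Naive link pattern: absolute http(s) links inside href attributes.
    # This is an assumption; real pages may need a proper HTML parser.
    link_re = re.compile(r'href="(https?://[^"]+)"')
    to_visit = [start_url]   # URLs waiting to be fetched
    seen = set()             # URLs already fetched
    pages = []               # (url, html) pairs collected so far
    while to_visit and len(pages) < max_pages:
        url = to_visit.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urllib2.urlopen(url).read()
        except urllib2.URLError:
            continue         # skip pages that cannot be reached
        pages.append((url, html))
        to_visit.extend(link_re.findall(html))
    return pages

print len(crawl('http://www.baidu.com/'))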

Code to fetch and display a page:

import urllib2

response = urllib2.urlopen('http://www.baidu.com/')
html = response.read()
print html

Of course, an exception can also occur while requesting service from the server: a URLError is raised when there is no network connection (no route to the particular server) or when the server does not exist.
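
A small example of how that exception could be caught with a try/except block (a sketch, reusing the same address as above):

import urllib2

try:
    response = urllib2.urlopen('http://www.baidu.com/')
    html = response.read()
except urllib2.URLError as e:
    # e.reason explains why the request failed, e.g. no route to the host
    print 'Request failed:', e.reason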

Processing the HTML code of the page:

import urllib2
import re

def getimg(html):
    # match the address of every .jpg image embedded in the page
    reg = r'src="(.+?\.jpg)" pic_ext'
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html)
    return imglist

The code above finds the URLs of all the images in the HTML passed in as a parameter, saves them in a list, and returns that list.
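
One possible way to drive getimg end to end is the following sketch; the page address is only a placeholder, and it assumes the page really does contain src="....jpg" pic_ext attributes as the regular expression expects:

import urllib
import urllib2

# the address below is a placeholder for a page that embeds .jpg images
html = urllib2.urlopen('http://www.example.com/some-page.html').read()
imglist = getimg(html)

# save each picture to the current directory as 0.jpg, 1.jpg, ...
for i, imgurl in enumerate(imglist):
    urllib.urlretrieve(imgurl, '%d.jpg' % i)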

This article is fairly basic, and I hope readers will be generous with corrections. Beyond the basic modules used here, there is also a more powerful Python crawler framework, Scrapy.
