While reading through the Python version of the RCNN code, I took the opportunity to practice Python programming by writing a small web crawler.
Crawling a web page works the same way as ordinary browsing. For example, when you type www.baidu.com into the browser's address bar and open the page, the browser acts as a client: it sends a request to the server, "fetches" the server's file to the local machine, and then interprets and displays it. HTML is a markup language that tags content so it can be parsed and distinguished. The browser's job is to parse the HTML it receives and turn that raw code into the web page we actually see.
A Uniform Resource Identifier (URI) names a resource; a Uniform Resource Locator (URL) additionally tells you where to find it. URLs are a subset of URIs.
In general, the principle of a web crawler is very simple. Starting from a URL you supply, it downloads the HTML of each page it visits; you observe the structure of the HTML around the content you want, write a matching regular expression, extract the required pieces from the HTML, save them in a list, and then process the extracted content according to your specific needs. That is all a web crawler is: a program that processes the HTML of a number of web pages. (Of course, this describes only a small crawler; large crawlers typically run many threads to handle the batch of URLs gathered in each round.) To use regular expressions you need to import the re package, and to open and read URLs you need to import the urllib2 package.
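The fetch-then-extract pipeline described above can be sketched as follows. This is a minimal sketch for Python 3, where urllib2 has become urllib.request; the URL, regular expression, and sample HTML are only illustrations, not part of the original program.

```python
# A minimal sketch of the crawl-then-extract pipeline, in Python 3
# (urllib2 is Python 2 only; Python 3 uses urllib.request instead).
import re
import urllib.request

def fetch(url):
    # Download the raw HTML of a page as text.
    return urllib.request.urlopen(url).read().decode('utf-8', errors='replace')

def extract(html, pattern):
    # Pull the wanted pieces out of the HTML with a regular expression.
    return re.findall(pattern, html)

# The regex step demonstrated on a small HTML snippet (no network needed):
sample = '<a href="http://example.com/page.html">a link</a>'
print(extract(sample, r'href="(.+?)"'))  # prints ['http://example.com/page.html']
```

A real run would pass the result of `fetch(url)` into `extract` instead of the hard-coded snippet.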
Fetching and displaying a page:

```python
import urllib2

response = urllib2.urlopen('http://www.baidu.com/')
html = response.read()
print html
```
Of course, exceptions can also occur while requesting the server: urllib2 raises URLError when there is no network connection (no route to the particular server) or when the server does not exist.
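One way to guard against that exception is shown below. This is a hedged sketch for Python 3, where the error class lives in urllib.error; the URL is deliberately invalid so the error branch is taken.

```python
# Sketch of catching URLError in Python 3 (in Python 2 the same
# class is urllib2.URLError); safe_open is a name made up here.
import urllib.request
from urllib.error import URLError

def safe_open(url):
    # Return the page content, or None if the request fails.
    try:
        return urllib.request.urlopen(url, timeout=5).read()
    except URLError as e:
        print('request failed:', e.reason)
        return None

# '.invalid' is a reserved TLD that never resolves, so this call
# always takes the error branch and returns None:
safe_open('http://nonexistent.invalid/')
```

HTTPError (for responses like 404) is a subclass of URLError, so this handler catches it as well.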
Processing the HTML code of the page:

```python
import urllib2
import re

def getimg(html):
    reg = r'src="(.+?\.jpg)" pic_ext'
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html)
    return imglist
```
The above code finds the URLs of all the images in the HTML page passed in as a parameter, saves them in a list, and returns the whole list.
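To see what that list looks like, here is the same regular expression run on a made-up HTML fragment; the `pic_ext` attribute in the sample simply mirrors the markup the pattern targets.

```python
# The getimg function applied to an invented HTML fragment,
# rewritten for Python 3 (only the import source differs).
import re

def getimg(html):
    reg = r'src="(.+?\.jpg)" pic_ext'
    imgre = re.compile(reg)
    return re.findall(imgre, html)

sample = '<img src="http://example.com/photo.jpg" pic_ext="jpeg">'
print(getimg(sample))  # prints ['http://example.com/photo.jpg']
```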
This whole article is fairly basic, and I hope readers will be generous with corrections. Beyond the standard-library methods used in the program, there is also a more powerful Python crawler toolkit, Scrapy.
Python practice, web crawler (beginner)