While reading through the Python version of the RCNN code, I took the opportunity to practice Python programming by writing a small web crawler.
Crawling a web page works the same way as ordinary browsing. For example, when you type www.baidu.com into the browser's address bar and open the page, the browser acts as a client: it sends a request to the server, "fetches" the server's file to the local machine, and then interprets and displays it. HTML is a markup language that tags content so it can be parsed and distinguished. The browser's job is to parse the HTML it receives and turn that raw code into the web page we actually see.
A Uniform Resource Identifier (URI) names a resource; a Uniform Resource Locator (URL) additionally tells you where to find it. URLs are a subset of URIs.
In general, the principle of a web crawler is very simple. Starting from a URL you supply, it downloads the HTML of each page it visits; you observe the structure of the HTML around the content you want, write a matching regular expression, extract the required pieces from the HTML, save them in a list, and then process the extracted content according to your specific needs. That is all a web crawler is: a program that processes the HTML of a number of web pages. (Of course, this describes only a small crawler; large crawlers typically run many threads to handle the batch of URLs gathered in each round.) To use regular expressions you need to import the re package, and to open and read URLs you need to import the urllib2 package.
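The fetch-then-extract pipeline described above can be sketched as follows. This is a minimal sketch for Python 3, where urllib2 has become urllib.request; the URL, regular expression, and sample HTML are only illustrations, not part of the original program.

```python
# A minimal sketch of the crawl-then-extract pipeline, in Python 3
# (urllib2 is Python 2 only; Python 3 uses urllib.request instead).
import re
import urllib.request

def fetch(url):
    # Download the raw HTML of a page as text.
    return urllib.request.urlopen(url).read().decode('utf-8', errors='replace')

def extract(html, pattern):
    # Pull the wanted pieces out of the HTML with a regular expression.
    return re.findall(pattern, html)

# The regex step demonstrated on a small HTML snippet (no network needed):
sample = '<a href="http://example.com/page.html">a link</a>'
print(extract(sample, r'href="(.+?)"'))  # prints ['http://example.com/page.html']
```

A real run would pass the result of `fetch(url)` into `extract` instead of the hard-coded snippet.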
Fetching and displaying a page:

```python
import urllib2

response = urllib2.urlopen('http://www.baidu.com/')
html = response.read()
print html
```
Of course, exceptions can also occur while requesting the server: urllib2 raises URLError when there is no network connection (no route to the particular server) or when the server does not exist.
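One way to guard against that exception is shown below. This is a hedged sketch for Python 3, where the error class lives in urllib.error; the URL is deliberately invalid so the error branch is taken.

```python
# Sketch of catching URLError in Python 3 (in Python 2 the same
# class is urllib2.URLError); safe_open is a name made up here.
import urllib.request
from urllib.error import URLError

def safe_open(url):
    # Return the page content, or None if the request fails.
    try:
        return urllib.request.urlopen(url, timeout=5).read()
    except URLError as e:
        print('request failed:', e.reason)
        return None

# '.invalid' is a reserved TLD that never resolves, so this call
# always takes the error branch and returns None:
safe_open('http://nonexistent.invalid/')
```

HTTPError (for responses like 404) is a subclass of URLError, so this handler catches it as well.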
Processing the HTML code of the page:

```python
import urllib2
import re

def getimg(html):
    reg = r'src="(.+?\.jpg)" pic_ext'
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html)
    return imglist
```
The above code finds the URLs of all the images in the HTML page passed in as a parameter, saves them in a list, and returns the whole list.
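To see what that list looks like, here is the same regular expression run on a made-up HTML fragment; the `pic_ext` attribute in the sample simply mirrors the markup the pattern targets.

```python
# The getimg function applied to an invented HTML fragment,
# rewritten for Python 3 (only the import source differs).
import re

def getimg(html):
    reg = r'src="(.+?\.jpg)" pic_ext'
    imgre = re.compile(reg)
    return re.findall(imgre, html)

sample = '<img src="http://example.com/photo.jpg" pic_ext="jpeg">'
print(getimg(sample))  # prints ['http://example.com/photo.jpg']
```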
This whole article is fairly basic, and I hope readers will be generous with corrections. Beyond the standard-library methods used in the program, there is also a more powerful Python crawler toolkit, Scrapy.
Python practice, web crawler (beginner)