I. The definition of a web crawler
"Web crawler", also called "Web spider", is a very vivid name.
If the Internet is likened to a spider's web, then the spider is a program crawling back and forth across that web.
Web spiders find Web pages by following their link addresses.
Starting from one page of a website (usually the homepage), the spider reads the content of the page, finds the other link addresses in it,
then follows those link addresses to the next pages, and keeps looping until all the pages of the site have been crawled.
If the entire Internet is treated as a single website, then a Web spider can use the same principle to crawl every page on the Internet.
In this sense, a web crawler is simply a program that crawls Web pages.
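The loop described above (read a page, collect its links, follow them, repeat) can be sketched in Python roughly as follows. This is only a minimal illustration, not a production crawler: the names `LinkExtractor`, `crawl`, `FAKE_SITE`, and the `example.test` addresses are all hypothetical, and the `fetch` argument is assumed to be any function that returns a page's HTML text (or None when the page cannot be retrieved).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl starting at start_url.

    `fetch(url)` is an assumed callback returning HTML text or None.
    Returns the URLs of the pages that were fetched, in visiting order.
    """
    seen = {start_url}          # every address already queued
    queue = deque([start_url])  # addresses waiting to be visited
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page could not be retrieved; skip it
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# A made-up three-page "site" held in memory, so no network is needed.
FAKE_SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/c#top">C</a>',
}
```

Calling `crawl("http://example.test/", FAKE_SITE.get)` visits the homepage first, then `/a`, then `/b`. The `seen` set is what keeps the loop from revisiting pages that link back to each other, and `max_pages` is a safety limit, since on the real Internet the loop would otherwise never finish.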
The basic operation of a web crawler is fetching Web pages.
So how do you fetch exactly the page you want?
Let's start with the URL.
II. The process of browsing the web
Crawling a Web page is actually the same process as a reader browsing that page in a browser such as IE.
For example, you enter the address www.baidu.com in the browser's address bar.
Opening a Web page really means that the browser, acting as the browsing "client", sends a request to the server, "fetches" the server's files to the local machine, and then interprets and displays them.
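As a rough illustration of this request-and-response exchange, the sketch below plays both roles using only the Python standard library. The tiny server, the `DemoHandler`, `fetch`, and `demo` names, and the page content are all made up for the example; a real crawler would only need the client side (`fetch`) pointed at a real address.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

DEMO_HTML = b"<html><body><h1>Hello, crawler</h1></body></html>"


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical server side: answers every GET with one small page."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(DEMO_HTML)))
        self.end_headers()
        self.wfile.write(DEMO_HTML)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch(url, opener=None, timeout=5):
    """Client side: send a GET request, return (status code, page text)."""
    open_url = opener.open if opener else urllib.request.urlopen
    with open_url(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.status, response.read().decode(charset)


def demo():
    """Start the demo server on a free local port and fetch its page."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: OS picks one
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # An opener without proxies, so the local request is sent directly.
    direct = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        port = server.server_address[1]
        return fetch(f"http://127.0.0.1:{port}/", opener=direct)
    finally:
        server.shutdown()
        server.server_close()
```

Calling `demo()` returns the status code together with the HTML text that the server sent back, which is exactly the raw material a browser would go on to interpret and display.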
HTML is a markup language: it uses tags to mark up content so that the content can be parsed and told apart.
The browser's job is to parse the HTML code it has retrieved and turn that raw source into the page we actually see.
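To make the idea of "tags marking content" concrete, here is a small sketch that uses Python's standard `html.parser` module to report which tag encloses each piece of text. It is far simpler than what a browser does, and the names `TagTextCollector` and `tagged_text` are invented for this example.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Records (innermost open tag, text) pairs while parsing HTML."""

    def __init__(self):
        super().__init__()
        self._open_tags = []  # stack of tags that are currently open
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        self._open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self._open_tags:
            while self._open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self._open_tags:
            self.pieces.append((self._open_tags[-1], text))


def tagged_text(html):
    """Return a list of (tag, text) pairs found in the HTML string."""
    parser = TagTextCollector()
    parser.feed(html)
    parser.close()
    return parser.pieces
```

For instance, `tagged_text("<h1>Hi</h1><p>Some <b>bold</b> text</p>")` yields `[("h1", "Hi"), ("p", "Some"), ("b", "bold"), ("p", "text")]`: the tags tell the parser which text is a heading, which is a paragraph, and which is bold.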
III. Concepts and examples of URIs
Simply put, a URL is the string, such as www.baidu.com, that you type into the browser.
Before you can understand URLs, you first need to understand the concept of a URI.
What is a URI?
Each available resource on the Web, such as an HTML document, an image, a video clip, or a program, is identified by a Uniform Resource Identifier (URI).
A URI is usually made up of three parts:
① the naming mechanism (scheme) used to access the resource;
② the name of the host that stores the resource;
③ the name of the resource itself, represented by a path.
Take the following URI as an example:
http://www.why.com.cn/myhtml/html1223/
We can read it this way:
① it is a resource that can be accessed through the HTTP protocol,
② it is located on the host www.why.com.cn,
③ it is reached through the path "/myhtml/html1223/".
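The same three-part breakdown can be reproduced with `urlsplit` from Python's standard `urllib.parse` module. The helper name `describe_uri` is just an illustrative choice, and the sketch only covers the three parts discussed here; real URIs may also carry a port, a query string, or a fragment.

```python
from urllib.parse import urlsplit


def describe_uri(uri):
    """Split a URI into the three parts discussed in the text."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,  # ① how the resource is accessed
        "host": parts.hostname,  # ② the host that stores the resource
        "path": parts.path,      # ③ the resource itself, as a path
    }


parts = describe_uri("http://www.why.com.cn/myhtml/html1223/")
```

Here `parts["scheme"]` is `"http"`, `parts["host"]` is `"www.why.com.cn"`, and `parts["path"]` is `"/myhtml/html1223/"`.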