I. Definition of a web crawler
A web crawler is also called a web spider, which is a very vivid name.
If the Internet is pictured as a spider's web, then the spider is a program that crawls around on that web.
A web spider finds web pages by following their URLs.
Starting from one page of a site (usually the home page), it reads the content of the page and finds the other link addresses in it,
then uses those link addresses to find the next pages, and keeps looping until every page of the site has been crawled.
If the whole Internet is treated as one website, a web spider can use the same principle to crawl every page on the Internet.
In this sense, a web crawler is simply a program that crawls web pages.
Its basic operation is fetching a web page.
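The loop described above (read a page, collect its links, follow them until nothing new turns up) can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the names LinkExtractor, crawl_site and SITE are made up for this sketch, and the pages come from an in-memory dictionary of hypothetical example.com pages instead of real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_site(start_url, fetch):
    """Visit every page reachable from start_url on the same host.

    fetch is any callable that takes a URL and returns the page's HTML
    as a string, or None if the page cannot be read.
    """
    host = urlsplit(start_url).netloc
    seen = {start_url}            # URLs already queued, so no page is visited twice
    queue = deque([start_url])    # pages waiting to be read
    visited = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links against the page
            if urlsplit(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical in-memory "site" standing in for real HTTP requests.
SITE = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.com/a.html": '<a href="/">home</a> <a href="http://other.example/">x</a>',
    "http://example.com/b.html": '<a href="/a.html">A</a>',
}

print(crawl_site("http://example.com/", SITE.get))
```

The set of seen URLs is what makes the loop stop: a link that has already been queued is never queued again, and links to other hosts are ignored so the crawl stays on one site.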
So how do you get the page you want?
Let's start with the URL first.
II. The process of browsing the web
Crawling a web page works much the same way as browsing the web with a browser such as Internet Explorer.
For example, you type the address www.baidu.com into the address bar of your browser.
When the page opens, the browser acts as a browsing "client": it sends a request to the server, "fetches" the server's files to the local machine, and then interprets and displays them.
HTML is a markup language that uses tags to mark up content so that it can be parsed and its parts told apart.
The browser's job is to parse the HTML code it receives and turn that source code into the page we actually see.
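The request and response exchange described above can be tried without touching the Internet. The following sketch is an assumption-laden illustration, not how a real browser is built: it starts a throwaway HTTP server on the local machine (PageHandler and PAGE are hypothetical names), then plays the "client" role with urllib.request and returns what the server sent back.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# The "file" the server hands out (a hypothetical page for this sketch).
PAGE = b"<html><body><h1>Hello</h1></body></html>"


class PageHandler(BaseHTTPRequestHandler):
    """Server side: answer every GET request with the same small HTML page."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_from_local_server():
    """Start a local server, request its page as a client, return the reply."""
    server = HTTPServer(("127.0.0.1", 0), PageHandler)  # port 0: any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_address[1]
        # An opener without proxies, so the request goes straight to the local server.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(url, timeout=5) as response:
            return response.status, response.headers.get_content_type(), response.read()
    finally:
        server.shutdown()
        server.server_close()


status, content_type, body = fetch_from_local_server()
print(status, content_type)
print(body.decode("utf-8"))
```

What comes back is just the HTML source; turning it into a visible page is the parsing and rendering step that the browser adds on top.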
III. Concepts and examples of URIs and URLs
In simple terms, a URL is the string, such as http://www.baidu.com, that you type into the browser.
Before looking at URLs, first understand the concept of the URI.
What is a URI?
Every resource available on the Web, such as an HTML document, image, video clip, or program, is located by a Uniform Resource Identifier (URI).
A URI usually consists of three parts:
① the naming mechanism used to access the resource;
② the name of the host where the resource is stored;
③ the name of the resource itself, given by its path.
Take the following URI:
http://www.why.com.cn/myhtml/html1223/
We can explain it this way:
① it is a resource that can be accessed through the HTTP protocol,
② it is located on the host www.why.com.cn,
③ it is accessed through the path "/myhtml/html1223/".
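The same three parts can be read off the example URI programmatically. A minimal sketch, assuming Python's standard urllib.parse module is an acceptable stand-in for "the naming mechanism, the host name, and the path":

```python
from urllib.parse import urlsplit

# The example URI from the text above.
parts = urlsplit("http://www.why.com.cn/myhtml/html1223/")

print(parts.scheme)    # ① naming mechanism: http
print(parts.hostname)  # ② host name: www.why.com.cn
print(parts.path)      # ③ path of the resource: /myhtml/html1223/
```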
IV. Understanding and examples of URLs
URLs are a subset of URIs. URL is short for Uniform Resource Locator.
Put simply, a URL is a string that describes an information resource on the Internet; URLs are used mainly in WWW client and server programs.
A URL describes all kinds of information resources, including files, server addresses, and directories, in one uniform format.
The general format of a URL is as follows (parts in square brackets [] are optional):
protocol://hostname[:port]/path/[;parameters][?query]#fragment
A URL consists of three parts:
① The first part is the protocol (also called the service mode).
② The second part is the host name or IP address (sometimes with a port number) of the machine where the resource is stored.
③ The third part is the specific address of the resource on the host, such as a directory and file name.
The first and second parts are separated by the "://" symbol,
and the second and third parts are separated by a "/" symbol.
The first and second parts are required; the third part can sometimes be omitted.
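To see how the general format maps onto a concrete string, the sketch below splits a URL into its components with urllib.parse.urlparse from the Python standard library. The URL itself is hypothetical, invented here so that every optional part (port, parameters, query, fragment) is present.

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical URL that uses every optional component of the general format.
url = "http://www.example.com:8080/docs/page.html;type=a?name=crawler&page=2#top"
p = urlparse(url)

print(p.scheme)    # http              (protocol)
print(p.hostname)  # www.example.com   (host name)
print(p.port)      # 8080              (optional port)
print(p.path)      # /docs/page.html   (path)
print(p.params)    # type=a            (optional parameters)
print(p.query)     # name=crawler&page=2
print(p.fragment)  # top

# The query string can be broken down further into names and values.
print(parse_qs(p.query))  # {'name': ['crawler'], 'page': ['2']}

# When the third part is omitted, the path is empty and no port is set.
minimal = urlparse("http://www.baidu.com")
print(repr(minimal.path), minimal.port)  # '' None
```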
V. Simple comparison of URLs and URIs
A URI is a more general, more abstract notion than a URL; it is a standard for the strings that identify resources.
In other words, URI is the parent class and URL is a subclass of it: URLs are a subset of URIs.
URI stands for Uniform Resource Identifier;
URL stands for Uniform Resource Locator.
The difference is that a URI identifies the resource being requested from the server and defines that such a resource exists,
while a URL additionally states how to access the resource (for example, http://).
Let's look at small examples of two kinds of URL below.
1. URLs for the HTTP protocol
These use the Hypertext Transfer Protocol (HTTP) and point to resources that provide hypertext information services.
Example: http://www.peopledaily.com.cn/channel/welcome.htm
Its host domain name is www.peopledaily.com.cn.
The hypertext file (of type .html) is welcome.htm in the directory /channel.
This is a computer belonging to the People's Daily in China.
Example: http://www.rol.cn.net/talk/talk1.htm
Its host domain name is www.rol.cn.net.
The hypertext file (of type .html) is talk1.htm in the directory /talk.
This is the address of a chat room; it leads to room 1 of the chat room.
2. URLs for files
When a URL refers to a file, the service mode is written as file, followed by information such as the host IP address, the file's access path (that is, the directory), and the file name.
The directory and file name can sometimes be omitted, but the "/" symbol cannot.
Example: file://ftp.yoyodyne.com/pub/files/foobar.txt
This URL represents a file named foobar.txt stored in the directory /pub/files/ on the host ftp.yoyodyne.com.
Example: file://ftp.yoyodyne.com/pub
This represents the directory /pub on the host ftp.yoyodyne.com.
Example: file://ftp.yoyodyne.com/
This represents the root directory of the host ftp.yoyodyne.com.
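The HTTP and file examples above can be taken apart the same way in code. The helper below, describe, is a hypothetical name introduced only for this sketch; it splits a URL into protocol, host, directory and file name using urllib.parse and posixpath from the Python standard library. Note that the split is purely textual: without a trailing "/", the last path segment is reported as a file name, so "/pub" comes back as a file called pub in "/" even though the text above treats it as a directory. The URL string alone cannot settle that.

```python
import posixpath
from urllib.parse import urlsplit


def describe(url):
    """Return (protocol, host, directory, file name) for a URL."""
    parts = urlsplit(url)
    # Everything before the last "/" is the directory, the rest is the file name.
    directory, filename = posixpath.split(parts.path)
    return parts.scheme, parts.hostname, directory, filename


print(describe("http://www.peopledaily.com.cn/channel/welcome.htm"))
# ('http', 'www.peopledaily.com.cn', '/channel', 'welcome.htm')

print(describe("file://ftp.yoyodyne.com/pub/files/foobar.txt"))
# ('file', 'ftp.yoyodyne.com', '/pub/files', 'foobar.txt')

print(describe("file://ftp.yoyodyne.com/pub"))
# ('file', 'ftp.yoyodyne.com', '/', 'pub')

print(describe("file://ftp.yoyodyne.com/"))
# ('file', 'ftp.yoyodyne.com', '/', '')
```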
The main thing a crawler works with is the URL: it fetches the required file content according to the URL and then processes it further.
Therefore, an accurate understanding of URLs is critical to understanding web crawlers.