1. What is a crawler?
A crawler, or web crawler, can be thought of as a spider moving across the Internet. If the Internet is a large web, the crawler is the spider that walks along it and downloads whatever resources it comes across. Which resources it grabs is up to you.
For example, while crawling one web page the crawler finds a path, which is really a hyperlink to another web page, and it can follow that link to fetch more data. In this way every page linked into the web is within the spider's reach, and crawling it is not difficult.
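The link-following idea can be sketched without touching the network. The example below is only an illustration: FAKE_WEB is a made-up dictionary that stands in for real pages and their hyperlinks, and crawl is a hypothetical helper that visits pages breadth-first.

```python
from collections import deque

# Hypothetical miniature "web": each page name maps to the pages it links to.
FAKE_WEB = {
    "home": ["news", "images"],
    "news": ["home", "article"],
    "images": ["article"],
    "article": [],
}

def crawl(start, links):
    """Visit every page reachable from `start`, following links breadth-first."""
    visited = []
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        visited.append(page)            # the page we have just "crawled"
        for target in links.get(page, []):
            if target not in seen:      # skip pages already queued or visited
                seen.add(target)
                queue.append(target)
    return visited

order = crawl("home", FAKE_WEB)
print(order)
```

The seen set is what keeps the spider from walking in circles when two pages link to each other, as "home" and "news" do here.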
2. The process of browsing the web
When users browse a web page they often see many images. At http://image.baidu.com/, for example, we see a few pictures and the Baidu search box. What actually happens is this: the user enters a URL, a DNS server resolves the host name to the address of the server, and the browser sends a request to that server. The server parses the request and sends HTML, JS, CSS and other files back to the user's browser, which parses them so that the user can see the pictures.
So the web page the user sees is essentially made up of HTML code, and this is the content the crawler fetches. By analyzing and filtering the HTML, it extracts images, text and other resources.
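As a rough sketch of what "analyzing and filtering HTML" can look like, the example below uses the standard library's html.parser module. The ResourceCollector class and the SAMPLE_HTML string are invented for illustration; a real crawler would feed in HTML downloaded from a server.

```python
from html.parser import HTMLParser

class ResourceCollector(HTMLParser):
    """Collect image addresses, link addresses and visible text from HTML."""

    def __init__(self):
        super().__init__()
        self.images = []
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_data(self, data):
        # Keep only non-empty text, ignoring whitespace between tags.
        if data.strip():
            self.text.append(data.strip())

# Made-up page standing in for HTML returned by a server.
SAMPLE_HTML = """
<html><body>
  <h1>Pictures</h1>
  <img src="/img/cat.jpg" alt="cat">
  <a href="/more.html">More pictures</a>
</body></html>
"""

collector = ResourceCollector()
collector.feed(SAMPLE_HTML)
print(collector.images)
print(collector.links)
print(collector.text)
```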
3. The concept of a URI, with examples
In simple terms, a URL is a string such as http://www.baidu.com that is entered in the browser.
Before looking at URLs, first understand the concept of a URI. Each resource available on the Web, such as an HTML document, image, video clip or program, is located by a Uniform Resource Identifier (URI).
A URI is usually made up of three parts:
① the naming mechanism for accessing the resource;
② the name of the host that stores the resource;
③ the name of the resource itself, represented by a path.
Take the following URI:
http://www.why.com.cn/myhtml/html1223/
We can read it this way:
① it is a resource that can be accessed through the HTTP protocol;
② it is located on the host www.why.com.cn;
③ it is accessed by the path "/myhtml/html1223/".
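The same three-part reading can be reproduced in Python. This is a small sketch using urlsplit from the standard urllib.parse module; the variable name parts is arbitrary.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.why.com.cn/myhtml/html1223/")

print(parts.scheme)    # ① the naming mechanism (protocol)
print(parts.hostname)  # ② the host that stores the resource
print(parts.path)      # ③ the path to the resource itself
```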
4. Understanding URLs, with examples
A URL is a subset of URI. The name is short for Uniform Resource Locator. Put plainly, a URL is a string that describes an information resource on the Internet, and it is used mainly by WWW client and server programs. URLs describe many kinds of information resources in a uniform format, including files, server addresses and directories.
The general format of a URL is (parts in square brackets [] are optional):
protocol://hostname[:port]/path/[;parameters][?query]#fragment
A URL is made up of three parts:
① the first part is the protocol (or service type);
② the second part is the IP address or host name of the host where the resource is stored (sometimes with a port number);
③ the third part is the specific address of the resource on the host, such as a directory and file name.
The first and second parts are separated by "://".
The second and third parts are separated by "/".
The first and second parts are required; the third part can sometimes be omitted.
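To see every piece of the general format at once, the sketch below runs urlparse from urllib.parse on a made-up address. The host www.example.com, the port 8080 and the query values are assumptions chosen only so that each optional component is present.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical URL that uses every component of the general format.
url = "http://www.example.com:8080/docs/index.html;type=a?lang=en&page=2#intro"
parsed = urlparse(url)

print(parsed.scheme)    # protocol
print(parsed.hostname)  # host name
print(parsed.port)      # optional port, as an integer
print(parsed.path)      # path on the host
print(parsed.params)    # optional parameters after ";"
print(parsed.query)     # optional query after "?"
print(parsed.fragment)  # fragment after "#"

# The query string can be broken down further into names and values.
query_values = parse_qs(parsed.query)
print(query_values)
```

Note that urlparse is used here rather than urlsplit because only urlparse separates the ";parameters" part from the path.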
5. A simple comparison of URLs and URIs
A URI is a more general abstraction than a URL: it is the standard for the identifier string itself. In other words, URI is the parent class and URL is a subclass of URI.
URI stands for Uniform Resource Identifier;
URL stands for Uniform Resource Locator.
The difference between the two is that a URI identifies a resource, for example by the path requested on the server, while a URL also says how to access the resource (for example http://).
Here are small examples of two kinds of URL.
(1) Example of an HTTP URL
Example: http://www.peopledaily.com.cn/channel/welcome.htm
① it uses the Hypertext Transfer Protocol (HTTP) and refers to a hypertext resource;
② the host's domain name is www.peopledaily.com.cn;
③ the hypertext file (file type .htm) is welcome.htm in the directory /channel.
(2) Example of a file URL
When a file is represented by a URL, the scheme is written as file, followed by the host address, the access path (that is, the directory) and the file name. The directory and file name can sometimes be omitted, but the "/" symbol cannot.
Example: file://ftp.yoyodyne.com/pub/files/foobar.txt
① this is a file;
② it is stored in the directory /pub/files/ on the host ftp.yoyodyne.com, and the file name is foobar.txt.
Example: file://ftp.yoyodyne.com/pub
This represents the directory /pub on the host ftp.yoyodyne.com.
Example: file://ftp.yoyodyne.com/
This represents the root directory of the host ftp.yoyodyne.com.
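The breakdown into scheme, host, directory and file name can be approximated in code. The describe function below is a hypothetical helper, not a standard API: it combines urlsplit with posixpath.split, which simply cuts the path at its last "/".

```python
import posixpath
from urllib.parse import urlsplit

def describe(url):
    """Return (scheme, host, directory, file name) for a URL."""
    parts = urlsplit(url)
    # Everything before the last "/" is treated as the directory,
    # everything after it as the file name (possibly empty).
    directory, filename = posixpath.split(parts.path)
    return parts.scheme, parts.hostname, directory, filename

http_info = describe("http://www.peopledaily.com.cn/channel/welcome.htm")
file_info = describe("file://ftp.yoyodyne.com/pub/files/foobar.txt")
root_info = describe("file://ftp.yoyodyne.com/")

print(http_info)
print(file_info)
print(root_info)
```

One caveat with this approach: a URL that ends without a trailing "/", such as file://ftp.yoyodyne.com/pub, does not say by itself whether "pub" is a directory or a file, so this helper would report it as a file name.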
The main object a crawler works with is the URL: it fetches the required file content from the URL address and then processes it further.
Therefore, an accurate understanding of URLs is essential to understanding web crawlers.
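A minimal sketch of "fetching content by URL", assuming Python 3 and the standard urllib.request module. The fetch helper is hypothetical, and the data: URL is only a stand-in so that the example runs without a network connection; in a real crawler the argument would be an http:// or https:// address.

```python
from urllib.parse import quote
from urllib.request import urlopen

def fetch(url, timeout=10):
    """Return the decoded body of the resource at `url`."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the response does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)

# Stand-in data: URL that carries its own HTML, so no server is contacted.
sample_url = "data:text/html," + quote("<h1>Hello, crawler</h1>")
html = fetch(sample_url)
print(html)
```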