Writing a Python Crawler from Scratch: Definition of a Crawler and URL Composition

Source: Internet
Author: User

I. Definition of a web crawler

"Web crawler", also called a web spider, is a vivid name.
If the Internet is likened to a spider's web, then a spider is a program that crawls back and forth across that web.
A web spider finds web pages by following their link addresses.
Starting from one page of a website (usually the homepage), it reads the page's content, finds the other link addresses in that page,
and then uses those link addresses to find the next pages, looping in this way until every page of the site has been crawled.
If the entire Internet is treated as one website, a web spider can use the same principle to crawl every page on the Internet.
In short, a web crawler is a program that crawls web pages.
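
The loop described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the pages live in a hypothetical in-memory dictionary (FAKE_WEB) rather than on the real Internet, and the names LinkCollector and crawl are made up for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web": URL -> HTML source.
# A real crawler would download each page over the network instead.
FAKE_WEB = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/b.html">B</a> <a href="/c.html">C</a>',
    "http://example.com/b.html": '<a href="/">home</a>',
    "http://example.com/c.html": "no links here",
}


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Visit pages breadth-first, following links until none are new."""
    seen = {start_url}          # URLs already queued, to avoid loops
    queue = deque([start_url])  # URLs waiting to be read
    order = []                  # URLs in the order they were crawled
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:        # page does not exist in our fake web
            continue
        order.append(url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order


visited = crawl("http://example.com/", FAKE_WEB.get)
print(visited)
```

The seen set is what stops the loop: page b.html links back to the homepage, but the homepage is not queued a second time.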
The basic operation of a web crawler is to fetch web pages.
So how do you get exactly the page you want?
Let's start with the URL.

II. The process of browsing the web

Fetching a web page works the same way as a reader browsing the web with a browser such as IE.
For example, you type the address www.baidu.com into the browser's address bar.
Opening the page really means that the browser, acting as the "client", sends a request to the server, "fetches" the server's files to the local machine, and then interprets and displays them.
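
To make the request and response concrete, here is a rough sketch of the text a client might send and how the server's reply splits into a status line, headers, and body. No network is used: CANNED_RESPONSE is a hypothetical reply written for this example, not what www.baidu.com actually returns, and the helper names are invented here.

```python
def build_get_request(host, path="/"):
    """Return the text of a minimal HTTP/1.1 GET request."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )


def parse_response(raw):
    """Split a raw HTTP response into status code, headers, and body."""
    head, _, body = raw.partition("\r\n\r\n")   # blank line ends the headers
    status_line, *header_lines = head.split("\r\n")
    _version, status_code, _reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(status_code), headers, body


# Hypothetical server reply, used instead of a real network connection.
CANNED_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html><body>Hello</body></html>"
)

request_text = build_get_request("www.baidu.com")
status, headers, body = parse_response(CANNED_RESPONSE)
print(request_text)
print(status, headers, body)
```

In practice a library such as urllib.request handles these details; the sketch only shows what travels in each direction.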
HTML is a markup language: it uses tags to mark up content so that the content can be parsed and distinguished.
The browser's job is to parse the HTML it receives and turn that source code into the page we actually see.
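
A crawler does a simplified version of the same parsing. The sketch below uses the standard library's html.parser to separate the title from the visible text; SAMPLE_HTML and the TextExtractor class are assumptions made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Separate the <title> text from the rest of the visible text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Tags tell us what kind of content each piece of text is.
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text_parts.append(data.strip())


# Hypothetical page source, standing in for HTML received from a server.
SAMPLE_HTML = (
    "<html><head><title>Demo page</title></head>"
    "<body><h1>Hello</h1><p>Some <b>bold</b> text.</p></body></html>"
)

parser = TextExtractor()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.title)
print(" ".join(parser.text_parts))
```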

III. Concepts and examples of URIs and URLs

Simply put, a URL is the string, such as http://www.baidu.com, that you type into the browser.
Before understanding URLs, you first need to understand the concept of a URI.
What is a URI?
Every resource available on the Web, such as an HTML document, image, video clip, or program, is identified by a Uniform Resource Identifier (URI).
A URI is usually made up of three parts:
① the naming mechanism (scheme) used to access the resource;
② the name of the host that stores the resource;
③ the name of the resource itself, expressed as a path.
Take the following URI:
http://www.why.com.cn/myhtml/html1223/
We can read it this way:
① it is a resource that can be accessed through the HTTP protocol,
② it is located on the host www.why.com.cn,
③ it is reached through the path "/myhtml/html1223/".
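
The same three parts can be read off the example URI with Python's standard urllib.parse module. This is a small sketch; the variable names are chosen for this example.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.why.com.cn/myhtml/html1223/")

scheme = parts.scheme   # ① the naming mechanism used to access the resource
host = parts.hostname   # ② the host that stores the resource
path = parts.path       # ③ the resource itself, expressed as a path

print(scheme, host, path)
```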

IV. Understanding URLs, with examples

A URL is a kind of URI. URL stands for Uniform Resource Locator.
In plain terms, a URL is a string that describes an information resource on the Internet; it is used mainly by WWW client and server programs.
A URL describes many kinds of information resources in one uniform format, including files, server addresses, and directories.
The general format of a URL is (parts in square brackets [] are optional):
protocol://hostname[:port]/path/[;parameters][?query]#fragment
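
As a sketch of how this format maps onto code, urllib.parse.urlparse splits a URL into exactly these components. The URL below is hypothetical and was built only so that every optional part is present.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical URL containing every component of the general format.
url = "http://www.example.com:8080/docs/index.html;type=a?lang=en&page=2#intro"
p = urlparse(url)

print(p.scheme)    # protocol
print(p.hostname)  # hostname
print(p.port)      # port (an integer, or None when absent)
print(p.path)      # path
print(p.params)    # parameters (after ';')
print(p.query)     # query (after '?')
print(p.fragment)  # fragment (after '#')

# The query string can be decoded further into a dictionary of lists.
query_fields = parse_qs(p.query)
print(query_fields)
```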

A URL consists of three parts:
① The first part is the protocol (also called the service method).
② The second part is the IP address or host name of the machine that holds the resource (sometimes followed by a port number).
③ The third part is the specific address of the resource on that host, such as a directory and file name.
The first and second parts are separated by "://",
and the second and third parts are separated by "/".
The first and second parts are required; the third part can sometimes be omitted.
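
Going the other way, the separator rules above are enough to assemble a URL from its parts. The build_url helper and the file name index.html are hypothetical, introduced only for this sketch.

```python
from urllib.parse import urlsplit


def build_url(protocol, host, path=""):
    """Join the parts: protocol + '://' + host, then '/' + path if given."""
    url = f"{protocol}://{host}"
    if path:
        url += "/" + path.lstrip("/")
    return url


full = build_url("http", "www.baidu.com:80", "index.html")
bare = build_url("http", "www.baidu.com")  # third part omitted

print(full)
print(bare)
print(urlsplit(full).path)
```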

V. A simple comparison of URLs and URIs

A URI is a more general, higher-level abstraction than a URL; it is a standard for string literals that identify resources.
In other words, URI is the parent class and URL is a subclass of it: every URL is a URI.
URI stands for Uniform Resource Identifier;
URL stands for Uniform Resource Locator.
The difference is that a URI only identifies a resource, for example by the path requested on a server.
A URL additionally states how to access the resource (for example, http://).
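
One way to see the difference is to compare a URL with a URI that only names something. The URN below (an ISBN-style identifier) is an assumption added for illustration and does not come from the original article; the describe helper is likewise hypothetical.

```python
from urllib.parse import urlsplit


def describe(uri):
    """Return the scheme, network location, and path of a URI."""
    parts = urlsplit(uri)
    return {"scheme": parts.scheme, "netloc": parts.netloc, "path": parts.path}


url_info = describe("http://www.baidu.com/")
urn_info = describe("urn:isbn:0451450523")  # identifies a book, not a location

print(url_info)
print(urn_info)
```

Both strings are valid URIs, but only the first names a host to contact, which is what lets a client actually retrieve the resource.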

Let's look at two small examples of URLs.

1. Example URLs for the HTTP protocol
These use the Hypertext Transfer Protocol (HTTP) and point to resources that provide hypertext information services.
Example: http://www.peopledaily.com.cn/channel/welcome.htm
The computer's domain name is www.peopledaily.com.cn.
The hypertext file (file type .html) is welcome.htm in the directory /channel.
This is a machine belonging to China's People's Daily.
Example: http://www.rol.cn.net/talk/talk1.htm
The computer's domain name is www.rol.cn.net.
The hypertext file (file type .html) is talk1.htm in the directory /talk.
This is the address of the rol.cn.net chat room; it takes you to room 1 of that chat room.
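
The domain name, directory, and file name in these two examples can be pulled apart programmatically. The sketch below assumes the last path segment is the file name; split_http_url is a name invented for this example.

```python
import posixpath
from urllib.parse import urlsplit


def split_http_url(url):
    """Return (domain name, directory, file name) for an HTTP URL."""
    parts = urlsplit(url)
    # URL paths always use '/', so posixpath is the right tool on any OS.
    directory, filename = posixpath.split(parts.path)
    return parts.hostname, directory, filename


people = split_http_url("http://www.peopledaily.com.cn/channel/welcome.htm")
chat = split_http_url("http://www.rol.cn.net/talk/talk1.htm")

print(people)
print(chat)
```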

2. URLs for files
When a URL represents a file, the service method is written as file, followed by the host IP address, the file's access path (that is, the directory), and the file name.
Sometimes the directory and file name can be omitted, but the "/" symbol cannot.
Example: file://ftp.yoyodyne.com/pub/files/foobar.txt
This URL represents a file named foobar.txt stored in the directory /pub/files/ on the host ftp.yoyodyne.com.
Example: file://ftp.yoyodyne.com/pub
This represents the directory /pub on the host ftp.yoyodyne.com.
Example: file://ftp.yoyodyne.com/
This represents the root directory of the host ftp.yoyodyne.com.
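
The three file URLs can be checked the same way. This sketch only parses the strings and does not try to open anything on ftp.yoyodyne.com; the last line shows that urlsplit normalizes the scheme to lowercase, so File:// and file:// parse identically.

```python
from urllib.parse import urlsplit

file_urls = [
    "file://ftp.yoyodyne.com/pub/files/foobar.txt",
    "file://ftp.yoyodyne.com/pub",
    "file://ftp.yoyodyne.com/",
]

# Collect (scheme, host, path) for each example.
parsed = []
for url in file_urls:
    parts = urlsplit(url)
    parsed.append((parts.scheme, parts.hostname, parts.path))

for row in parsed:
    print(row)

# The scheme is returned in lowercase regardless of how it was written.
capitalized_scheme = urlsplit("File://ftp.yoyodyne.com/pub").scheme
print(capitalized_scheme)
```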

The URL is the most important object a crawler works with: the crawler fetches the required file content according to the URL and then processes it further.
Therefore, an accurate understanding of URLs is essential to understanding web crawlers.

That covers the basics for now; next we'll move on to some actual crawling.
