Python Web Crawler (1): Basic Concepts

Source: Internet
Author: User

Definition of a web crawler

A web crawler, also known as a web spider, web robot, or web chaser, is a program or script that automatically fetches information from the World Wide Web according to certain rules. Less common names include ant, automatic indexer, emulator, and worm.

If the Internet is likened to a spider's web, then a spider is a program that crawls around on that web.

A web spider finds web pages through their URLs. Starting from one page of a site (usually the home page), it reads the page's content, finds the other link addresses in it, and then follows those links to the next pages, looping in this way until every page of the site has been crawled. If the entire Internet is treated as one site, a web spider can use the same principle to crawl every page on the Internet. In this sense, a web crawler is simply a program that crawls web pages.
Simply put, the basic task of a web crawler is to fetch web content.
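The loop described above can be sketched in a few lines. This is a minimal illustration, not a production crawler: the names `LinkCollector`, `crawl`, and `FAKE_SITE` are made up for this example, the `fetch` argument is an assumed caller-supplied function that returns a page's HTML, and a small in-memory dictionary stands in for a real website so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl of a single site, starting from start_url.

    fetch is an assumed helper: it takes a URL and returns HTML text.
    """
    site = urlsplit(start_url).netloc
    seen = {start_url}            # URLs already queued, to avoid loops
    queue = deque([start_url])
    visited = []

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        visited.append(url)

        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            link = urljoin(url, href)   # resolve relative links
            # Stay on the same host and skip pages already queued.
            if urlsplit(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical three-page site used in place of real HTTP requests.
FAKE_SITE = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/b.html">B</a> <a href="http://other.example/">out</a>',
    "http://example.com/b.html": '<a href="/">home</a>',
}

pages = crawl("http://example.com/", lambda url: FAKE_SITE.get(url, ""))
print(pages)
```

The `seen` set is what makes the loop terminate: pages link back to each other, so each URL is queued at most once. The `max_pages` limit is an extra safeguard chosen for this sketch.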
Crawling a web page works on the same principle as a reader browsing the web with Internet Explorer.

Suppose you enter the address www.baidu.com in the browser's address bar. Opening the page is really a matter of the browser, acting as the "client", sending a request to the server, "fetching" the server-side files to the local machine, and then interpreting and displaying them.
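A program can play the same "client" role as the browser. The sketch below uses Python's standard `urllib.request` module; the function name `fetch`, the `User-Agent` string, and the timeout value are choices made for this example only.

```python
import urllib.request


def fetch(url, timeout=10):
    """Send a GET request to url and return the response body as text."""
    request = urllib.request.Request(
        url,
        headers={"User-Agent": "demo-crawler/0.1"},  # illustrative value
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch("http://www.baidu.com")` would return the HTML source of that page as a string, assuming the machine has network access. Unlike a browser, the program stops at the "fetch" step and does not render anything.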

HTML is a markup language: it uses tags to mark up content so that the content can be parsed and its parts distinguished.

The browser's job is to parse the HTML code it receives and turn that source code into the web page we see.
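To illustrate what "parsing HTML" means, the sketch below feeds a small HTML string to the standard-library `html.parser.HTMLParser` and records the tags it encounters, plus the text inside `<title>`. The class name `OutlineParser` and the sample page are invented for this example; a real browser does far more, such as building a document tree and applying styles.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Records every start tag and the text found inside <title>."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text between tags arrives here; keep only the title text.
        if self._in_title:
            self.title += data


# Hypothetical page source used as input.
SAMPLE_HTML = (
    "<html><head><title>Demo page</title></head>"
    "<body><h1>Hello</h1><p>Some text.</p></body></html>"
)

parser = OutlineParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.title)
print(parser.tags)
```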
Before understanding URLs, it helps to first understand the concept of a URI.
Every resource available on the Web, such as an HTML document, image, video clip, or program, is located by a Uniform Resource Identifier (URI). A URI usually consists of three parts:
         ① the naming scheme used to access the resource,
         ② the name of the host where the resource is stored,
         ③ the name of the resource itself, expressed as a path.
Take the URI http://www.baidu.com.cn/myhtml/html1223/ as an example. It can be read as follows:
       ① it is a resource that can be accessed via the HTTP protocol,
       ② it is located on the host www.baidu.com.cn,
       ③ it is reached via the path "/myhtml/html1223/".
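The same three parts can be extracted programmatically. The short sketch below uses `urllib.parse.urlsplit` from the Python standard library on the example URI above; the variable name `parts` is arbitrary.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.baidu.com.cn/myhtml/html1223/")

print(parts.scheme)  # access scheme:  http
print(parts.netloc)  # host name:      www.baidu.com.cn
print(parts.path)    # resource path:  /myhtml/html1223/
```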


The concept of URLs

A URL is a special kind of URI, so URLs form a subset of URIs. URL stands for Uniform Resource Locator. Put simply, a URL is a string that describes an information resource on the Internet; it is used mainly by WWW client and server programs. A URL describes all kinds of information resources, including files, server addresses, and folders, in one uniform format.

URL examples

1. URLs using the HTTP protocol
These use the Hypertext Transfer Protocol (HTTP) and identify resources that provide hypertext information services.

Example: http://www.peopledaily.com.cn/channel/welcome.htm
The domain name of the computer is www.peopledaily.com.cn.
The hypertext file (an HTML file) is welcome.htm in the folder /channel.
This is a computer belonging to the People's Daily in China.


Example: http://www.rol.cn.net/talk/talk1.htm
The domain name of the computer is www.rol.cn.net.
The hypertext file (an HTML file) is talk1.htm in the folder /talk.
This is the address of a chat room; it leads to room 1 of the chat room.


2. URLs for files
When a file is identified by a URL, the access scheme is written as file, followed by the host IP address, the access path (that is, the folder), and the file name.
The folder and file name can sometimes be omitted, but the "/" symbol cannot.


Example: file://ftp.yoyodyne.com/pub/files/foobar.txt
This URL identifies a file named foobar.txt stored in the folder pub/files/ on the host ftp.yoyodyne.com.


Example: file://ftp.yoyodyne.com/pub
This identifies the folder /pub on the host ftp.yoyodyne.com.


Example: file://ftp.yoyodyne.com/
This identifies the root folder of the host ftp.yoyodyne.com.
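As a rough illustration, the three file URLs above can be broken down with the Python standard library. The helper name `describe_file_url` is invented for this sketch; it splits the URL with `urllib.parse.urlsplit` and then separates the last path segment with `posixpath.split`.

```python
import posixpath
from urllib.parse import urlsplit


def describe_file_url(url):
    """Return (scheme, host, parent path, last path segment) for a URL."""
    parts = urlsplit(url)
    parent, last = posixpath.split(parts.path)
    return parts.scheme, parts.netloc, parent, last


for url in (
    "file://ftp.yoyodyne.com/pub/files/foobar.txt",
    "file://ftp.yoyodyne.com/pub",
    "file://ftp.yoyodyne.com/",
):
    print(describe_file_url(url))
```

Note that the URL text alone does not say whether the last segment (for example `pub`) is a folder or a file; the helper only reports the segment, and the interpretation comes from the server.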
