Writing a web crawler in Python (I): what crawling a web page means and the basic structure of a URL

Source: Internet
Author: User

I. The definition of a web crawler

"Web crawler", or "web spider", is a very vivid name.

If the Internet is likened to a spider's web, then the spider is the program crawling back and forth across it.

Web spiders find web pages through their link addresses.

Starting from one page of a site (usually the homepage), the spider reads the content of the page, finds the other link addresses in it, then uses those addresses to find the next pages, and keeps looping until all of the site's pages have been crawled.
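The loop described above can be sketched in a few lines of Python. This is only a minimal illustration, not a production crawler: the names `LinkExtractor`, `crawl`, and `SITE` are made up for this example, and a small in-memory dictionary of pages under example.com stands in for the real network so the sketch runs without any connection.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Visit every page reachable from start_url; return URLs in visit order.

    `fetch` is a caller-supplied function that maps a URL to its HTML text,
    or returns None when the page cannot be retrieved.
    """
    seen = {start_url}            # URLs already queued, to avoid loops
    queue = deque([start_url])    # URLs waiting to be read
    visited = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# A tiny in-memory "site" (hypothetical pages) stands in for the network.
SITE = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.com/b": "<p>no links here</p>",
}

print(crawl("http://example.com/", SITE.get))
```

The `seen` set is what stops the loop: without it, pages that link back to each other would be fetched forever. A real crawler would replace `SITE.get` with a function that downloads the page over HTTP.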

If the entire Internet is treated as one website, a web spider can use this principle to crawl every page on the Internet.

In this sense, a web crawler is simply a program that crawls web pages.

The basic operation of a web crawler is fetching web pages.

So how do you get the page you want?

Let's start with the URL first.

II. The process of browsing the web

Crawling a web page is actually the same process as a reader browsing that page with a browser such as IE.

For example, you enter the address www.baidu.com in the browser's address bar.

When the page opens, the browser acts as the "client": it sends a request to the server, "fetches" the server's files to the local machine, and then interprets and displays them.
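A program can play the same "client" role with Python's standard `urllib.request` module. The following is a minimal sketch under a few assumptions: the helper names `build_request` and `fetch` and the User-Agent string are invented for illustration, and the page is decoded as UTF-8 when the server does not declare a charset.

```python
from urllib.request import Request, urlopen


def build_request(url):
    """Describe a GET request for url, much as a browser would send one."""
    # The User-Agent value is an arbitrary illustrative string.
    return Request(url, headers={"User-Agent": "example-crawler/0.1"})


def fetch(url, timeout=10):
    """Send the request to the server and return the response body as text."""
    with urlopen(build_request(url), timeout=timeout) as response:
        # Use the charset the server declares; fall back to UTF-8 (assumption).
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

With network access, calling `fetch("http://www.baidu.com")` should return the HTML source of the page as a string, which is the same raw material the browser receives before it renders anything.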

HTML is a markup language: it uses tags to mark up content so that the content can be parsed and told apart.

The browser's job is to parse the HTML code it receives and turn that raw code into the page we actually see.
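To see how tags let a program tell one piece of content from another, here is a small sketch using the standard library's `html.parser`. The class name `OutlineParser`, the choice of tags to keep, and the sample `PAGE` are all assumptions made for this example; a browser does far more than this.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Record (tag, text) pairs for the title, top-level headings, and paragraphs."""

    WANTED = {"title", "h1", "p"}

    def __init__(self):
        super().__init__()
        self.current = None   # the wanted tag we are currently inside, if any
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag in self.WANTED:
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        text = data.strip()
        if self.current and text:
            self.items.append((self.current, text))


PAGE = """<html>
  <head><title>Demo page</title></head>
  <body>
    <h1>Hello</h1>
    <p>Tags tell the parser what each piece of text is.</p>
  </body>
</html>"""

parser = OutlineParser()
parser.feed(PAGE)
print(parser.items)
# [('title', 'Demo page'), ('h1', 'Hello'),
#  ('p', 'Tags tell the parser what each piece of text is.')]
```

The text itself carries no structure; it is the surrounding tags that let the parser label each string as a title, a heading, or a paragraph.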

III. Concepts and examples of URIs

Simply put, a URL is the string, such as www.baidu.com, that you enter in the browser's address bar.

Before looking at URLs, you first need to understand the concept of a URI.

What is a URI?

Every resource available on the Web, such as an HTML document, an image, a video clip, or a program, is located by a Uniform Resource Identifier (URI).

A URI is usually made up of three parts:

① the naming mechanism used to access the resource;

② the name of the host that stores the resource;

③ the name of the resource itself, represented by a path.

Take the following URI:

http://www.why.com.cn/myhtml/html1223/

We can explain it this way:

① it is a resource that can be accessed through the HTTP protocol,

② it is located on the host www.why.com.cn,

③ it is reached through the path "/myhtml/html1223/".
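The same three parts can be pulled out of the example URI with `urlsplit` from Python's standard `urllib.parse` module. This is just a quick check of the breakdown, and the variable name `parts` is arbitrary.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.why.com.cn/myhtml/html1223/")

print(parts.scheme)    # http               (the access mechanism)
print(parts.hostname)  # www.why.com.cn     (the host storing the resource)
print(parts.path)      # /myhtml/html1223/  (the resource itself)
```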
