Basic knowledge of Python web crawlers

Source: Internet
Author: User
Tags: python, web crawler

Friends who know a little Python will be aware that the Python programming language has a very powerful feature: the Python web crawler ( http://www.maiziedu.com/course/python/645-9570/ ). You may have heard of Python crawlers, scrapy, and so on. Today we will go through the basic knowledge of Python crawlers; with a certain understanding of crawlers, learning scrapy and urllib later will be relatively easy.

Crawler:

A web crawler is a program that automatically fetches Web pages; it is an important component of a search engine, downloading pages from the World Wide Web. A traditional crawler starts from the URLs of one or several initial web pages, obtains the URLs on those initial pages, and, while crawling, continuously extracts new URLs from the current page and puts them into a queue until the system's stop condition is satisfied.

The workflow of a focused crawler is more complex: it must filter out links irrelevant to the topic according to a page analysis algorithm, keep the useful links, and put them into the queue of URLs waiting to be crawled. It then selects the next URL to crawl from the queue according to a search strategy, and repeats the process until a stop condition of the system is reached.

In addition, all crawled Web pages are stored by the system and go through analysis, filtering, and indexing so that they can be queried and retrieved later; for a focused crawler, the results of this analysis may also provide feedback and guidance for subsequent crawling. A minimal sketch of such a crawl loop follows.
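The loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a production crawler: the seed URL, the page limit, and the is_relevant topic filter are made-up placeholders, and it uses the requests and Beautiful Soup packages introduced later in this article.

from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def is_relevant(url):
    # Placeholder topic filter for a focused crawler; accept every link here.
    return True

def crawl(seed_url, max_pages=10):
    queue = deque([seed_url])   # URLs waiting to be crawled
    seen = {seed_url}           # URLs already queued, to avoid duplicates
    pages = {}                  # crawled pages, stored for later analysis and indexing

    while queue and len(pages) < max_pages:   # stop condition of the system
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        pages[url] = html

        # Extract new URLs from the current page and put the useful ones into the queue.
        soup = BeautifulSoup(html, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link not in seen and is_relevant(link):
                seen.add(link)
                queue.append(link)
    return pages

pages = crawl("http://example.com")   # example seed URL, a placeholder
print(len(pages), "pages crawled")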

There are three main problems that a focused crawler needs to address:

(1) Description or definition of the crawl target;

(2) Analysis and filtering of web pages or data;

(3) A search strategy for URLs.

The process of crawling Web pages:

The process of crawling a Web page is similar to the way we normally open a page in a browser, except that the browser automatically renders the fetched HTML code into something we can easily read. If you open a page in a browser and press F12, you can view the HTML code of that page.
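As a small illustration, the standard library can fetch the same raw HTML that the browser renders (the URL below is only a placeholder):

from urllib.request import urlopen

# Fetch a page and look at the raw HTML that a browser would render.
with urlopen("http://example.com") as response:   # placeholder URL
    html = response.read().decode("utf-8")

print(html[:200])   # first 200 characters of the HTML source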

What is a URI:

Each resource available on the Web (HTML documents, images, video clips, programs, and so on) is identified by a Uniform Resource Identifier ("URI").

What is a URL:

URL is the abbreviation of Uniform Resource Locator. In layman's terms, a URL is a string used to describe an information resource on the Internet; URLs can describe a variety of information resources, including files, server addresses, and directories, in a uniform format.

The difference between a URI and a URL is that both identify what a resource is, while the URL additionally defines how to get that resource.

A basic URL contains the scheme (or protocol), the server name (or IP address), the path, and the file name, for example "protocol://authorization/path?query". The general URI syntax, with the complete authority section, looks like this:

protocol://username:password@subdomain.domain.tld:port/directory/filename.suffix?parameter=value#flag
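As an illustration, Python's urllib.parse module can split a URL into these components (the URL below is made up):

from urllib.parse import urlparse

# A made-up URL that exercises most parts of the syntax described above.
url = "http://user:password@www.example.com:8080/directory/page.html?key=value#flag"
parts = urlparse(url)

print(parts.scheme)    # 'http'  (protocol)
print(parts.netloc)    # 'user:password@www.example.com:8080'  (authority section)
print(parts.hostname)  # 'www.example.com'
print(parts.port)      # 8080
print(parts.path)      # '/directory/page.html'
print(parts.query)     # 'key=value'
print(parts.fragment)  # 'flag'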

Python crawler-related packages used in the process:

urllib: the urllib module provides the ability to execute various HTTP requests from within a program.

urllib2: urllib2 is a Python module for fetching URLs (Uniform Resource Locators). It provides a very simple interface in the form of the urlopen function, which can fetch URLs using a variety of protocols. It also provides a slightly more complex interface for handling common situations such as basic authentication, cookies, and proxies; these are handled by objects called openers and handlers.
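A minimal sketch of urlopen with a custom header and an opener/handler pair is shown below. Note that in Python 3 the urllib2 functionality lives in urllib.request; the URL and User-Agent string are placeholders.

# In Python 2 these names come from the urllib2 module;
# in Python 3 they live in urllib.request.
from urllib.request import Request, build_opener, HTTPCookieProcessor
from http.cookiejar import CookieJar

url = "http://example.com"   # placeholder URL
req = Request(url, headers={"User-Agent": "my-crawler/0.1"})   # placeholder User-Agent

# An opener built from a handler that keeps cookies across requests.
opener = build_opener(HTTPCookieProcessor(CookieJar()))
with opener.open(req) as response:
    print(response.status, len(response.read()))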

re: the regular expression module. A regular expression specifies a set of matching strings, and the module provides functions that check whether a given string matches a given regular expression.
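For example, a crude way to pull href values out of an HTML snippet with re (the snippet is made up; for real pages a parser such as Beautiful Soup, described next, is more robust):

import re

html = '<a href="http://example.com/a">A</a> <a href="/b">B</a>'   # made-up snippet

# findall returns every substring captured by the group in the pattern.
links = re.findall(r'href="([^"]+)"', html)
print(links)   # ['http://example.com/a', '/b']

# search/match can check whether a string matches a given regular expression.
print(bool(re.search(r'^https?://', "http://example.com")))   # True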

Beautiful Soup: Beautiful Soup provides some simple, Python-style functions for navigating, searching, and modifying the parse tree. It is a toolkit that extracts the data users need by parsing documents; because it is simple, a complete application can be written without much code. Beautiful Soup automatically converts input documents to Unicode and converts output documents to UTF-8, so you do not have to think about encodings unless the document does not specify one, in which case Beautiful Soup cannot detect the encoding automatically and you only need to state the original encoding. Beautiful Soup has become as excellent a Python parser as lxml and html5lib, giving users the flexibility to choose different parsing strategies or to trade flexibility for speed.
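A minimal sketch of parsing a page with Beautiful Soup (the HTML document is made up):

from bs4 import BeautifulSoup

html = """
<html><head><title>Demo page</title></head>
<body>
  <p class="intro">Hello</p>
  <a href="http://example.com/a">first link</a>
  <a href="/b">second link</a>
</body></html>
"""   # made-up document

soup = BeautifulSoup(html, "html.parser")
print(soup.title.string)                      # 'Demo page'
print(soup.find("p", class_="intro").text)    # 'Hello'
for a in soup.find_all("a"):
    print(a.get("href"), a.text)              # navigate and search the parse tree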

requests: requests is a Python HTTP client library, similar to urllib and urllib2. Why do people sometimes use requests instead of urllib2? The official documentation puts it this way: Python's standard library urllib2 provides most of the HTTP functionality you need, but even simple tasks take noticeably more code than they do with requests.
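A minimal sketch of a GET request with requests (the URL, query parameters, and User-Agent are placeholders):

import requests

# A simple GET request with query parameters and a custom header.
response = requests.get(
    "http://example.com/search",             # placeholder URL
    params={"q": "python crawler"},          # placeholder query parameters
    headers={"User-Agent": "my-crawler/0.1"},
    timeout=10,
)
print(response.status_code)   # e.g. 200
print(response.encoding)      # encoding guessed by requests
print(response.text[:200])    # first 200 characters of the response body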

The above is the basic knowledge of Python crawlers. If you have mastered it, you are welcome to read this article: Python web crawler in practice - scrapy (http://www.maiziedu.com/course/python/458-7430/).
