Learning Python Crawlers: Fetching a Web Page's Source

Source: Internet
Author: User

I happened to come across a question on Zhihu: "What cool, interesting, or useful things can you do with crawler technology?" Out of strong curiosity, and because writing a crawler seemed like an impressive thing to do, I became interested in crawlers.

I won't spend much time on the definition of a web crawler. If you are not familiar with the term, see the "web crawler" entries on Baidu Encyclopedia or Wikipedia.

Web crawlers can be written in many programming languages, each with its own advantages and disadvantages. I chose Python because it is well suited to the task: a crawler in Python usually needs much less code than in other languages, its modules for network programming and similar tasks are well encapsulated, and its language features make it easy for many programmers to pick up. I started learning Python in order to learn crawling, and as I kept studying I applied what I learned to writing crawlers. The version I studied and used is Python 3.

Learning to write web crawlers requires some basic knowledge:

      1. HTML: to understand how a web page is structured, so that content can be extracted from it.
      2. The HTTP protocol: to understand how URLs are composed, so that they can be parsed.
      3. Python: to write the programs that implement the crawler.
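To make the first two points concrete, here is a minimal sketch using only the Python 3 standard library. It pulls the links out of a small piece of HTML and splits each resulting URL into its parts. The class name LinkCollector, the sample HTML, and the www.example.com address are made up for illustration and are not from the original article.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # urljoin turns a relative link into an absolute URL
                self.links.append(urljoin(self.base_url, value))


# Hypothetical page content and address, used only for illustration.
SAMPLE_HTML = """
<html><body>
  <a href="/p/123">First post</a>
  <a href="http://www.example.com/about?lang=en">About</a>
  <p>No link here</p>
</body></html>
"""

collector = LinkCollector("http://www.example.com/index.html")
collector.feed(SAMPLE_HTML)

for link in collector.links:
    parts = urlsplit(link)
    # scheme, host, path and query string are the pieces a crawler usually needs
    print(parts.scheme, parts.netloc, parts.path, parts.query)
```

Knowing that a link lives in the href attribute of an a tag is the HTML part. Knowing that a URL breaks down into a scheme, host, path, and query is the URL part.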

The first crawler I learned to write fetches the source code of a web page. Don't dismiss fetching a page's source as a trivial program: it is the foundation of every crawler and is essential. Below is the code as I understand and implemented it. If anything is wrong, please point it out so that I can learn and improve.

    # -*- coding: utf-8 -*-
    # Set the source file encoding to UTF-8
    import requests  # import the requests HTTP library

    url = 'http://www.jianshu.com/'  # URL of the page to fetch (Jianshu home page)
    response = requests.get(url)  # send a GET request; returns a Response object
    content = response.text  # the response body decoded to text, i.e. the page source
    print(content)  # print the source to the console
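The code above assumes the request always succeeds. As a rough sketch of a more defensive version, the function below adds a timeout and basic error handling. Note that it swaps requests for urllib.request from the standard library, so it runs without installing anything. The function name fetch_source and the User-Agent value are placeholders of mine, not part of the original code.

```python
import urllib.request


def fetch_source(url, timeout=10):
    """Return the decoded source of the page at `url`, or None on failure."""
    # Some sites reject requests without a User-Agent; this value is a placeholder.
    request = urllib.request.Request(
        url, headers={"User-Agent": "learning-crawler/0.1"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            raw = response.read()
            # Use the charset declared in the response headers, else fall back to UTF-8.
            charset = response.headers.get_content_charset() or "utf-8"
    except OSError as error:
        # URLError, HTTPError (4xx/5xx responses) and socket timeouts
        # are all subclasses of OSError.
        print(f"Failed to fetch {url}: {error}")
        return None
    # Replace undecodable bytes instead of raising, so the caller always gets text.
    return raw.decode(charset, errors="replace")
```

Calling fetch_source('http://www.jianshu.com/') should then return the page source as a string, or None if the request fails. If you stay with requests, the same ideas apply: pass timeout=10 to requests.get() and call response.raise_for_status() before reading response.text.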
