I happened to come across a Zhihu topic about crawlers: "What cool, interesting, and useful things can you do with crawler technology?" Out of strong curiosity, and because writing a crawler seemed like an impressive thing to do, I became interested in crawlers.
There is not much to say about the definition of a web crawler. If you are unfamiliar with the term, see the web crawler entries on Baidu Encyclopedia and Wikipedia.
Web crawlers can be written in many programming languages, each with its own advantages and disadvantages. I chose Python because it is well suited to the task: a crawler usually takes much less code in Python than in many other languages, its modules for network programming are well encapsulated, and its language features make it approachable for many programmers. I picked up Python in order to learn crawling, and combined the two as I kept studying until I could implement a crawler. The version I studied and used is Python 3.
Learning web crawlers requires some basic knowledge:
- HTML: to understand how a web page is structured, so that content can be extracted from it easily.
- HTTP: to understand how requests are made and how URLs are composed, so that URLs can be parsed.
- Python: to write the programs that implement the crawler.
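To make the first two points concrete, here is a minimal sketch using only the Python 3 standard library. It splits a URL into its parts and pulls the links out of a small piece of HTML. The search URL, the `LinkCollector` class, and the sample HTML are all made up for illustration and are not part of the original program.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Break a URL into its components (the path and query here are invented examples).
parts = urlparse('http://www.jianshu.com/search?q=python')
print(parts.scheme)  # http
print(parts.netloc)  # www.jianshu.com
print(parts.path)    # /search
print(parts.query)   # q=python


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))


# A made-up HTML fragment standing in for a downloaded page.
sample_html = ('<html><body><a href="/p/1">First</a> '
               '<a href="http://example.com/">Other</a></body></html>')
collector = LinkCollector('http://www.jianshu.com/')
collector.feed(sample_html)
print(collector.links)  # ['http://www.jianshu.com/p/1', 'http://example.com/']
```

Knowing that links live in the `href` attribute of `<a>` tags is the kind of HTML knowledge the list refers to, and resolving `/p/1` against the page address is where understanding URL structure pays off.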
The first crawler I learned to write fetches the source code of a web page. Do not dismiss this as a trivial little program: it is the foundation of every crawler and is essential. Below is the code as I understood and implemented it. If anything is wrong, please point it out so that I can learn and improve.
    # -*- coding: utf-8 -*-
    import requests  # import the requests HTTP library

    url = 'http://www.jianshu.com/'  # URL of the page to fetch (Jianshu home page)
    response = requests.get(url)     # send a GET request; returns a Response object
    content = response.text          # body of the response decoded as text (the page source)
    print(content)                   # print the source to the console
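In practice a request can fail or hang, so it may be worth adding a timeout and some error handling. The following is only a sketch of that idea, written with the standard library's `urllib.request` rather than `requests` so that it runs without any extra installation. The function name `fetch_source`, the 10-second timeout, and the `User-Agent` value are assumptions chosen for illustration.

```python
import urllib.error
import urllib.request


def fetch_source(url, timeout=10):
    """Hypothetical helper: return the page source as text, or None on failure."""
    # Some sites reject requests without a browser-like User-Agent;
    # the value used here is only an example.
    request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Use the charset declared by the server, falling back to UTF-8.
            charset = response.headers.get_content_charset() or 'utf-8'
            return response.read().decode(charset, errors='replace')
    except (urllib.error.URLError, OSError) as error:
        # Covers HTTP error statuses, connection failures, and timeouts.
        print('Request failed:', error)
        return None
```

Calling `fetch_source('http://www.jianshu.com/')` would then return the page source as a string, or `None` if the request failed, so the caller can decide whether to retry or skip the page.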
Python crawler learning: getting the web page source