Python: implementing a simple crawler that extracts the links on a page
Besides C/C++, I am also familiar with many popular languages such as PHP, Java, JavaScript, and Python. Of these, Python is arguably the most convenient to work with and has the fewest drawbacks.
A few days ago I wanted to write a crawler. After discussing it with friends, we decided to write one together over the next few days. The most important part of a crawler is extracting the links on a page, so here I implement just that part, simply.
First, we need an open-source module, requests. This is not a built-in Python module; you need to download it, unzip it, and install it:
The code is as follows:
$ curl -OL https://github.com/kennethreitz/requests/zipball/master
$ python setup.py install
Windows users can download the same archive from https://github.com/kennethreitz/requests/zipball/master, unzip it, and run python setup.py install locally to install it.
I am also translating this module's documentation; once the translation is finished I will post it (the English original is in the attachment). As its description says, it is "built for human beings": it is designed for humans, easy to use, and you can read the documentation yourself. At its simplest, requests.get() sends a GET request.
The code is as follows:
# coding: utf-8
import re
import requests

# retrieve the web page content
r = requests.get('http://www.163.com')
data = r.text

# use a regular expression to find all the links
link_list = re.findall(r"(?<=href=\").+?(?=\")|(?<=href=').+?(?=')", data)
for url in link_list:
    print(url)
First, import the re and requests modules; re is Python's regular-expression module.
r = requests.get('http://www.163.com') sends a GET request to the NetEase homepage and returns a Response object r; r.text is the source code of the fetched page, which we save in the string data.
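For readers who have not installed requests yet, the same fetch-then-read flow can be sketched with only the standard library. The page below is served locally so the example is self-contained; its content and the server setup are made up for the demo and are not from the article:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# a made-up one-line page for the demo
PAGE = b'<a href="http://www.163.com">netease</a>'

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # answer every GET request with the demo page
        self.send_response(200)
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, *args):
        # keep the demo quiet
        pass

# port 0 asks the OS for a free port
server = HTTPServer(('127.0.0.1', 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# the equivalent of requests.get(url).text with the standard library
url = 'http://127.0.0.1:%d/' % server.server_port
data = urllib.request.urlopen(url).read().decode('utf-8')
print(data)  # <a href="http://www.163.com">netease</a>
server.shutdown()
```

In real use, requests wraps exactly this kind of request/response exchange behind a much friendlier interface.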
Then we use the regular expression to search for all links in data. My regular expression is rough: it simply grabs whatever appears between href="..." or href='...', which is exactly the link information we want.
re.findall returns a list; we use a for loop to traverse the list and print each entry:
These are some of the links I obtained.
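To make the regex's behaviour concrete, here is the same pattern applied to a small HTML snippet; the snippet and its URLs are invented for the demo:

```python
import re

# a made-up snippet with one double-quoted and one single-quoted href
data = '<a href="http://www.163.com">news</a> <a href=\'/sports/\'>sports</a>'

# same pattern as in the article: a lookbehind for href=" or href=',
# then a lazy match up to the closing quote
link_list = re.findall(r"(?<=href=\").+?(?=\")|(?<=href=').+?(?=')", data)
print(link_list)  # ['http://www.163.com', '/sports/']
```

Because the pattern has no capturing groups, findall returns the full matched text of each alternative, i.e. just the URL between the quotes.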
The above is a simple implementation of getting all the links on a site: it does no exception handling and does not consider the type of hyperlink, so the code is for reference only. For more information about the requests module, see the attachment.
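As one sketch of how those rough edges might be addressed, the standard library's html.parser can extract href attributes from actual tags instead of raw text, which copes with markup variations (such as tag case) that the regex might miss. The class name and the snippet fed to it are made up for the demo:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lowercased, so <A HREF=...> works too
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

parser = LinkCollector()
parser.feed('<A HREF="http://www.163.com">news</A><a href=\'/mail/\'>mail</a>')
print(parser.links)  # ['http://www.163.com', '/mail/']
```

A real crawler would also wrap the request in a try/except and normalize relative links before following them.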