Implementing a simple crawler in Python: extracting the links on a page

Source: Internet
Author: User


In addition to C/C++, I have worked with many popular languages, such as PHP, Java, JavaScript, and Python. Of these, Python is arguably the most convenient to work with and has the fewest drawbacks.

I wanted to write a crawler a few days ago. After discussing it with a friend, we decided to write one together over the next few days. The most important part of a crawler is extracting the links on a page, so here I implement just that part.

First, we need an open-source module, requests. It is not part of the Python standard library, so you need to download, unpack, and install it yourself:

The code is as follows:
$ curl -OL https://github.com/kennethreitz/requests/zipball/master
$ python setup.py install

Windows users can download the package from https://github.com/kennethreitz/requests/zipball/master, unzip it, and run python setup.py install locally to install it.

I am also translating this module's documentation; once the translation is complete I will post it (the English original is attached first). As its description says, it is "built for human beings", that is, designed for humans: it is easy to use, and you can read the documentation yourself. In the simplest case, requests.get() sends a GET request.

The code is as follows:
# coding: utf-8
import re
import requests

# Retrieve the web page content
r = requests.get('http://www.163.com')
data = r.text

# Use a regular expression to find all links
link_list = re.findall(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')", data)
for url in link_list:
    print(url)

First, import the re and requests modules. The re module provides regular expression support.

r = requests.get('http://www.163.com') sends a GET request to the NetEase homepage and returns a Response object r; r.text is the source code of the retrieved page, which we save in the string data.

Then a regular expression searches data for all links. My regular expression is rough: it simply extracts whatever sits between href="..." or href='...', which is exactly the link information we want.
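To see how this lookbehind/lookahead pattern behaves, here is a small self-contained sketch that applies the same regex to a hand-written HTML snippet (the snippet and variable names are made up for illustration):

```python
import re

# A made-up snippet with one double-quoted and one single-quoted href
html = '<a href="http://example.com/a">A</a> <a href=\'/b\'>B</a>'

# The same rough pattern from the article: grab the text between
# href="..." or href='...' using lookbehind/lookahead assertions
pattern = r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')"
links = re.findall(pattern, html)
print(links)  # ['http://example.com/a', '/b']
```

Because the pattern contains no capturing groups, re.findall returns the full text of each match, i.e. the bare href values.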

re.findall returns a list; a for loop traverses the list and prints each entry:

This is part of the full set of links I obtained.

The above is a simple implementation that gets all the links on a page. It handles no exceptions and does not distinguish hyperlink types, so the code is for reference only. For more information about the requests module, see the attachment.
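To illustrate the caveat about exceptions, here is a minimal sketch (not the original author's code) that wraps the same rough regex in basic error handling; the get_links name and the 10-second timeout are my own choices for the example:

```python
import re
import requests


def get_links(url):
    """Fetch a page and return the href values found in it.

    A sketch only: it still uses the article's rough regex and ignores
    the hyperlink type, but adds a timeout and basic exception handling.
    """
    try:
        # timeout prevents the crawler from hanging on a dead host
        r = requests.get(url, timeout=10)
        # raise an exception for 4xx/5xx responses
        r.raise_for_status()
    except requests.RequestException as e:
        print('request failed: %s' % e)
        return []
    return re.findall(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')", r.text)
```

With this shape, a failed request (DNS error, timeout, HTTP error status) simply yields an empty list instead of an unhandled traceback.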
