Downloading an Entire Website with Python

Source: Internet
Author: User

A tool, implemented in Python, that downloads an entire website.

The core process is simple:

1. Enter the website address.

2. Open the URL and get the content of the response.

3. Check the HTTP headers of the response. If the content type is HTML, continue with step 4. For any other type, skip to step 6.

4. Extract the href and src attribute values from the HTML.

5. Add each extracted URL to the download queue. If a URL is already in the queue, discard it.

6. Open the next URL in the URL queue.

7. Repeat from step 2 until every URL in the URL queue has been processed.
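
The steps above can be sketched roughly as follows. This is not the author's source code, only a minimal illustration using the standard library. The names `LinkExtractor`, `extract_links`, `fetch`, `crawl` and `save` are made up for this example, and restricting the crawl to the starting host is an assumption added so the loop does not wander off the site. Error handling is left out to keep the sketch short.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse
import urllib.request


class LinkExtractor(HTMLParser):
    """Collect the values of every href and src attribute (step 4)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def extract_links(html, base_url):
    """Return absolute URLs (fragments removed) found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, link)).url for link in parser.links]


def fetch(url, timeout=10):
    """Step 2: return (content_type, body_bytes) for a URL."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.headers.get_content_type(), response.read()


def crawl(start_url, fetch=fetch, save=lambda url, body: None):
    """Steps 1-7: breadth-first download of every URL on the starting host."""
    host = urlparse(start_url).netloc   # assumption: stay on one host
    queue = deque([start_url])
    seen = {start_url}                  # every URL ever queued (step 5)
    while queue:                        # step 7: stop when the queue is empty
        url = queue.popleft()           # step 6: take the next URL
        content_type, body = fetch(url)
        save(url, body)
        if content_type != "text/html":  # step 3: only HTML is parsed
            continue
        for link in extract_links(body.decode("utf-8", "replace"), url):
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen
```

Passing `fetch` and `save` in as parameters makes it possible to try the loop against a small in-memory dictionary of pages before pointing it at a real site.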

These steps look very simple, but there are many details to handle, and working through them took a good half day.

Examples include the various forms a URL can take, and how to name the local file for a URL that ends with a question mark and query string.
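
One possible way to handle the naming problem is sketched below. The scheme is an assumption for illustration, not what the original tool does: the hypothetical `url_to_path` function maps directory-style URLs to `index.html` and replaces the query string with a short hash, so that the file name contains no `?` while different queries still get different files.

```python
import hashlib
import posixpath
from urllib.parse import urlparse, unquote


def url_to_path(url, root="site"):
    """Map a URL to a relative local file path (hypothetical naming scheme)."""
    parts = urlparse(url)
    path = unquote(parts.path)
    if not path or path.endswith("/"):
        path += "index.html"  # directory-style URL
    # Normalise the path and drop any ".." so the result stays under `root`.
    segments = [s for s in posixpath.normpath(path).split("/") if s and s != ".."]
    name = segments.pop()
    if parts.query:
        # Keep different queries apart without putting "?" in a file name.
        digest = hashlib.sha256(parts.query.encode("utf-8")).hexdigest()[:8]
        stem, ext = posixpath.splitext(name)
        name = f"{stem}_{digest}{ext}"
    return "/".join([root, parts.netloc.replace(":", "_"), *segments, name])
```

For example, `url_to_path("http://example.com/")` gives `site/example.com/index.html`, while a URL such as `http://example.com/a/b.php?id=3` becomes `site/example.com/a/b_` followed by eight hex digits and `.php`.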

At present, the program has the following problems:

1. Opening a URL may block indefinitely, so execution never continues. This requires a closer look at urllib.request.

2. When the URL queue is long, multi-threaded downloading would be faster.

3. The English comments probably contain errors. Writing them in Chinese would have meant switching the input method back and forth, so I used English.
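
For the blocking problem, one thing that may help is the `timeout` argument of `urllib.request.urlopen`. When no timeout is given and no global socket default is set, a stalled connection can wait forever. The sketch below is only an illustration: `fetch_with_timeout` and its `opener` parameter are hypothetical names, and the 10-second default is an arbitrary choice. Note that the timeout applies to each blocking socket operation, not to the download as a whole.

```python
import socket
import urllib.error
import urllib.request


def fetch_with_timeout(url, timeout=10, opener=urllib.request.urlopen):
    """Return the response body, or None if the request fails or times out."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, socket.timeout) as exc:
        # A failed or stalled URL is skipped so the crawl can continue.
        print(f"skipping {url}: {exc}")
        return None
```

Returning `None` instead of raising lets the main loop simply move on to the next URL in the queue.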

The current program does not yet support multi-threading; this can be improved later.
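
A minimal sketch of how multi-threaded downloading could look, using `concurrent.futures` from the standard library. The function name `download_all`, the `fetch` callable it expects, and the default of 8 workers are all assumptions for illustration. It downloads one batch of URLs concurrently; fitting it into the queue loop (for example, one batch per crawl level) is left open.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, max_workers=8):
    """Fetch every URL concurrently; return {url: body, or None on failure}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL so results can be labelled.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One failed download should not stop the others.
                print(f"{url} failed: {exc}")
                results[url] = None
    return results
```

Threads suit this workload because the time is spent waiting on the network rather than on computation.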

Anyone interested in improving it is very welcome to do so.

Source code download: http://download.csdn.net/detail/jiangxiaoma111/8002631

Personal e-mail: [Email protected]
