A tool, implemented in Python, that downloads an entire website.
The core process is simple:
1. Enter the website address.
2. Open the URL and get the content of the response.
3. Check the HTTP headers of the response: if the content type is HTML, continue with step 4; for any other type, skip to step 6.
4. Extract the href and src attribute values from the HTML.
5. Add each extracted URL to the download queue. If a URL is already in the download queue, discard it.
6. Open the next URL in the URL queue.
7. Repeat from step 2 until all URLs in the URL queue have been processed.
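The steps above can be sketched roughly as follows. This is a minimal illustration and not the original program: the names LinkExtractor, fetch_url, and crawl are hypothetical, and the sketch assumes that only URLs on the same host as the start address should be queued, which the steps do not state explicitly.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit
import urllib.request


class LinkExtractor(HTMLParser):
    """Collects every href and src attribute value found in a page (step 4)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def fetch_url(url):
    """Step 2: open the URL and return (content type, body bytes)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.headers.get_content_type(), response.read()


def crawl(start_url, fetch=fetch_url, save=lambda url, body: None):
    """Download start_url and everything reachable from it on the same host."""
    host = urlsplit(start_url).netloc      # assumption: stay on one host
    queue = deque([start_url])             # URLs still to be downloaded
    seen = {start_url}                     # every URL ever queued (step 5)
    while queue:                           # step 7: until the queue is empty
        url = queue.popleft()              # step 6: take the next URL
        content_type, body = fetch(url)
        save(url, body)
        if content_type != "text/html":    # step 3: only HTML is parsed
            continue
        parser = LinkExtractor()
        parser.feed(body.decode("utf-8", errors="replace"))
        for link in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute = urldefrag(urljoin(url, link)).url
            if urlsplit(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)         # step 5: discard duplicates
                queue.append(absolute)
    return seen
```

The fetch and save parameters are there only so the loop can be tried without network access, for example by passing a function that looks pages up in a dictionary.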
These steps look very simple, but there were many details that took half a day to deal with,
such as the various forms a URL can take, and how to name the local file for a URL that has a question mark (a query string) in its suffix.
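One possible way to handle the naming problem is sketched below. It is an assumption for illustration, not the scheme the original program uses: the function name url_to_local_path is hypothetical, the query string is percent-encoded and appended to the file name, and a URL that ends in a slash is saved as index.html.

```python
import posixpath
from urllib.parse import quote, urlsplit


def url_to_local_path(url):
    """Map a URL to a relative file path that is safe on common file systems."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        # A directory-style URL has no file name of its own.
        path += "index.html"
    if parts.query:
        # '?' is not allowed in Windows file names, so fold the
        # percent-encoded query string into the name before the extension.
        root, ext = posixpath.splitext(path)
        path = root + "_" + quote(parts.query, safe="") + ext
    # ':' (from an explicit port) is also invalid on Windows.
    return parts.netloc.replace(":", "_") + path
```

For example, under these assumptions http://example.test/view.php?id=3 would be stored as example.test/view_id%3D3.php, so two URLs that differ only in their query string do not overwrite each other.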
At present, the program has the following problems:
1. When a URL is opened, the call may block and execution never continues. This requires a closer look at urllib.request.
2. When the URL queue is long, downloading is slow; multi-threaded downloading would be faster.
3. The English comments contain an unknown number of errors. Writing the comments in Chinese would have meant switching the input method back and forth, so I used English.
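For the first problem, one possible cause is that urllib.request.urlopen is called without a timeout; in that case the global default socket timeout is used, which is no timeout at all unless it has been set. The sketch below passes an explicit timeout and treats any failure as "skip this URL". The name fetch_with_timeout and the 10-second default are assumptions, not taken from the original program.

```python
import http.client
import urllib.request


def fetch_with_timeout(url, timeout=10.0):
    """Return the response body, or None if the URL fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses;
        # HTTPException covers malformed or truncated responses.
        return None
```

Returning None lets the main loop move on to the next URL in the queue instead of hanging on a server that never answers.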
The program does not currently support multi-threading; this can be added later.
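As a starting point for that improvement, a batch of queued URLs could be downloaded on a thread pool from the standard library. This is only a sketch under assumptions: download_all is a hypothetical helper, fetch stands for whatever single-URL download function the program already has, and the pool size of 8 is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, fetch, max_workers=8):
    """Fetch every URL on a pool of worker threads; returns {url: result}."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the input URLs.
        # If fetch raises for some URL, the exception is re-raised here.
        results = pool.map(fetch, urls)
        return dict(zip(urls, results))
```

Threads suit this job because the workers spend most of their time waiting on the network. The queue and the set of already-seen URLs would still need to be updated from a single thread, or protected by a lock.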
If anyone is interested in improving it, you are very welcome to.
Source code download: http://download.csdn.net/detail/jiangxiaoma111/8002631
Personal e-mail: [Email protected]