Downloading all the PDF documents on a single webpage with Python
1. Background

Recently I noticed that my grasp of algorithms and data structures had slipped a lot (to be honest, I never learned them properly at university). My recent projects have been getting structurally more complex, and I wanted to exercise my thinking, so I decided to review data structures and algorithms. Since I have also been studying English lately, I chose an English-language reference book: "Data Structures and Algorithms in Java". It was fairly hard going at first. I have a habit of reading a book's appendix, so I went to the book's official website to have a look, and found that the PDF documents at http://ww0.java4.datastructures.net/handouts/ are really good: well illustrated and a great aid to understanding, so I definitely wanted to download them. However, there turned out to be a lot of them, and saving them one by one would have been painful. So I started thinking about how to download them all.

2. Implementation

I considered all the languages I have learned that could be used for this. My familiarity with each is roughly as follows:

Java/Android: familiar
C#: familiar
Python: know the syntax
JavaScript: know a little
C/C++: know the syntax

The goal, of course, was the simplest and fastest way to get it done. I used C# all through university, so why not use that? But on OS X only Mono is available, and I would have to get familiar with it all over again. A Java implementation would not be quick to write either, given the time available. I am not familiar with JavaScript; it seems this could be written with Node.js (which Atom is built with), but I don't know it well. I have not used C/C++ for many years, and an implementation would take a lot of troublesome code. Since I had learned Python's syntax on Codecademy some time ago, this was a good chance to practice it. OK, Python it is. The next question was how to make the network request, parse the HTML of the webpage, extract the download links, and download the files.
Although I did not know how to implement any of this in Python, the overall process was clear, so I went looking online for ready-made solutions for each step. I did not study the underlying principles here; the goal was just to get it working. Next came searching with Google or Baidu (different engines give different results), keeping the goal in mind and looking only for relevant material. After searching, I settled on requests to fetch the webpage, BeautifulSoup to parse the HTML and extract the download links, and a snippet found on Stack Overflow to download each file. Then I put them together. The combined code is as follows:

    # file-name: pdf_download.py
    __author__ = 'rxread'
    import requests
    from bs4 import BeautifulSoup


    def download_file(url, index):
        # Prefix the file name with its index so the reading order is kept
        local_filename = index + "-" + url.split('/')[-1]
        # NOTE the stream=True parameter
        r = requests.get(url, stream=True)
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                if chunk:  # filter out keep-alive new chunks
                    f.write(chunk)
                    f.flush()
        return local_filename

    # http://ww0.java4.datastructures.net/handouts/
    root_link = "http://ww0.java4.datastructures.net/handouts/"
    r = requests.get(root_link)
    if r.status_code == 200:
        soup = BeautifulSoup(r.text, "html.parser")
        # print(soup.prettify())
        index = 1
        for link in soup.find_all('a'):
            href = link.get('href')
            if href is None:  # skip anchors without an href attribute
                continue
            new_link = root_link + href
            if new_link.endswith(".pdf"):
                file_path = download_file(new_link, str(index))
                print("downloading:" + new_link + "->" + file_path)
                index += 1
        print("all download finished")
    else:
        print("errors occur.")

Run the following command to download all the PDF files to your local machine:

    python pdf_download.py

3. Optimization

In just over 30 lines of code, everything was done. It is concise and clear; Python really is a good fit for this kind of scripting task. It downloaded 41 documents. At first the downloaded files had no serial number, so I could not tell their order when reading them; that is why I added a serial number in front of each file name. Other possible improvements:

- Some exceptions in the functions are not handled; this needs to be dealt with later.
- The functions are not fully encapsulated, and few file types are supported; this can be extended as needed.
- The current approach is fine when there are only a few files. With many files, downloading with multiple threads (a reasonable number) or a thread pool would speed things up.
- Some of the code may not follow Python coding conventions. Still, having written something rather than nothing is already the difference between 0 and 1.
- Other details, for example the extension may be uppercase (.PDF) rather than .pdf.
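
Several of the improvements mentioned above (handling exceptions, using a thread pool, and accepting an uppercase .PDF extension) could be sketched roughly as follows. This is only a minimal sketch for Python 3, not a tested replacement for the script: the names select_pdf_links and download_all and the default of 4 worker threads are made up for illustration, and the per-file downloader is passed in as a parameter so that the download_file function from the script can be reused.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin


def select_pdf_links(base_url, hrefs):
    """Return absolute URLs for the hrefs that end in .pdf, ignoring case."""
    links = []
    for href in hrefs:
        # href may be None for anchors that have no href attribute
        if href and href.lower().endswith(".pdf"):
            links.append(urljoin(base_url, href))
    return links


def download_all(urls, download_one, max_workers=4):
    """Download urls with a thread pool; return (saved, failed) lists.

    download_one(url, index) is the per-file downloader (hypothetical
    parameter), for example download_file from the script.
    """
    def task(numbered):
        index, url = numbered
        try:
            return url, download_one(url, str(index)), None
        except Exception as exc:  # one failed file should not stop the rest
            return url, None, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input urls
        results = list(pool.map(task, enumerate(urls, start=1)))

    saved = [path for _, path, exc in results if exc is None]
    failed = [(url, exc) for url, _, exc in results if exc is not None]
    return saved, failed
```

With the script's variables, the hrefs could be collected as [a.get('href') for a in soup.find_all('a')] and then passed through select_pdf_links(root_link, hrefs) and download_all(links, download_file). Note that requests only raises on connection problems by default; calling r.raise_for_status() inside download_file would also turn HTTP error responses into exceptions that end up in the failed list.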