Python crawls all PDF documents in a single page

Source: Internet
Author: User

GitHub blog address; updates here may not be very timely.

1. Background

I recently realized that my knowledge of algorithms and data structures has slipped a lot (to be honest, I never learned them properly at university, 囧rz). Since the structure of my recent projects keeps getting more complex, and this is a good way to exercise my thinking, I decided to review data structures and algorithms. I am also studying English at the moment, so I simply picked an English reference book: "Data Structures and Algorithms in Java".
At first it was a struggle and the reading went slowly. I have a habit of digging through a book's appendices, so I visited the book's companion website and found that the PDF documents under http://ww0.java4.datastructures.net/handouts/ are actually quite good: well illustrated and a great aid to understanding, so I decided to download them right away. It turned out there are a lot of them, though, and saving them one by one with "Save as" would be painful, so I looked for a way to download them all at once.

2. Implementation

First, here are all the languages I have learned so far that could be used for this, ranked by how well I know them:

    1. Java/Android: familiar
    2. C#: familiar
    3. Python: know the syntax
    4. JavaScript: know a little
    5. C++: know the syntax

The simplest and fastest way to get this done is of course the best. I used C# throughout university, so should I use that? But on OS X only Mono is available, and I would have to get familiar with it again. A Java implementation is no fun either, given the time it would take. I don't know JavaScript well; it seems this could be written with Node.js (which Atom is built on), but I'm not familiar with that either. I haven't used C++ for years, and it would take a lot of code, which is especially troublesome. I had just learned Python's syntax on Codecademy a while ago, so I decided to use this task as practice.
OK, Python it is. The next question is how to request a page over the network, parse its HTML tags, extract the download links, and download the files. I didn't know how any of this is done in Python, but the process was settled, so I looked up each step in turn. The goal here is to get the function working, not to study the underlying principles.
Next came a lot of searching. Google works and so does Baidu (different engines have different strengths). Just don't lose sight of the goal while searching for the relevant information.
After searching, I settled on requests to fetch the web page, BeautifulSoup to parse the HTML and extract the download links, and a file download snippet found on StackOverflow to download the documents.
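The "extract download links" step comes down to resolving each href against the page URL and keeping the ones that point to PDFs. Below is a minimal Python 3 sketch of that step in isolation, using only the standard library. It assumes the href values have already been pulled out of the page (for example with BeautifulSoup's find_all('a')). The function name pdf_links and the sample file names are made up for illustration; they do not come from the original script.

```python
from urllib.parse import urljoin, urlparse


def pdf_links(page_url, hrefs):
    """Resolve hrefs against page_url and keep links whose path ends in .pdf."""
    links = []
    for href in hrefs:
        if not href:  # <a> tags without an href attribute yield None
            continue
        absolute = urljoin(page_url, href)
        # Check only the URL path, case-insensitively, so "X.PDF?v=1" still matches.
        if urlparse(absolute).path.lower().endswith(".pdf"):
            links.append(absolute)
    return links


if __name__ == "__main__":
    page = "http://ww0.java4.datastructures.net/handouts/"
    # Hypothetical href values, not the real contents of the page.
    for link in pdf_links(page, ["Arrays.pdf", None, "index.html", "/other/Trees.PDF"]):
        print(link)
```

Using urljoin instead of plain string concatenation means absolute and root-relative hrefs are handled as well, not only file names relative to the page.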
Then I put them together. The combined code is as follows:

    # file-name: pdf_download.py
    __author__ = 'Rxread'
    import requests
    from bs4 import BeautifulSoup


    def download_file(url, index):
        # Prefix the file name with a sequence number, e.g. "1-<name>.pdf"
        local_filename = index + "-" + url.split('/')[-1]
        # NOTE the stream=True parameter
        r = requests.get(url, stream=True)
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                if chunk:  # filter out keep-alive new chunks
                    f.write(chunk)
                    f.flush()
        return local_filename


    # http://ww0.java4.datastructures.net/handouts/
    root_link = "http://ww0.java4.datastructures.net/handouts/"
    r = requests.get(root_link)
    if r.status_code == 200:
        soup = BeautifulSoup(r.text, "html.parser")
        # print(soup.prettify())
        index = 1
        for link in soup.find_all('a'):
            # An <a> tag without an href yields None; treat it as an empty string
            new_link = root_link + (link.get('href') or '')
            if new_link.endswith(".pdf"):
                file_path = download_file(new_link, str(index))
                print("downloading:" + new_link + " -> " + file_path)
                index += 1
        print("all download finished")
    else:
        print("errors occur.")

You can download all the PDF documents to the current directory by running the following command.

    python pdf_download.py

3. Optimization

Just over 30 lines of code and it is all done, simple and clear. Python really is well suited to this kind of scripting task. It downloaded 41 documents.
At first the downloaded files had no sequence number, so there was no way to tell what order they came in; that is why I prefixed each file name with a sequence number.
Further optimizations to consider:

    1. Some exceptions are not handled inside the functions; this needs to be dealt with later.
    2. The logic is not fully encapsulated in functions, and only one file type is supported; this can be extended as needed.
    3. With few files this is fine, but with many files it would be worth downloading with multiple threads (a moderate number) or a thread pool to speed things up.
    4. Some of the code may not follow Python style conventions, but of course the difference between having written it and not having written it is the difference between 1 and 0.
    5. Other details, for example the extension may be uppercase (.PDF).
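
Points 1 and 3 above could be approached along the following lines. This is a rough Python 3 sketch, not part of the original script. It assumes a fetch(url, index) callable with the same signature as download_file, and the name download_all and the default of four workers are arbitrary choices made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, max_workers=4):
    """Run fetch(url, index) for every URL on a small thread pool.

    Returns (saved, failed): saved maps url -> fetch's return value,
    failed maps url -> the exception that fetch raised.
    """
    saved, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Indexes are assigned up front, so numbering does not depend on finish order.
        futures = {
            pool.submit(fetch, url, str(index)): url
            for index, url in enumerate(urls, start=1)
        }
        for future in as_completed(futures):
            url = futures[future]
            try:
                saved[url] = future.result()
            except Exception as exc:  # one failed download should not stop the rest
                failed[url] = exc
    return saved, failed
```

With the script's own function this would be called as saved, failed = download_all(links, download_file), where links is the list of PDF URLs collected from the page. Anything left in failed can then be retried or reported.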

4. Appendix

    1. "Data Structures and Algorithms in Java" (Michael T. Goodrich, Roberto Tamassia): download from http://bookzz.org/ or http://it-ebooks.info/
       Both of these are good book download sites. If you can, buy a copy of the book to support the authors.
       I usually download the ebook first to have a look, and buy the paper version if it suits me.
    2. Getting started with Python syntax: http://www.codecademy.com/zh/tracks/python

That's all.

This article is from Rxread's Blog. Reprints are welcome; please credit the source.
Discussion and exchange are welcome.
