Converting Data Captured by Python Crawlers to PDF

Source: Internet
Author: User
Tags: wkhtmltopdf
This article shares how to use a Python crawler to convert Liao Xuefeng's Python tutorial into a PDF. If you have a similar need, refer to the method and code below.

No language seems more appropriate for writing crawlers than Python. The crawler libraries provided by the Python community are so plentiful that you can put together a working crawler in minutes using libraries that work out of the box. Today, I want to write a crawler that scrapes Liao Xuefeng's Python tutorial into a PDF e-book for offline reading.

Before starting to write the crawler, let's analyze the page structure of the website. The left side of the page is the directory outline of the tutorial, and each URL corresponds to an article on the right. The top right is the title of the article and the middle is its body; the body is the focus of our attention, since the data we want to crawl is the body of every page. Below the body is the user comment area, which is useless to us, so we can ignore it.

Tool preparation

After figuring out the basic structure of the website, you can start preparing the tool kit the crawler depends on. requests and BeautifulSoup are the two workhorses here: requests handles network requests and BeautifulSoup handles HTML parsing. With these two libraries, we don't need a crawler framework like scrapy, which would be overkill for such a small program. In addition, since we are converting HTML files into PDF files, we need library support for that too: wkhtmltopdf is a very good tool that converts HTML to PDF on multiple platforms, and pdfkit is a Python wrapper around wkhtmltopdf. First, install the following dependency packages.

Install the Python packages

pip install requests
pip install beautifulsoup4
pip install html5lib
pip install pdfkit

Install wkhtmltopdf

On Windows, download the stable build from the wkhtmltopdf official website and install it. After the installation is complete, add the program's executable directory to the system $PATH variable; otherwise pdfkit cannot find wkhtmltopdf and raises a "No wkhtmltopdf executable found" error. Ubuntu and CentOS can install it directly from the command line:

$ sudo apt-get install wkhtmltopdf  # ubuntu
$ sudo yum install wkhtmltopdf      # centos
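If pdfkit still cannot find the executable (common on Windows when the $PATH change has not taken effect), you can point it at the binary explicitly. A minimal sketch; the install path below is an assumption for illustration, adjust it to your machine:

import pdfkit

# Hypothetical Windows install location; replace with your actual path.
config = pdfkit.configuration(wkhtmltopdf=r"C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe")
pdfkit.from_url("http://example.com", "out.pdf", configuration=config)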

Crawler implementation

After everything is ready, we can get to the code, but before writing any, let's sort out our thoughts. The purpose of the program is to save the HTML body of every URL locally, then convert those files into a single PDF file with pdfkit. Let's split the task in two: first, save the HTML body corresponding to one URL to a local file; then find all the URLs and perform the same operation on each.

Open a tutorial page in Chrome and press F12 to inspect it; the article body sits in a div with the class x-wiki-content:

That div is the body of the webpage. After loading the entire page locally with requests, you can use BeautifulSoup to operate on the HTML DOM elements and extract the body content.

The specific implementation is as follows: use the soup.find_all function to locate the body tag, then save the contents of the body to a local .html file.

import requests
from bs4 import BeautifulSoup

def parse_url_to_html(url):
    """Fetch one page and save its article body to a local html file."""
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html5lib")
    # The article body lives in the element with class "x-wiki-content"
    body = soup.find_all(class_="x-wiki-content")[0]
    html = str(body)
    with open("a.html", "wb") as f:
        f.write(html.encode("utf-8"))
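As a quick check, you can run the function against the tutorial index page used later in this article:

# Smoke test on a single page; writes the extracted body to a.html.
parse_url_to_html("http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000")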

Step 2: parse all the URLs on the left side of the page. Use the same approach to find the menu tag on the left.

The implementation logic: because there are two elements with the uk-nav-side class on the page and the real directory list is the second one, we grab that one, collect all the URLs it contains, and then reuse the parse_url_to_html function written in step 1 on each of them.

def get_url_list():
    """Get the list of all URLs in the directory menu."""
    response = requests.get("http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000")
    soup = BeautifulSoup(response.content, "html5lib")
    # The real directory list is the second element with class "uk-nav-side"
    menu_tag = soup.find_all(class_="uk-nav-side")[1]
    urls = []
    for li in menu_tag.find_all("li"):
        url = "http://www.liaoxuefeng.com" + li.a.get('href')
        urls.append(url)
    return urls

The last step is to convert the HTML into a PDF file. This part is very simple, because pdfkit encapsulates all the logic; you only need to call the function pdfkit.from_file.

import pdfkit

def save_pdf(htmls, file_name):
    """Convert all html files into a single PDF file."""
    options = {
        'page-size': 'Letter',
        'encoding': "UTF-8",
        'custom-header': [('Accept-Encoding', 'gzip')]
    }
    pdfkit.from_file(htmls, file_name, options=options)
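A minimal driver tying the three steps together might look like the sketch below. The per-page file names and the assumption that parse_url_to_html is extended to take an output-name parameter are mine, not part of the original code:

def main():
    urls = get_url_list()
    htmls = []
    for index, url in enumerate(urls):
        html_file = str(index) + ".html"
        # Assumes parse_url_to_html accepts an extra output-name parameter,
        # since the version above always writes to a.html.
        parse_url_to_html(url, html_file)
        htmls.append(html_file)
    save_pdf(htmls, "liaoxuefeng_python_tutorial.pdf")

if __name__ == "__main__":
    main()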

Run the save_pdf function and the e-book PDF file is generated.


The above covers the details of converting data captured by a Python crawler to PDF.
