Python crawler implementation tutorial converted to PDF e-book

Last Update:2017-05-14 Source: Internet

Author: User

Tags wkhtmltopdf

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

This article will share with you how to use python crawlers to convert Liao Xuefeng's Python tutorial to PDF, if you have any need, refer to this article to share with you the method and code for converting Liao Xuefeng's python tutorial into PDF using Python crawlers. if you have any need, refer

Writing crawlers does not seem to be more appropriate than using Python. the crawler tools provided by the Python community are so dazzling that you can write a crawler in minutes using libraries that can be used directly, today, I want to write a crawler that crawls Liao Xuefeng's Python tutorial into PDF e-books for you to read offline.

Before starting to write crawlers, let's analyze the page structure of website 1. the left side of the page is the directory outline of the tutorial. each URL corresponds to an article on the right, the top right is the title of the article, the middle is the body of the article, the content of the body is the focus of our attention, the data we want to crawl is the body of all web pages, below is the user's comment area, which is useless to us, so we can ignore it.

Tool preparation

After figuring out the basic structure of the website, you can start preparing the tool kit on which the crawler depends. Requests and beautifulsoup are two major crawlers. reuqests are used for network requests and beautifusoup is used for html data operations. With these two barrackets, we don't need a crawler framework like scrapy. it's a little cool for the mini program. In addition, since html files are converted into pdf files, the corresponding library must be supported. wkhtmltopdf is a very good tool that can be used to convert html files to pdf files on multiple platforms, development Kit is a Python package of wkhtmltopdf. First, install the following dependency packages,

Install wkhtmltopdf

pip install requestspip install beautifulsouppip install pdfkit

Install wkhtmltopdf

Windows platform directly downloads the stable version on wkhtmltopdf official website 2 for installation. after the installation is complete, add the execution PATH of the program to the system environment $ PATH variable, otherwise, the "No wkhtmltopdf executable found" error occurs when the development kit cannot find wkhtmltopdf ". Ubuntu and CentOS can be directly installed using the command line

$ sudo apt-get install wkhtmltopdf # ubuntu$ sudo yum intsall wkhtmltopdf   # centos

Crawler implementation

After everything is ready, you can go to the code. But before writing the code, sort out your thoughts. The purpose of the program is to save the html body of all URLs to the local device, and then convert these files into a pdf file using development kit. We split the task. First, we saved the html body corresponding to a URL to the local machine, and then found all the URLs to perform the same operation.

Use Chrome to find the tab of the page body and press F12 to find the p tag corresponding to the body:

The p is the body of the webpage. After the entire page is loaded locally with requests, you can use beautifulsoup to operate the HTML dom element to extract the body content.

The specific implementation code is as follows: Use the soup. find_all function to find the body label and save the content of the body part to the.html file.

def parse_url_to_html(url):  response = requests.get(url)  soup = BeautifulSoup(response.content, "html5lib")  body = soup.find_all(class_="x-wiki-content")[0]  html = str(body)  with open("a.html", 'wb') as f:    f.write(html)

Step 2: resolve all URLs on the left of the page. Find the menu label on the left in the same way

Specific code implementation logic: because there are two class attributes of uk-nav-side on the page, the real directory list is the second one. All URLs are obtained, and the url-to-html function is also written in the first step.

Def get_url_list (): "get all URL directory lists" response = requests. get ("http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000") soup = BeautifulSoup (response. content, "html5lib") menu_tag = soup. find_all (class _ = "uk-nav-side") [1] urls = [] for li in menu_tag.find_all ("li "): url = "http://www.liaoxuefeng.com" + li. a. get ('href ') urls. append (url) return urls

The last step is to convert html into a PDF file. Converting to a pdf file is very simple, because development kit encapsulates all the logic, you only need to call the function Development Kit. from_file

Def save_pdf (htmls): "convert all html files into PDF files" options = {'page-size': 'Letter ', 'encoding': "UTF-8 ", 'custom-header': [('Accept-encoding', 'gzip ')]} using kit. from_file (htmls, file_name, options = options)

Run the save_pdf function to generate an e-Book pdf file ,:

Summary

A total of less than 50 lines of code are added. However, it is slow. In fact, the code provided above omitted some details, such as how to get the title of the article, the img label of the body uses relative paths. to display images normally in pdf, you need to change the relative paths to absolute paths, and delete the saved html temporary files.

For more python crawler implementation tutorials to convert to PDF e-books, please follow the PHP Chinese network!

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More