This article describes how to use Python to crawl HTML web pages and save them as PDF files. It covers installing the PyPDF2 module (used to merge PDFs) and, with a complete example, the techniques for fetching HTML pages and generating PDF files with pdfkit and wkhtmltopdf. To share with you for your reference, the details are as follows:
I. Introduction
In this example we will scrape the HTML pages of Liao Xuefeng's Python tutorial and save them as a single PDF.
II. Preparation
1. Installation and use of PyPDF2 (for merging PDFs):
PyPDF2 version: 1.25.1
Installation:
pip install PyPDF2
Example of use:
from PyPDF2 import PdfFileMerger

merger = PdfFileMerger()
input1 = open("hql_1_20.pdf", "rb")
input2 = open("hql_21_40.pdf", "rb")
merger.append(input1)
merger.append(input2)
# Write to an output PDF document
output = open("hql_all.pdf", "wb")
merger.write(output)
output.close()
2. requests and BeautifulSoup are the two workhorses of web crawling: requests handles the network requests and BeautifulSoup parses and manipulates the HTML. With these two, the job is done with ease; a full crawler framework such as Scrapy would be overkill for such a small program. In addition, since we are converting HTML files to PDF, we also need library support for that step. wkhtmltopdf is a very useful tool that converts HTML to PDF on multiple platforms, and pdfkit is a Python wrapper around wkhtmltopdf. First, install the following dependencies.
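The Python-side packages can be installed with pip (beautifulsoup4 is the PyPI package name for BeautifulSoup; the others are named as shown):

pip install requests beautifulsoup4 pdfkit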
On Windows, download the stable version of wkhtmltopdf from http://wkhtmltopdf.org/downloads.html and install it. After the installation is complete, add the program's execution path to the system's $PATH environment variable; otherwise pdfkit cannot find wkhtmltopdf and fails with the error "No wkhtmltopdf executable found". On Ubuntu and CentOS, wkhtmltopdf can be installed directly from the command line.
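For example (a minimal sketch; exact package availability depends on the distribution release, and CentOS typically needs an extra repository such as EPEL enabled first):

sudo apt-get install wkhtmltopdf   # Ubuntu / Debian
sudo yum install wkhtmltopdf       # CentOS

1. Get the URL of each article through the tutorial's directory page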
def get_url_list():
    """
    Get a list of all URLs in the directory
    :return:
    """
    response = requests.get("http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000")
    soup = BeautifulSoup(response.content, "html.parser")
    menu_tag = soup.find_all(class_="uk-nav uk-nav-side")[1]
    urls = []
    for li in menu_tag.find_all("li"):
        url = "http://www.liaoxuefeng.com" + li.a.get('href')
        urls.append(url)
    return urls
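A quick sanity check might look like this (a sketch; the selector above assumes the tutorial index page still uses the uk-nav uk-nav-side menu class, so the actual count depends on the live page):

urls = get_url_list()
print(len(urls))    # number of article URLs found in the menu
print(urls[:3])     # first few article URLs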
2. Given an article URL, save the article's HTML to a file using a template
def parse_url_to_html(url, name):
    """
    Parse the URL and save the article body as an HTML file
    :param url: URL to parse
    :param name: name of the saved html file
    :return: html file name
    """
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        # Body
        body = soup.find_all(class_="x-wiki-content")[0]
        # Title
        title = soup.find('h4').get_text()
        # Add the title in front of the body and center it
        center_tag = soup.new_tag("center")
        title_tag = soup.new_tag('h1')
        title_tag.string = title
        center_tag.insert(1, title_tag)
        body.insert(1, center_tag)
        html = str(body)
        # Change the relative src paths of the img tags in the body to absolute paths
        pattern = "(<img .*?src=\")(.*?)(\")"
        def func(m):
            if not m.group(2).startswith("http"):
                rtn = m.group(1) + "http://www.liaoxuefeng.com" + m.group(2) + m.group(3)
                return rtn
            else:
                return m.group(1) + m.group(2) + m.group(3)
        html = re.compile(pattern).sub(func, html)
        html = html_template.format(content=html)
        html = html.encode("utf-8")
        with open(name, 'wb') as f:
            f.write(html)
        return name
    except Exception as e:
        logging.error("Parse error", exc_info=True)
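Note that parse_url_to_html references a module-level html_template string. It is defined in the full source at the end of this article and is simply a minimal page wrapper around the extracted body:

html_template = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
</head>
<body>
{content}
</body>
</html>
"""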
3. Convert html to pdf
def save_pdf(htmls, file_name):
    """
    Save the html file(s) to a pdf file
    :param htmls: html file name, or list of html file names
    :param file_name: pdf file name
    :return:
    """
    options = {
        'page-size': 'Letter',
        'margin-top': '0.75in',
        'margin-right': '0.75in',
        'margin-bottom': '0.75in',
        'margin-left': '0.75in',
        'encoding': "UTF-8",
        'custom-header': [
            ('Accept-Encoding', 'gzip')
        ],
        'cookie': [
            ('cookie-name1', 'cookie-value1'),
            ('cookie-name2', 'cookie-value2'),
        ],
        'outline-depth': 10,
    }
    pdfkit.from_file(htmls, file_name, options=options)
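pdfkit.from_file accepts either a single HTML file path or a list of paths, so save_pdf can render one article at a time or several articles into one PDF in a single call. For example (file names here are purely illustrative):

save_pdf("0.html", "article_0.pdf")
save_pdf(["0.html", "1.html"], "articles_0_1.pdf")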
4. Merge the individually converted PDFs into one PDF
merger = PdfFileMerger()
for pdf in pdfs:
    merger.append(open(pdf, 'rb'))
    print(u"Merge completed: " + pdf)
# Write the merged result (see the full source below)
output = open(u"廖雪峰 Python_all.pdf", "wb")
merger.write(output)
Full source code:
# coding=utf-8
import os
import re
import time
import logging
import pdfkit
import requests
from bs4 import BeautifulSoup
from PyPDF2 import PdfFileMerger

html_template = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
</head>
<body>
{content}
</body>
</html>
"""
def parse_url_to_html(url, name):
    """
    Parse the URL and save the article body as an HTML file
    :param url: URL to parse
    :param name: name of the saved html file
    :return: html file name
    """
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        # Body
        body = soup.find_all(class_="x-wiki-content")[0]
        # Title
        title = soup.find('h4').get_text()
        # Add the title in front of the body and center it
        center_tag = soup.new_tag("center")
        title_tag = soup.new_tag('h1')
        title_tag.string = title
        center_tag.insert(1, title_tag)
        body.insert(1, center_tag)
        html = str(body)
        # Change the relative src paths of the img tags in the body to absolute paths
        pattern = "(<img .*?src=\")(.*?)(\")"
        def func(m):
            if not m.group(2).startswith("http"):
                rtn = m.group(1) + "http://www.liaoxuefeng.com" + m.group(2) + m.group(3)
                return rtn
            else:
                return m.group(1) + m.group(2) + m.group(3)
        html = re.compile(pattern).sub(func, html)
        html = html_template.format(content=html)
        html = html.encode("utf-8")
        with open(name, 'wb') as f:
            f.write(html)
        return name
    except Exception as e:
        logging.error("Parse error", exc_info=True)
def get_url_list():
    """
    Get a list of all URLs in the directory
    :return:
    """
    response = requests.get("http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000")
    soup = BeautifulSoup(response.content, "html.parser")
    menu_tag = soup.find_all(class_="uk-nav uk-nav-side")[1]
    urls = []
    for li in menu_tag.find_all("li"):
        url = "http://www.liaoxuefeng.com" + li.a.get('href')
        urls.append(url)
    return urls
def save_pdf(htmls, file_name):
    """
    Save the html file(s) to a pdf file
    :param htmls: html file name, or list of html file names
    :param file_name: pdf file name
    :return:
    """
    options = {
        'page-size': 'Letter',
        'margin-top': '0.75in',
        'margin-right': '0.75in',
        'margin-bottom': '0.75in',
        'margin-left': '0.75in',
        'encoding': "UTF-8",
        'custom-header': [
            ('Accept-Encoding', 'gzip')
        ],
        'cookie': [
            ('cookie-name1', 'cookie-value1'),
            ('cookie-name2', 'cookie-value2'),
        ],
        'outline-depth': 10,
    }
    pdfkit.from_file(htmls, file_name, options=options)
def main():
    start = time.time()
    file_name = u"liaoxuefeng_Python3_tutorial"
    urls = get_url_list()
    for index, url in enumerate(urls):
        parse_url_to_html(url, str(index) + ".html")
    htmls = []
    pdfs = []
    # One html file was saved per URL above, so iterate over the same range
    for i in range(len(urls)):
        htmls.append(str(i) + '.html')
        pdfs.append(file_name + str(i) + '.pdf')
        save_pdf(str(i) + '.html', file_name + str(i) + '.pdf')
        print(u"Conversion completed: " + str(i) + ".html")
    merger = PdfFileMerger()
    for pdf in pdfs:
        merger.append(open(pdf, 'rb'))
        print(u"Merge completed: " + pdf)
    output = open(u"廖雪峰 Python_all.pdf", "wb")
    merger.write(output)
    output.close()
    print(u"PDF output succeeded!")
    for html in htmls:
        os.remove(html)
        print(u"Deleted temporary file " + html)
    for pdf in pdfs:
        os.remove(pdf)
        print(u"Deleted temporary file " + pdf)
    total_time = time.time() - start
    print(u"Total time: %f seconds" % total_time)

if __name__ == '__main__':
    main()