Python crawler data is converted into PDF

Source: Internet
Author: User
Tags wkhtmltopdf
This article is to share the use of Python crawler implementation of the "Liaoche Python Tutorial" into a PDF method and code, the need for small partners can refer to the following

It seems no more appropriate to write crawlers than with Python, the Python community provides a lot of crawler tools to dazzle you, all kinds of libraries can be directly used to write a reptile in minutes can be written out, today to write a crawler, the Liaoche Python tutorial climbed down to make PDF E-books are easy for everyone to read offline.

Before beginning to write the crawler, we first analyze the Site 1 page structure, the left side of the page is the Tutorial directory outline, each URL corresponds to the right of an article, the right is the title of the article, the middle is the body of the article, the body content is the focus of our concern, the data we want to crawl is the body of all pages, Below is the user's comment area, the comment area is no use to us, so we can ignore it.

Tool Preparation

After figuring out the basic structure of the site, you can start preparing the toolkit that the crawler relies on. Requests, BeautifulSoup is the crawler of the two great artifacts, reuqests for network requests, Beautifusoup for manipulating HTML data. With these two shuttles, do the work to be neat, scrapy such a crawler frame we do not need, small program sent it a little overkill meaning. In addition, since the HTML file is converted to PDF, then also have the corresponding library support, Wkhtmltopdf is a very good tool, it can be used for multi-platform HTML to PDF conversion, Pdfkit is wkhtmltopdf python package. Install the following dependency packages first,

Then install Wkhtmltopdf

Pip Install requestspip install Beautifulsouppip install Pdfkit

Installing Wkhtmltopdf

Windows platform directly on the Wkhtmltopdf official website 2 download stable version of the installation, after the installation of the program to add the execution path to the system environment $PATH variables, otherwise pdfkit can not find wkhtmltopdf error "no Wkhtmltopdf E Xecutable found ". Ubuntu and CentOS can be installed directly from the command line

$ sudo apt-get install wkhtmltopdf # ubuntu$ sudo yum intsall wkhtmltopdf   # CentOS

Crawler implementation

When everything is ready, you can put the code on, but before you write the code, you should tidy up your thoughts. The purpose of the program is to save all the HTML body parts of the URL to local, and then use Pdfkit to convert the files into a PDF file. We split the task, the first is to save a URL corresponding to the HTML body to local, and then find all the URL to perform the same operation.

To find the label of the body part of the page in Chrome, press F12 to find the P tag for the body: the <p > P is the body of the page. Once the entire page has been loaded locally with requests, you can use BeautifulSoup to manipulate the DOM elements of the HTML to extract the contents of the body.


The specific implementation code is as follows: Use the Soup.find_all function to find the body label, and then save the contents of the body part to the a.html file.

def parse_url_to_html (URL):  response = requests.get (URL)  soup = BeautifulSoup (response.content, "Html5lib")  BODY = Soup.find_all (class_= "x-wiki-content") [0]  html = str (body)  with open ("a.html", ' WB ') as F:    F.write (HTML)

The second step is to parse out all the URLs on the left side of the page. In the same way, find the menu label on the left<ul >

Specific code implementation logic: Because the page has two Uk-nav Uk-nav-side class attribute, and the real directory list is the second one. All URLs get, URL to HTML function is also written in the first step.

Def get_url_list (): ""  gets all the URL directory list  "" "  response = Requests.get (" Http://www.liaoxuefeng.com/wiki /0014316089557264a6b348958f449949df42a6d3a2e542c000 ")  soup = BeautifulSoup (response.content," Html5lib ")  Menu_tag = Soup.find_all (class_= "Uk-nav uk-nav-side") [1]  urls = [] for  Li in Menu_tag.find_all ("Li"): C15/>url = "http://www.liaoxuefeng.com" + li.a.get (' href ')    urls.append (URL)  return URLs

The final step is to convert the HTML into a PDF file. Converting to PDF file is very simple, because pdfkit all the logic is encapsulated, you just need to call the function pdfkit.from_file

def save_pdf (htmls): "" "  convert all HTML files to PDF file" "  options = {    ' page-size ': ' Letter ',    ' Encoding ': "UTF-8",    ' Custom-header ': [      (' accept-encoding ', ' gzip ')    ]  }  pdfkit.from_file ( HTMLS, file_name, options=options)

To execute the Save_pdf function, the ebook PDF file is generated:

Summarize

The total amount of code added up to less than 50 lines, but, wait, actually the code given above omitted some details, for example, how to get the title of the article, the content of the IMG tag using the relative path, if you want to display the picture in the PDF will need to change the relative path to absolute path, and the saved HTML Temporary files are deleted, and these details are CE on GitHub.

"Recommended"

1. Python Free video tutorial

2. Python Object-oriented video tutorial

3. Python Learning Manual

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.