Python crawler: Convert Liao Xuefeng's tutorial to a PDF ebook


It seems there is no language better suited to writing crawlers than Python: the community offers a dazzling array of crawler tools, and all kinds of libraries can be used directly, so a crawler can be written in minutes. Today I'll try writing one that downloads Liao Xuefeng's Python tutorial and turns it into a PDF ebook for easy offline reading.

Before writing the crawler, let's analyze the structure of the site's pages. The left side of each page holds the tutorial's directory outline, and each URL corresponds to an article on the right. The right side shows the article title at the top and the article body in the middle; the body is the main content we care about, since the data we want to crawl is the body of every page. Below the body is the user comment area, which is of no use to us, so we can ignore it.

Tool Preparation

After figuring out the site's basic structure, you can start preparing the packages the crawler depends on. requests and BeautifulSoup are the two great artifacts of crawling: requests handles network requests, and BeautifulSoup manipulates HTML data. With these two tools the job gets done neatly; we don't need a crawler framework like Scrapy here, as using one in such a small program would be overkill. In addition, since we're converting HTML files to PDF, we need library support for that as well: wkhtmltopdf is an excellent tool for HTML-to-PDF conversion on multiple platforms, and pdfkit is its Python wrapper. Install the following dependency packages first, then install wkhtmltopdf.

pip install requests
pip install beautifulsoup4
pip install pdfkit
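All the code snippets below assume these imports are in place (a minimal sketch):

import pdfkit
import requests
from bs4 import BeautifulSoup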

Installing wkhtmltopdf

On Windows, download the stable version from the wkhtmltopdf website and install it, then add the program's installation path to the system $PATH environment variable; otherwise pdfkit will fail with the error "No wkhtmltopdf executable found". On Ubuntu and CentOS it can be installed directly from the command line:

$ sudo apt-get install wkhtmltopdf # Ubuntu

$ sudo yum install wkhtmltopdf # CentOS
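If pdfkit still cannot locate the binary (common on Windows when $PATH was not updated), you can also pass the executable path explicitly via pdfkit.configuration. A minimal sketch, assuming a hypothetical install location (adjust the path to your machine):

import pdfkit

# Hypothetical Windows install path -- adjust to wherever wkhtmltopdf.exe lives.
config = pdfkit.configuration(
    wkhtmltopdf=r"C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe")

# Smoke test: render a trivial snippet to PDF to confirm the toolchain works.
pdfkit.from_string("<h1>hello</h1>", "test.pdf", configuration=config)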

Crawler Implementation

With everything ready, we can get to the code, but before writing it, let's organize our thoughts. The goal of the program is to save the HTML body of every URL locally and then use pdfkit to convert those files into a single PDF. Let's split the task in two: first, save the HTML body of one URL to a local file; then, find all the URLs and perform the same operation on each.

Use Chrome to locate the tag wrapping the page body: press F12 and find the corresponding div tag, <div class="x-wiki-content">; this div is the body of the page. Once the entire page has been loaded locally with requests, you can use BeautifulSoup to manipulate the HTML DOM elements and extract the body content.

The concrete implementation is as follows: use the soup.find_all function to find the body tag, then save the contents of the body to the file a.html.

def parse_url_to_html(url):
    # Download the page and extract the article body div.
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    body = soup.find_all(class_="x-wiki-content")[0]
    html = str(body)
    # Encode explicitly because the file is opened in binary mode.
    with open("a.html", 'wb') as f:
        f.write(html.encode("utf-8"))
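As a side note, a slightly more defensive equivalent of the find_all(...)[0] line uses soup.find, which returns the first match directly, or None if nothing matches. This is my own sketch, not the original code:

# Equivalent selection: find() returns the first match, or None if absent.
body = soup.find(class_="x-wiki-content")
if body is None:
    raise ValueError("x-wiki-content div not found; the page layout may have changed")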

The second step is to parse out all the URLs in the menu on the left side of the page. Using the same approach, find the left-hand menu tag: <ul class="uk-nav uk-nav-side">.

The implementation logic: the page has two elements with the uk-nav uk-nav-side class attribute, and the real directory list is the second one. Once all the URLs are collected, the URL-to-HTML function written in the first step can be run on each of them.

def get_url_list():
    """
    Get the list of all URLs in the directory
    """
    response = requests.get("http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000")
    soup = BeautifulSoup(response.content, "html.parser")
    # The real directory list is the second element with this class.
    menu_tag = soup.find_all(class_="uk-nav uk-nav-side")[1]
    urls = []
    for li in menu_tag.find_all("li"):
        url = "http://www.liaoxuefeng.com" + li.a.get('href')
        urls.append(url)
    return urls

The final step is to convert the HTML into a PDF file. This part is very simple, because pdfkit encapsulates all the logic; you just need to call the function pdfkit.from_file.

def save_pdf(htmls, file_name):
    """
    Convert all HTML files to a single PDF file
    """
    options = {
        'page-size': 'Letter',
        'encoding': "UTF-8",
        'custom-header': [
            ('Accept-Encoding', 'gzip')
        ]
    }
    pdfkit.from_file(htmls, file_name, options=options)

Run the save_pdf function, and the ebook PDF file is generated.
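Putting the three steps together, a minimal driver might look like the sketch below. The numbered temporary file names and the output file name are my own illustrative choices, and parse_url_to_html would need a small tweak to accept an output file name instead of the hard-coded a.html:

def main():
    urls = get_url_list()
    htmls = []
    for index, url in enumerate(urls):
        html_file = str(index) + ".html"
        # Assumes parse_url_to_html was tweaked to take an output file name.
        parse_url_to_html(url, html_file)
        htmls.append(html_file)
    save_pdf(htmls, "liaoxuefeng_python.pdf")

if __name__ == "__main__":
    main()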

Summary

The total code adds up to less than 50 lines. But wait: the code given above actually omits some details, such as how to fetch the article title, the fact that the img tags in the body use relative paths (to display the images in the PDF you need to change the relative paths to absolute ones), and deleting the saved temporary HTML files. The full code is on GitHub.
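For the image problem, for example, one possible approach (my own sketch, not the author's GitHub code) is to rewrite each img tag's relative src to an absolute URL before saving the body:

from urllib.parse import urljoin

def fix_img_paths(body, page_url):
    # Rewrite relative image paths to absolute URLs so wkhtmltopdf can fetch them.
    for img in body.find_all("img"):
        src = img.get("src")
        if src and not src.startswith(("http://", "https://")):
            img["src"] = urljoin(page_url, src)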
