[Python learning] Emulating a browser to download CSDN article source and back it up as PDF



Recently I wanted to back up my own blog and looked at two tools. One is the CSDN blog export software, which no longer seems to work; the other is the Bean John Blog Backup Expert. Both felt too slow and inflexible, and downloading a single article separately takes too long. Since my graduation thesis is about natural language processing in Python, I decided to combine my earlier articles with Python to implement a few simple functions:
1. Download the blog article itself over the network, including images;
2. Convert the HTML into PDF format with Python;
3. If possible, I may later write an article about processing the code in a specific way.
Without further ado, the code is explained below in two parts.

I. Setting request headers to download CSDN article content


The Python code for fetching an article is shown below, using Han Han's Sina blog as the example. (The summary at the end of this article links to my earlier posts on Python crawlers.)


import urllib
content = urllib.urlopen("http://blog.sina.com.cn/s/blog_4701280b0102eo83.html").read()
open('blog.html', 'w+').write(content)
        However, many websites block this way of fetching pages. CSDN, for example, returns the following "403 Forbidden" HTML:

<html>
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx</center>
</body>
</html>
        In that case you can pose as a browser, either by setting request headers or by simulating a login. The code is as follows:
# coding: utf-8
# Python 2 code (urllib2 and cookielib exist only in Python 2)
import urllib2
import cookielib
import socket
import sys

# Class that downloads HTML while posing as a browser
class GetInfoByBrowser:

    # Initialization
    # Common error: "AttributeError: ... instance has no attribute 'opener'"
    # means __init__ was not written with double underscores on both sides
    def __init__(self):
        socket.setdefaulttimeout(20)
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0'}
        self.cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
        self.opener = urllib2.build_opener(self.cookie_support, urllib2.HTTPHandler)

    # Simulated login: open the URL through a cookie-aware opener
    def openurl(self, url):
        urllib2.install_opener(self.opener)
        # addheaders expects (name, value) string pairs, so pass the header string, not the dict
        self.opener.addheaders = [("User-agent", self.headers['User-Agent']), ("Accept", "*/*"), ('Referer', 'http://www.google.com')]
        try:
            result = self.opener.open(url)
            content = result.read()
            open('openurl.html', 'w+').write(content)
            print content
            print 'Open Succeed !!!'
        except Exception, e:
            print "Exception:", e
        else:
            return result

    # GET request with an added request header, disguised as a browser
    def geturl(self, get_url):
        result = ""
        try:
            req = urllib2.Request(url=get_url, headers=self.headers)
            result = urllib2.urlopen(req).read()
            open('geturl.html', 'w+').write(result)
            encoding = sys.getfilesystemencoding()
            print result.decode("UTF-8").encode(encoding)  # Prevent garbled Chinese characters
            print 'Get Succeed !!!'
        except Exception, e:
            print "Exception:", e
        else:
            return result

# Call this class to get the HTML
print unicode('Calling the simulated-login function openurl:', 'utf-8')
print unicode('First method openurl:', 'utf-8')
getHtml = GetInfoByBrowser()
getHtml.openurl("http://blog.csdn.net/eastmount/article/details/39770543")

print unicode('Second method geturl:', 'utf-8')
getHtml.geturl("http://blog.csdn.net/eastmount/article/details/39770543")
        The result is a download of my article "[Python learning] Simple web crawler crawls blog posts and ideas". Both methods produce the same content, saved as the two files openurl.html and geturl.html. The approach draws on Python classes and functions and on the urllib2 and cookielib modules.
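        For readers on Python 3, where urllib2 and cookielib have become urllib.request and http.cookiejar, the same idea can be sketched as follows. This is only a minimal sketch, not the code used for this article: the helper names build_request, build_opener_with_cookies and fetch are made up for illustration, and the header values are simply copied from the class above.

```python
import urllib.request
from http.cookiejar import CookieJar

# Assumed header values: any realistic browser User-Agent string should do.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0",
    "Accept": "*/*",
}

def build_request(url, headers=BROWSER_HEADERS):
    """Return a Request carrying browser-like headers (no network access yet)."""
    return urllib.request.Request(url, headers=headers)

def build_opener_with_cookies():
    """Return an opener that keeps cookies between requests, like openurl above."""
    return urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )

def fetch(url, timeout=20):
    """Download url and return the raw bytes; raises urllib.error.URLError on failure."""
    opener = build_opener_with_cookies()
    with opener.open(build_request(url), timeout=timeout) as response:
        return response.read()
```

        Calling fetch() on the article URL and writing the returned bytes to a file opened in 'wb' mode avoids the decode and encode step, because the page is saved exactly as it was received.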



        Three similar articles are recommended; the POST method works in much the same way:
        [Python] Let's write a Python crawler tool class whyspider, by Wang Hai
        Write a crawler in python to crawl the content of csdn, perfect solution 403 Forbidden
        urllib2.HTTPError: HTTP Error 403: Forbidden

II. Converting HTML to PDF to back up articles
        A declaration first: this part of the implementation ultimately ended in failure. I may keep studying it later. Partly I have been busy recently, and partly I lack Linux experience and a solid grasp of Python. I still wanted to write this part up because some of it may help you. My apologies!

1. HTML-to-PDF approaches
        Searching online, I found two common Python libraries for converting to PDF, plus a third option based on external software:
        Method 1: Call the PDF report library Reportlab to generate PDF from online web pages
        This library is not part of Python's standard library, so you must download and install the package manually. Because converting images to PDF is involved, you also need the Python Imaging Library (PIL).
        Reference article: Python to grab HTML, extract data, analyze, and draw PDF graphics

        Method 2: Convert HTML to PDF with the xhtml2pdf and pisa libraries
        This method converts static HTML to PDF. The core code below converts the local static page "1.htm" to "test.pdf". This is the approach I tried.
# -*- coding: utf-8 -*-
import sx.pisa3 as pisa
data = open('1.htm').read()
result = file('test.pdf', 'wb')
pdf = pisa.CreatePDF(data, result)
result.close()
pisa.startViewer('test.pdf')
        Reference article: Python implementation code to convert HTML to PDF (including Chinese)

        Method 3: Call the third-party tool wkhtmltopdf
        Few articles show detailed Python code for calling this third-party tool; most drive it by typing commands. The following three articles cover wkhtmltopdf.
        Reference article: HTML to PDF tool: wkhtmltopdf
                         [php] The solution to batch html to pdf file research
                         wkhtmltopdf generate pdf with cover, header, footer, table of contents
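        Since Method 3 is usually shown only as a typed command, here is a rough Python 3 sketch of calling the tool from a script. It assumes wkhtmltopdf is installed and on the PATH, and it uses only the basic command form "wkhtmltopdf <input> <output>". The function names are invented for illustration, and I have not used this to produce a backup.

```python
import shutil
import subprocess

def build_wkhtmltopdf_command(source, output_pdf, executable="wkhtmltopdf"):
    """Return the argument list for: wkhtmltopdf <source> <output_pdf>."""
    return [executable, source, output_pdf]

def html_to_pdf(source, output_pdf):
    """Run wkhtmltopdf; return True on success, False if the tool is missing or fails."""
    if shutil.which("wkhtmltopdf") is None:  # not installed or not on PATH
        return False
    completed = subprocess.run(build_wkhtmltopdf_command(source, output_pdf))
    return completed.returncode == 0
```

        The source argument can be a local file such as geturl.html or a URL. Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.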

2. Installing and introducing pip
        Next I implement HTML-to-PDF conversion with the xhtml2pdf and pisa libraries. First, pip has to be installed. As xifeijian put it: "As a Python enthusiast, if you don't know either easy_install or pip, then ...".
        easy_install plays a role similar to cpan in Perl and gem in Ruby: each offers a foolproof way to install modules online. pip is an improved version of easy_install that gives better messages and adds features such as uninstalling packages. Older versions of Python ship only easy_install, not pip. Common usage is as follows:
easy_install
1) Install a package
 $ easy_install <package_name>
 $ easy_install "<package_name>==<version>"
2) Upgrade a package
 $ easy_install -U "<package_name>>=<version>"

pip
1) Install a package
 $ pip install <package_name>
 $ pip install <package_name>==<version>
2) Upgrade a package (if no version number is provided, upgrade to the latest version)
 $ pip install --upgrade "<package_name>>=<version>"
3) Delete a package
 $ pip uninstall <package_name>
        Step 1: Download pip
        You can download it from the official site http://pypi.python.org/pypi/pip#downloads, then cd into the pip directory and install it with python setup.py install. I instead downloaded pip-Win_1.7.exe and installed that, from:
        https://sites.google.com/site/pydatalog/python/pip-for-windows
        Step 2: Install pip
        When the prompt "pip and virtualenv installed" appears, the installation has succeeded. The next question is how to test that pip really works.
        Step 3: Configure environment variables
        At this point, typing the pip command in cmd produces the error "Not an internal or external command", so the PATH environment variable needs updating. After pip is installed, a Scripts directory appears under the Python installation directory. Add that Scripts directory to the PATH environment variable.
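        To check whether the Scripts directory has really made it into PATH, a small Python 3 sketch like the following may help. The helper names is_on_path and scripts_dir are hypothetical, and the comparison is only a simple normalised string match.

```python
import os

def is_on_path(directory, path_value, sep=os.pathsep):
    """Return True if directory is one of the entries in a PATH-style string."""
    wanted = os.path.normcase(os.path.normpath(directory))
    entries = [entry for entry in path_value.split(sep) if entry]
    return any(os.path.normcase(os.path.normpath(entry)) == wanted
               for entry in entries)

def scripts_dir(python_home):
    """Scripts directory under a Python installation directory."""
    return os.path.join(python_home, "Scripts")
```

        For example, is_on_path(scripts_dir(sys.prefix), os.environ.get("PATH", "")) (after import sys) reports whether the running interpreter's Scripts directory is visible to cmd, assuming sys.prefix points at the installation directory.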



        Step 4: Use pip commands
        pip can now be used in cmd. For example, "pip list --outdated" lists version information for installed libraries that have a newer release available.

        Commonly used pip commands are as follows (see the article "pip installation and use"):
Usage:
  pip <command> [options]

Commands:
  install      Install packages.
  uninstall    Uninstall packages.
  freeze       Output a list of installed packages in a fixed format.
  list         List installed packages.
  show         Show details about installed packages.
  search       Search for packages, similar to search in yum.
  wheel        Build wheels from your requirements.
  zip          Not recommended. Zip individual packages.
  unzip        Not recommended. Unzip individual packages.
  bundle       Not recommended. Create pybundles.
  help         Show help for commands.

General Options:
  -h, --help                Show help.
  -v, --verbose             More output; can be used up to 3 times.
  -V, --version             Show version information and exit.
  -q, --quiet               Minimal output.
  --log-file <path>         Verbose error log that is overwritten on each run. The default file is /root/.pip/pip.log
  --log <path>              Verbose log that is appended to rather than overwritten.
  --proxy <proxy>           Specify a proxy in the form [user:passwd@]proxy.server:port.
  --timeout <sec>           Connection timeout (default 15 seconds).
  --exists-action <action>  Default action when a path already exists: (s)witch, (i)gnore, (w)ipe, (b)ackup.
  --cert <path>             Certificate.

3. Installing xhtml2pdf and pisa
        Install the xhtml2pdf and pisa libraries with pip:
        xhtml2pdf 0.0.6: https://pypi.python.org/pypi/xhtml2pdf/
        pisa 3.0.33: https://pypi.python.org/pypi/pisa/
        Then install it with:
            pip install xhtml2pdf
            pip install pisa
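        After the two installs, it can be useful to confirm which modules Python can actually find before running the conversion code. The Python 3 sketch below is only an illustration: find_missing is a made-up helper, and it is meant for top-level module names such as "xhtml2pdf", "sx", "reportlab" and "html5lib".

```python
import importlib.util

def find_missing(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]
```

        Calling find_missing(["xhtml2pdf", "sx", "reportlab", "html5lib"]) returns the names that still need installing. An empty list only means the modules were found, not that their versions are compatible.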


        Reference articles:
        Install the Python library pisa for converting HTML5 to PDF
        Install the Python library for converting matplotlib data to graphics

4. Reasons for failure
        Running the HTML-to-PDF code before the pisa library is installed raises an error:
                >>>
                Traceback (most recent call last):
                  File "G:/software/Program software/Python/python insert/HtmlToPDF.py", line 12, in <module>
                ImportError: No module named sx.pisa3
        After installation the missing-module message no longer appears, but the HTML-to-PDF code now reports a different error:
************************************************
IMPORT ERROR!
Reportlab Version 2.1+ is needed!
************************************************

The following Python packages are required for PISA:
- Reportlab Toolkit >= 2.2 <http://www.reportlab.org/>
- HTML5lib >= 0.11.1 <http://code.google.com/p/html5lib/>

Optional packages:
- pyPDF <http://pybrary.net/pyPdf/>
- PIL <http://www.pythonware.com/products/pil/>

Traceback (most recent call last):
  File "G:\software\Program software\Python\python insert\HtmlToPDF.py", line 5, in <module>
    import sx.pisa3 as pisa
...

raise ImportError("Reportlab Version 2.1+ is needed!")
ImportError: Reportlab Version 2.1+ is needed!
        The cause is the version check that runs on "import sx.pisa3 as pisa". It requires Reportlab 2.1 or later, but it tests this by looking for a version string that starts with "2", so the installed version 3.1.44 is rejected:

>>> import reportlab
>>> print reportlab.Version
3.1.44
>>>
        A lot of searching did not solve the problem. The most commonly suggested fix is to edit the following code in the sx\pisa3\pisa_util.py file under the pisa installation directory:
if not (reportlab.Version[0] == "2" and reportlab.Version[2] >= "1"):
    raise ImportError("Reportlab Version 2.1+ is needed!")

REPORTLAB22 = (reportlab.Version[0] == "2" and reportlab.Version[2] >= "2")
        The modified code is as follows:
if not (reportlab.Version[:3] >= "2.1"):
    raise ImportError("Reportlab Version 2.1+ is needed!")

REPORTLAB22 = (reportlab.Version[:3] >= "2.1")
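        To see what the two checks actually do with a version string such as "3.1.44", they can be reproduced as small standalone functions. This is a sketch for experimenting in Python 3, not part of pisa: original_check and patched_check mirror the two snippets above, while numeric_check is a hypothetical alternative that compares the numbers instead of the characters and assumes the version has at least a major and a minor part.

```python
def original_check(version):
    """The shipped check: the first character must be "2"."""
    return version[0] == "2" and version[2] >= "1"

def patched_check(version):
    """The suggested patch: compare the first three characters as text."""
    return version[:3] >= "2.1"

def numeric_check(version, minimum=(2, 1)):
    """Hypothetical alternative: compare (major, minor) as integers."""
    major, minor = version.split(".")[:2]
    return (int(major), int(minor)) >= minimum
```

        With "3.1.44" the original check fails and the patched one passes. The patched check is still a text comparison, so a version such as "10.0" would be rejected by it, which is why the numeric variant may be the safer choice.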

        This still did not solve the problem, so I could not verify the code or go on to implement the HTML-to-PDF conversion. I also consulted a number of foreign resources:
        xhtml2pdf ImportError-Django from stackoverflow
        https://github.com/stephenmcd/cartridge/issues/174
        https://groups.google.com/forum/#!topic/xhtml2pdf/mihS51DtZkU
        http://linux.m2osw.com/xhtml2pdf-generating-error-under-1404


III. Summary
        Finally, a brief summary. The main goal of this article was to download articles from CSDN as static HTML pages and then use a third-party Python library to convert them to PDF as a backup, but the pisa import failed and the conversion could not be completed. You may be disappointed, and I am sorry. Still, a few things can be learned from the article:
        1. How to fetch content that returns 403 Forbidden from Python by sending browser-like request headers, using two methods: geturl and openurl.
        2. How to configure pip, which makes installing third-party libraries more convenient, along with some of the configuration steps;
        3. Some ideas for converting HTML to PDF.
        Finally, I recommend my earlier Python crawler articles, which may give you some ideas. They fall well short of open source software, but articles and resources in this area are relatively scarce, so I hope they offer at least a little inspiration.
        [Python learning] Topic 1. Basic knowledge of functions
        [Python learning] Topic 2. Basic knowledge of conditional statements and loop statements
        [Python learning] Topic 3. Basic knowledge of strings
        [Python learning] Simple web crawler crawls blog posts and ideas
        [Python learning] Simple crawling of Wikipedia programming language infoboxes
        [Python learning] Simple crawling of pictures in a picture website gallery
        [python knowledge] BeautifulSoup library installation and brief introduction of crawler knowledge
        [python + nltk] A brief introduction to natural language processing and NLTK environment configuration and entry knowledge (1)

        If you have a good solution to "Reportlab Version 2.1+ is needed!", please tell me; I would be grateful. I intend to keep studying this feature, ideally without relying on a third-party library. A little cheer for myself.
        Finally, I hope the article helps you. If there are shortcomings or errors, please bear with me.
        (By: Eastmount 2015-5-17 at 3 am http://blog.csdn.net/eastmount/)
