Python: Use PyPDF2 to merge, split, and encrypt PDF files

Source: Internet
Author: User
Tags: decrypt, blank page


A friend needed to split a PDF file. Searching online, I found that PyPDF2 can do this, so I studied the library and took some notes. Note first that PyPDF2 is the version for Python 3; for Python 2 there is a corresponding pyPdf library.



You can install it directly with pip:

pip install PyPDF2


Official documentation: https://pythonhosted.org/PyPDF2/



The library mainly provides these classes:

PdfFileReader



This class mainly provides read operations on PDF files. Its constructor is:

PdfFileReader(stream, strict=True, warndest=None, overwriteWarnings=True)

The first parameter can be a file stream or a file path. The remaining three parameters control how warnings are handled; the defaults can be used as they are.



Once you have an instance, you can operate on the PDF. The main operations are:

decrypt(password): if the PDF file is encrypted, use this method to decrypt it.

getDocumentInfo(): retrieves some information about the PDF file. The return value is a DocumentInformation object; printing it directly gives output similar to the following:



{'/ModDate': "D:20150310202949-07'", '/Title': '', '/Creator': 'LaTeX with hyperref package', '/CreationDate': "D:20150310202949-07'", '/PTEX.Fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.15 (TeX Live 2014/MacPorts 2014_6) kpathsea version 6.2.0', '/Producer': 'pdfTeX-1.40.15', '/Keywords': '', '/Trapped': '/False', '/Author': '', '/Subject': ''}



getNumPages(): returns the number of pages in the PDF file.

getPage(pageNumber): returns the page at index pageNumber in the PDF file (numbering starts at 0) as a PageObject instance. Once you have a PageObject, you can add it to or insert it into another PDF, and so on.

getPageNumber(page): the inverse of the method above; pass in a PageObject instance and get its page number within the PDF file.

getOutlines(node=None, outlines=None): retrieves the document outline present in the document.

isEncrypted: records whether the PDF is encrypted. If the file itself is encrypted, this still returns True even after the decrypt method has been used.

numPages: the total number of pages in the PDF; a read-only property equivalent to calling getNumPages().

PdfFileWriter

This class supports writing PDF files. Typically you read some PDF data with PdfFileReader and then process it with this class.

Creating an instance of this class requires no parameters.



The main methods are as follows:

addAttachment(fname, fdata): embeds a file in the PDF.

addBlankPage(width=None, height=None): appends a blank page to the end of the PDF. If no size is specified, the size of the last page of the current writer is used.

addPage(page): adds a page to the PDF; usually the page is obtained from a reader as described above.

appendPagesFromReader(reader, after_page_append=None): copies the pages in reader to the current writer instance. If after_page_append is specified, it is a callback function that is invoked after each page is appended, with the newly added writer page passed in.

encrypt(user_pwd, owner_pwd=None, use_128bit=True): encrypts the PDF. According to the official documentation, user_pwd lets the user open the PDF file with some restricted permissions, meaning there may be limits on what can be done with that password, but I did not find anything in the documentation for setting those permissions. owner_pwd, by contrast, allows unrestricted use. The third parameter selects whether to use 128-bit encryption.

getNumPages(): returns the number of pages in the PDF.

getPage(pageNumber): returns the page at the given page number as a PageObject, which you can add with the addPage method above.

insertPage(page, index=0): adds a page to the PDF, where index specifies the position at which it is inserted.

write(stream): writes the contents of the writer to a file.

PdfFileMerger

This class is used to merge PDF files. Its constructor takes one parameter: PdfFileMerger(strict=True). This parameter is described further below.



Common methods:

addBookmark(title, pagenum, parent=None): adds a bookmark to the PDF. title is the caption of the bookmark and pagenum is the page that the bookmark points to.

append(fileobj, bookmark=None, pages=None, import_bookmarks=True): appends the specified fileobj to the end of the file. bookmark optionally adds a bookmark at the start of the appended file; pages can be a (start, stop[, step]) tuple or a page range, so that only the specified range of pages from fileobj is added.

merge(position, fileobj, bookmark=None, pages=None, import_bookmarks=True): similar to the append method, but the position parameter specifies where the file is added.

write(fileobj): writes the data to a file.

To use it, create a PdfFileMerger instance, add the PDF files you want to merge with append or merge, and then save the result with write.





from PyPDF2 import PdfFileMerger


def merge_pdf():
    # Create an instance for merging files
    pdf_merger = PdfFileMerger()

    # Add the week1_1.pdf file first
    pdf_merger.append('week1_1.pdf')
    # Then insert the ex1.pdf file at position 0
    pdf_merger.merge(0, 'ex1.pdf')
    # Add a bookmark
    pdf_merger.addBookmark('This is a bookmark', 1)
    # Write the result to a file
    pdf_merger.write('merge_pdf.pdf')


Now a closer look at the strict parameter in PdfFileMerger(strict=True).

The official explanation of this parameter:

strict (bool) – Determines whether user should be warned of all problems and also causes some correctable problems to be fatal. Defaults to True.

In other words, it determines whether the user is warned of all problems, and it also makes some otherwise correctable problems fatal.

At first I took this parameter to be only about warning users of errors and simply used the default, but when I tried to merge PDFs containing Chinese, the following error occurred:





Traceback (most recent call last):
  File "I:\python3.5\lib\site-packages\PyPDF2\generic.py", line 484, in readFromStream
    return NameObject(name.decode('utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc8 in position 10: invalid continuation byte

During handling of the above exception, another exception occurred:

PyPDF2.utils.PdfReadError: Illegal character in Name Object


The package source decodes with UTF-8, which fails here. I tried modifying the source to use GBK instead, but that produced other errors. Finally, after setting strict to False in the constructor, the console printed only the following warning:

PdfReadWarning: Illegal character in Name Object [generic.py:489]

The two files were merged successfully, however. A quick look at the merged files showed inconsistent results: running the same code several times, the Chinese is sometimes handled correctly and sometimes garbled.
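The UnicodeDecodeError in the traceback can be reproduced without PyPDF2: bytes produced by a GBK encoder are generally not valid UTF-8. The sketch below only illustrates that failure and a simple fallback; decode_name is a hypothetical helper, not something PyPDF2 provides, and it is not a fix for the library.

```python
def decode_name(raw, encodings=('utf-8', 'gbk')):
    """Try each encoding in turn and return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            # This encoding does not fit the bytes; try the next one
            continue
    raise ValueError('none of %r could decode %r' % (encodings, raw))


# A name containing two Chinese characters, encoded as GBK
gbk_bytes = '/\u4e2d\u6587'.encode('gbk')
text, used = decode_name(gbk_bytes)
print(used)
```

Guessing an encoding this way is fragile, which may be part of why patching the library source to use GBK led to other errors.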



Besides the methods listed here there are others, for adding bookmarks, adding links, and so on; refer to the official documentation for those.


Merging, splitting, and encrypting PDFs

The example below covers encrypting, decrypting, merging, splitting by number of pages per file, and splitting into a given number of files:

Usage note: with files containing Chinese, the result may be garbled, yet after running the code a few more times the Chinese sometimes displays normally. The specific cause is unclear; it is rather mysterious...








# @Time : 2018/3/26 23:48
# @Author : Leafage
# @File : handlePDF.py
# @Software: PyCharm
# @Describe: Perform merge, split, and cryptographic operations on pdf files.
from PyPDF2 import PdfFileReader, PdfFileMerger, PdfFileWriter


def get_reader(filename, password):
    try:
        old_file = open(filename, 'rb')
    except IOError as err:
        print('File open failed! ' + str(err))
        return None

    # Create a reader instance
    pdf_reader = PdfFileReader(old_file, strict=False)

    # Decryption operation
    if pdf_reader.isEncrypted:
        if password is None:
            print('%s file is encrypted, requires a password!' % filename)
            old_file.close()
            return None
        # decrypt() returns 0 on failure, 1 for the user password, 2 for the owner password
        if pdf_reader.decrypt(password) == 0:
            print('%s password is incorrect!' % filename)
            old_file.close()
            return None
    # Do not close old_file here: the reader loads pages lazily from the
    # stream, so it must stay open for as long as the reader is in use.
    return pdf_reader


def encrypt_pdf(filename, new_password, old_password=None, encrypted_filename=None):
    """
    Encrypt the file corresponding to filename and generate a new file
    :param filename: the path corresponding to the file
    :param new_password: the password used to encrypt the file
    :param old_password: if the old file is encrypted, its password is required
    :param encrypted_filename: the name of the file after encryption; defaults to filename_encrypted
    :return:
    """
    # Create a reader instance
    pdf_reader = get_reader(filename, old_password)

    if pdf_reader is None:
        return

    # Create a writer instance
    pdf_writer = PdfFileWriter()
    # Copy the pages from the reader into the writer
    pdf_writer.appendPagesFromReader(pdf_reader)

    # Encrypt with the new password
    pdf_writer.encrypt(new_password)

    if encrypted_filename is None:
        # Use the old file name + '_encrypted' as the new file name
        encrypted_filename = filename.rsplit('.', 1)[0] + '_encrypted.pdf'

    with open(encrypted_filename, 'wb') as out_file:
        pdf_writer.write(out_file)


def decrypt_pdf(filename, password, decrypted_filename=None):
    """
    Decrypt the encrypted file and generate a pdf file without a password
    :param filename: the originally encrypted pdf file
    :param password: the corresponding password
    :param decrypted_filename: filename after decryption
    :return:
    """

    # Generate a reader and a writer
    pdf_reader = get_reader(filename, password)
    if pdf_reader is None:
        return
    if not pdf_reader.isEncrypted:
        print('The file is not encrypted, no action!')
        return
    pdf_writer = PdfFileWriter()

    pdf_writer.appendPagesFromReader(pdf_reader)

    if decrypted_filename is None:
        decrypted_filename = filename.rsplit('.', 1)[0] + '_decrypted.pdf'

    # Write the new file
    with open(decrypted_filename, 'wb') as out_file:
        pdf_writer.write(out_file)


def split_by_pages(filename, pages, password=None):
    """
    Split the file evenly by number of pages
    :param filename: the name of the file to be split
    :param pages: number of pages in each file after splitting
    :param password: if the file is encrypted, the password needed to decrypt it
    :return:
    """
    # Get a reader
    pdf_reader = get_reader(filename, password)
    if pdf_reader is None:
        return
    # Get the total number of pages
    pages_nums = pdf_reader.numPages

    if pages <= 1:
        print('Each file must contain more than 1 page!')
        return

    # Number of pdf files after splitting (rounded up)
    pdf_num = pages_nums // pages + (1 if pages_nums % pages else 0)

    print('pdf file is divided into %d copies, each with %d pages!' % (pdf_num, pages))

    # Generate the pdf files in turn
    for cur_pdf_num in range(1, pdf_num + 1):
        # Create a new writer instance
        pdf_writer = PdfFileWriter()
        # Generate the corresponding file name (original name without extension + index)
        split_pdf_name = filename.rsplit('.', 1)[0] + '_' + str(cur_pdf_num) + '.pdf'
        # Calculate the current starting position
        start = pages * (cur_pdf_num - 1)
        # Calculate the end position: the last file ends at the last page,
        # otherwise use pages per file * number of files produced so far
        end = pages * cur_pdf_num if cur_pdf_num != pdf_num else pages_nums
        # print(str(start) + ',' + str(end))
        # Read the corresponding pages in turn
        for i in range(start, end):
            pdf_writer.addPage(pdf_reader.getPage(i))
        # Write the file
        with open(split_pdf_name, 'wb') as out_file:
            pdf_writer.write(out_file)


def split_by_num(filename, nums, password=None):
    """
    Divide the pdf file into nums copies
    :param filename: filename
    :param nums: number of copies to divide into
    :param password: if decryption is needed, the password
    :return:
    """
    pdf_reader = get_reader(filename, password)
    if not pdf_reader:
        return

    if nums < 2:
        print('Number of copies cannot be less than 2!')
        return

    # Get the total number of pages in the pdf
    pages = pdf_reader.numPages

    if pages < nums:
        print('The number of copies should not be greater than the total number of pages in pdf!')
        return

    # Calculate how many pages should be in each copy
    each_pdf = pages // nums

    print('pdf has %d pages, divided into %d copies, each with %d pages!' % (pages, nums, each_pdf))

    for num in range(1, nums + 1):
        pdf_writer = PdfFileWriter()
        # Generate the corresponding file name (original name without extension + index)
        split_pdf_name = filename.rsplit('.', 1)[0] + '_' + str(num) + '.pdf'
        # Calculate the current starting position
        start = each_pdf * (num - 1)
        # Calculate the end position: the last copy ends at the last page,
        # otherwise use pages per copy * number of copies produced so far
        end = each_pdf * num if num != nums else pages
        print(str(start) + ',' + str(end))
        for i in range(start, end):
            pdf_writer.addPage(pdf_reader.getPage(i))
        with open(split_pdf_name, 'wb') as out_file:
            pdf_writer.write(out_file)


def merger_pdf(filenames, merged_name, passwords=None):
    """
    Pass in a list of files and combine them in turn
    :param filenames: list of files
    :param merged_name: the name of the merged file
    :param passwords: corresponding password list
    :return:
    """
    # Calculate how many files there are in total
    filenums = len(filenames)
    # Note that strict must be set to False
    pdf_merger = PdfFileMerger(strict=False)

    for i in range(filenums):
        # Get the password
        if passwords is None:
            password = None
        else:
            password = passwords[i]
        pdf_reader = get_reader(filenames[i], password)
        if not pdf_reader:
            return
        # append adds to the end by default
        pdf_merger.append(pdf_reader)

    with open(merged_name, 'wb') as out_file:
        pdf_merger.write(out_file)


def insert_pdf(pdf1, pdf2, insert_num, merged_name, password1=None, password2=None):
    """
    Insert the whole of pdf2 at page insert_num of pdf1
    :param pdf1: pdf1 file name
    :param pdf2: pdf2 file name
    :param insert_num: page position at which to insert
    :param merged_name: the name of the merged file
    :param password1: password corresponding to pdf1
    :param password2: password corresponding to pdf2
    :return:
    """
    pdf1_reader = get_reader(pdf1, password1)
    pdf2_reader = get_reader(pdf2, password2)

    # If either file cannot be opened, return
    if not pdf1_reader or not pdf2_reader:
        return
    # Get the total number of pages in pdf1
    pdf1_pages = pdf1_reader.numPages
    if insert_num < 0 or insert_num > pdf1_pages:
        print('The insertion position is abnormal, the page you want to insert at is: %d, pdf1 file total: %d pages!' % (insert_num, pdf1_pages))
        return
    # Note that strict must be set to False; Chinese text may still come out garbled.
    m_pdf = PdfFileMerger(strict=False)
    m_pdf.append(pdf1)
    m_pdf.merge(insert_num, pdf2)
    with open(merged_name, 'wb') as out_file:
        m_pdf.write(out_file)


if __name__ == '__main__':
    # encrypt_pdf('ex1.pdf', 'leafage')
    # decrypt_pdf('ex1123_encrypted.pdf', 'leafage')
    # split_by_pages('ex1.pdf', 5)
    split_by_num('ex2.pdf', 3)
    # merger_pdf(['ex1.pdf', 'ex2.pdf'], 'merger.pdf')
    # insert_pdf('ex1.pdf', 'ex2.pdf', 10, 'pdf12.pdf')
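
The page arithmetic used by the two splitting functions can be checked on its own, without touching any PDF. The following is a minimal sketch with hypothetical helper names (ranges_by_pages and ranges_by_count); it returns half-open (start, end) index ranges of the kind the loops above pass to range.

```python
def ranges_by_pages(total_pages, pages_per_file):
    """Half-open (start, end) ranges with at most pages_per_file pages each."""
    if pages_per_file < 1:
        raise ValueError('pages_per_file must be at least 1')
    return [(start, min(start + pages_per_file, total_pages))
            for start in range(0, total_pages, pages_per_file)]


def ranges_by_count(total_pages, parts):
    """Split into `parts` ranges; the last range absorbs any remainder."""
    if not 1 <= parts <= total_pages:
        raise ValueError('parts must be between 1 and total_pages')
    each = total_pages // parts
    return [(each * i, each * (i + 1) if i < parts - 1 else total_pages)
            for i in range(parts)]


print(ranges_by_pages(12, 5))   # [(0, 5), (5, 10), (10, 12)]
print(ranges_by_count(10, 3))   # [(0, 3), (3, 6), (6, 10)]
```

Separating the arithmetic from the file handling makes edge cases, such as a page count that does not divide evenly, easy to test.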