A friend needed to split a PDF file. A quick search showed that PyPDF2 can handle this, so I studied the library and took some notes. Note that PyPDF2 is the Python 3 version; for Python 2 there is a corresponding pyPdf library.
You can install it directly with pip:
pip install PyPDF2
Official documentation: https://pythonhosted.org/PyPDF2/
The library is built around these main classes:
PdfFileReader
This class provides read operations on PDF files. Its constructor is:
PdfFileReader(stream, strict=True, warndest=None, overwriteWarnings=True)
The first parameter can be a file stream or a file path. The other three parameters control how warnings are handled; the defaults are usually fine.
Once you have an instance, you can operate on the PDF. The main members are:
decrypt(password): If the PDF file is encrypted, use this method to decrypt it.
getDocumentInfo(): Retrieves the metadata of the PDF file. The return value is a DocumentInformation object; printing it gives output similar to the following:
{'/ModDate': "D:20150310202949-07'00'", '/Title': '', '/Creator': 'LaTeX with hyperref package', '/CreationDate': "D:20150310202949-07'00'", '/PTEX.Fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.15 (TeX Live 2014/MacPorts 2014_6) kpathsea version 6.2.0', '/Producer': 'pdfTeX-1.40.15', '/Keywords': '', '/Trapped': '/False', '/Author': '', '/Subject': ''}
getNumPages(): Returns the number of pages in the PDF file.
getPage(pageNumber): Returns the page at the given page number as a PageObject instance. Once you have a PageObject, you can add it to or insert it into another document.
getPageNumber(page): The reverse of the method above: pass in a PageObject instance and get back its page number within the PDF file.
getOutlines(node=None, outlines=None): Retrieves the document outline (bookmarks) present in the document.
isEncrypted: Indicates whether the PDF is encrypted. If the file itself is encrypted, this remains True even after decrypt has been called.
numPages: The total number of pages in the PDF; a read-only property equivalent to calling getNumPages().
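As a rough illustration of how these members fit together, the sketch below gathers a few facts from a reader object. It is written against the member names listed above and only assumes that the object passed in behaves like a PdfFileReader; the helper name summarize_reader and the dictionary keys it returns are made up for this example.

```python
def summarize_reader(reader, password=None):
    """Collect basic facts from an object that behaves like PdfFileReader.

    summarize_reader and the dictionary keys are hypothetical names used
    only for this illustration.
    """
    if reader.isEncrypted:
        # decrypt() returns 0 when the password does not match
        if password is None or reader.decrypt(password) == 0:
            return {'encrypted': True, 'readable': False}
    info = reader.getDocumentInfo()
    return {
        'encrypted': reader.isEncrypted,
        'readable': True,
        'pages': reader.getNumPages(),
        # The document information may be missing entirely
        'title': info.get('/Title') if info else None,
    }
```

With PyPDF2 installed, it could be called as summarize_reader(PdfFileReader(open('ex1.pdf', 'rb'))).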
PdfFileWriter
This class supports writing PDF files. Typically you read some PDF data with PdfFileReader and then process it with this class.
No parameters are needed to create an instance.
The main methods are:
addAttachment(fname, fdata): Embeds a file in the PDF.
addBlankPage(width=None, height=None): Appends a blank page to the end of the PDF. If no size is given, the size of the last page in the current writer is used.
addPage(page): Adds a page to the PDF; usually the page comes from a reader as described above.
appendPagesFromReader(reader, after_page_append=None): Copies the pages from reader into the current writer instance. If after_page_append is given, it is a callback invoked after each page is appended, receiving a reference to the page just added to the writer.
encrypt(user_pwd, owner_pwd=None, use_128bit=True): Encrypts the PDF. According to the official documentation, user_pwd lets a user open the PDF with the restrictions provided, meaning a file opened with this password may be subject to some limits, though I could not find anything in the documentation for setting those permissions. owner_pwd allows unrestricted use. The third parameter selects whether to use 128-bit encryption.
getNumPages(): Returns the number of pages in the PDF.
getPage(pageNumber): Returns the page at the given page number as a PageObject; pages can be added with the addPage method above.
insertPage(page, index=0): Inserts a page into the PDF; index specifies the position at which to insert it.
write(stream): Writes the contents of the writer to a file.
PdfFileMerger
This class is used to merge PDF files. Its constructor takes one parameter: PdfFileMerger(strict=True). This parameter is discussed further below.
Common methods:
addBookmark(title, pagenum, parent=None): Adds a bookmark to the PDF; title is the bookmark's caption and pagenum is the page the bookmark points to.
append(fileobj, bookmark=None, pages=None, import_bookmarks=True): Appends the file fileobj to the end of the output. bookmark is the text of a bookmark to place at the start of the appended file; pages can be a (start, stop[, step]) tuple or a page range, so that only that range of pages from fileobj is added.
merge(position, fileobj, bookmark=None, pages=None, import_bookmarks=True): Similar to append, but the position parameter specifies where the file is inserted.
write(fileobj): Writes the data to a file.
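The (start, stop[, step]) form of pages reads like the arguments to Python's built-in range. The sketch below only illustrates which zero-based page indices such a tuple would select under that assumption; selected_pages is a hypothetical helper, not part of PyPDF2.

```python
def selected_pages(pages, total_pages):
    """Return the zero-based page indices chosen by a (start, stop[, step]) tuple.

    Assumes the tuple is interpreted like the arguments to range();
    None means every page. Hypothetical helper for illustration only.
    """
    if pages is None:
        return list(range(total_pages))
    if not isinstance(pages, tuple):
        raise TypeError('pages must be a (start, stop[, step]) tuple')
    return list(range(*pages))


print(selected_pages((0, 3), 10))      # the first three pages
print(selected_pages((0, 10, 2), 10))  # every other page
```

Under that reading, a call such as pdf_merger.append('ex1.pdf', pages=(0, 3)) would add only the first three pages of ex1.pdf.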
To use it, create a PdfFileMerger instance, add the PDF files you want to combine with append or merge, and then save the result with write.
from PyPDF2 import PdfFileMerger

def merge_pdf():
    # Create an instance for merging files
    pdf_merger = PdfFileMerger()
    # First append week1_1.pdf
    pdf_merger.append('week1_1.pdf')
    # Then insert ex1.pdf at page position 0
    pdf_merger.merge(0, 'ex1.pdf')
    # Add a bookmark pointing to page 1
    pdf_merger.addBookmark('This is a bookmark', 1)
    # Write the result to a file
    pdf_merger.write('merge_pdf.pdf')
Now for the strict parameter of PdfFileMerger(strict=True).
The official explanation of this parameter:
strict (bool) – Determines whether user should be warned of all problems and also causes some correctable problems to be fatal. Defaults to True.
In other words, it determines whether the user is warned about all problems, and it also turns some correctable problems into fatal errors.
At first I assumed this parameter only controlled warnings and left it at the default, but when I tried to merge PDFs containing Chinese text, the following error occurred:
Traceback (most recent call last):
  File "I:\python3.5\lib\site-packages\PyPDF2\generic.py", line 484, in readFromStream
    return NameObject(name.decode('utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc8 in position 10: invalid continuation byte
During handling of the above exception, another exception occurred:
PyPDF2.utils.PdfReadError: Illegal character in Name Object
The library's source fails while decoding as UTF-8. I tried modifying the source to use GBK instead, but that led to other errors. Finally, after setting strict to False in the constructor, the console printed the following warning instead:
PdfReadWarning: Illegal character in Name Object [generic.py:489]
The two files were merged successfully, but the results were inconsistent: running the same code several times, the Chinese text sometimes came out correctly and sometimes came out garbled.
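The UnicodeDecodeError in the traceback is the ordinary result of decoding non-UTF-8 bytes as UTF-8. The sketch below reproduces that kind of failure in plain Python, assuming the name in the PDF was stored as GBK; it does not call PyPDF2, and the sample bytes are made up for illustration.

```python
# Bytes of a name containing Chinese text, encoded as GBK. This is an
# assumption about how such a PDF might store it; sample data only.
raw_name = '/中文'.encode('gbk')

try:
    raw_name.decode('utf-8')
except UnicodeDecodeError as err:
    # Same class of error as in the traceback above
    print('utf-8 failed:', err.reason)

# Decoding with the matching codec recovers the text
print(raw_name.decode('gbk'))

# A lenient decode never raises, but it substitutes replacement
# characters, which is one way text can end up garbled
print(raw_name.decode('utf-8', errors='replace'))
```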
Besides the methods listed here, there are others, such as adding bookmarks and adding links; see the official documentation.
Merging, splitting, and encrypting PDFs
The example below covers encrypting, decrypting, merging, splitting by pages per file, and splitting into a given number of parts.
Usage note: with files containing Chinese text, the result may come out garbled, yet after running the code a few more times the Chinese may display normally. I do not know the specific cause; it is rather mysterious...
# @Time : 2018/3/26 23:48
# @Author : Leafage
# @File : handlePDF.py
# @Software: PyCharm
# @Describe: Perform merge, split, and cryptographic operations on pdf files.
from PyPDF2 import PdfFileReader, PdfFileMerger, PdfFileWriter
def get_reader(filename, password):
    try:
        old_file = open(filename, 'rb')
    except IOError as err:
        print('File open failed! ' + str(err))
        return None
    # Create a reader instance
    pdf_reader = PdfFileReader(old_file, strict=False)
    # Decrypt if necessary
    if pdf_reader.isEncrypted:
        if password is None:
            print('%s file is encrypted, requires a password!' % filename)
            old_file.close()
            return None
        # decrypt() returns 0 on failure, 1 for the user password, 2 for the owner password
        if pdf_reader.decrypt(password) == 0:
            print('%s password is incorrect!' % filename)
            old_file.close()
            return None
    # Do not close old_file here: PdfFileReader reads pages lazily from the
    # stream, so it must stay open for as long as the reader is in use.
    return pdf_reader
def encrypt_pdf(filename, new_password, old_password=None, encrypted_filename=None):
    """
    Encrypt the file given by filename and write the result to a new file
    :param filename: path of the file to encrypt
    :param new_password: the password used to encrypt the file
    :param old_password: password of the original file, if it is already encrypted
    :param encrypted_filename: name of the encrypted file; defaults to <filename>_encrypted.pdf
    :return:
    """
    # Create a reader instance
    pdf_reader = get_reader(filename, old_password)
    if pdf_reader is None:
        return
    # Create a writer instance
    pdf_writer = PdfFileWriter()
    # Copy the pages from the reader into the writer
    pdf_writer.appendPagesFromReader(pdf_reader)
    # Encrypt with the new password
    pdf_writer.encrypt(new_password)
    if encrypted_filename is None:
        # Use the old file name + '_encrypted' as the new file name
        encrypted_filename = filename.rsplit('.', 1)[0] + '_encrypted.pdf'
    with open(encrypted_filename, 'wb') as out:
        pdf_writer.write(out)
def decrypt_pdf(filename, password, decrypted_filename=None):
    """
    Decrypt an encrypted file and generate a pdf file without a password
    :param filename: the encrypted pdf file
    :param password: the corresponding password
    :param decrypted_filename: name of the file after decryption
    :return:
    """
    # Create a reader and a writer
    pdf_reader = get_reader(filename, password)
    if pdf_reader is None:
        return
    if not pdf_reader.isEncrypted:
        print('The file is not encrypted, no action!')
        return
    pdf_writer = PdfFileWriter()
    pdf_writer.appendPagesFromReader(pdf_reader)
    if decrypted_filename is None:
        decrypted_filename = filename.rsplit('.', 1)[0] + '_decrypted.pdf'
    # Write the new file
    with open(decrypted_filename, 'wb') as out:
        pdf_writer.write(out)
def split_by_pages(filename, pages, password=None):
    """
    Split the file into parts with a fixed number of pages each
    :param filename: name of the file to split
    :param pages: number of pages in each file after splitting
    :param password: password, if the file is encrypted
    :return:
    """
    # Get a reader
    pdf_reader = get_reader(filename, password)
    if pdf_reader is None:
        return
    # Get the total number of pages
    pages_nums = pdf_reader.numPages
    if pages <= 1:
        print('Each file must be larger than 1 page!')
        return
    # Number of pdf files after splitting (ceiling division)
    pdf_num = (pages_nums + pages - 1) // pages
    print('pdf file is divided into %d copies, each with %d pages!' % (pdf_num, pages))
    # Generate the pdf files in turn
    for cur_pdf_num in range(1, pdf_num + 1):
        # Create a new writer instance
        pdf_writer = PdfFileWriter()
        # Build the output file name, e.g. ex1.pdf -> ex1_1.pdf
        split_pdf_name = filename.rsplit('.', 1)[0] + '_' + str(cur_pdf_num) + '.pdf'
        # Start index of this part
        start = pages * (cur_pdf_num - 1)
        # End index (exclusive); the last part stops at the final page
        end = pages * cur_pdf_num if cur_pdf_num != pdf_num else pages_nums
        # Copy the pages of this part in turn
        for i in range(start, end):
            pdf_writer.addPage(pdf_reader.getPage(i))
        # Write the file
        with open(split_pdf_name, 'wb') as out:
            pdf_writer.write(out)
def split_by_num(filename, nums, password=None):
    """
    Split the pdf file into nums parts
    :param filename: file name
    :param nums: number of parts to split into
    :param password: password, if the file needs to be decrypted
    :return:
    """
    pdf_reader = get_reader(filename, password)
    if not pdf_reader:
        return
    if nums < 2:
        print('Number of copies cannot be less than 2!')
        return
    # Get the total number of pages in the pdf
    pages = pdf_reader.numPages
    if pages < nums:
        print('The number of copies should not be greater than the total number of pages in pdf!')
        return
    # Pages per part; the last part also takes any remainder
    each_pdf = pages // nums
    print('pdf has %d pages, divided into %d copies, each with %d pages!' % (pages, nums, each_pdf))
    for num in range(1, nums + 1):
        pdf_writer = PdfFileWriter()
        # Build the output file name, e.g. ex2.pdf -> ex2_1.pdf
        split_pdf_name = filename.rsplit('.', 1)[0] + '_' + str(num) + '.pdf'
        # Start index of this part
        start = each_pdf * (num - 1)
        # End index (exclusive); the last part stops at the final page
        end = each_pdf * num if num != nums else pages
        print(str(start) + ',' + str(end))
        for i in range(start, end):
            pdf_writer.addPage(pdf_reader.getPage(i))
        with open(split_pdf_name, 'wb') as out:
            pdf_writer.write(out)
def merger_pdf(filenames, merged_name, passwords=None):
    """
    Merge the given list of files in order
    :param filenames: list of files
    :param merged_name: name of the merged file
    :param passwords: list of passwords matching filenames
    :return:
    """
    # Total number of files
    filenums = len(filenames)
    # Note: strict=False is needed here
    pdf_merger = PdfFileMerger(strict=False)
    for i in range(filenums):
        # Get the password
        if passwords is None:
            password = None
        else:
            password = passwords[i]
        pdf_reader = get_reader(filenames[i], password)
        if not pdf_reader:
            return
        # append adds to the end by default
        pdf_merger.append(pdf_reader)
    with open(merged_name, 'wb') as out:
        pdf_merger.write(out)
def insert_pdf(pdf1, pdf2, insert_num, merged_name, password1=None, password2=None):
    """
    Insert the whole of pdf2 into pdf1 at page insert_num
    :param pdf1: pdf1 file name
    :param pdf2: pdf2 file name
    :param insert_num: page position at which to insert
    :param merged_name: the name of the merged file
    :param password1: password for pdf1
    :param password2: password for pdf2
    :return:
    """
    pdf1_reader = get_reader(pdf1, password1)
    pdf2_reader = get_reader(pdf2, password2)
    # If either file cannot be opened, return
    if not pdf1_reader or not pdf2_reader:
        return
    # Get the total number of pages in pdf1
    pdf1_pages = pdf1_reader.numPages
    if insert_num < 0 or insert_num > pdf1_pages:
        print('The insertion position is abnormal, the number of pages you want to insert is: %d, pdf1 file total: %d page!' % (insert_num, pdf1_pages))
        return
    # Note: strict=False is needed here; Chinese text may still come out garbled
    m_pdf = PdfFileMerger(strict=False)
    m_pdf.append(pdf1)
    m_pdf.merge(insert_num, pdf2)
    with open(merged_name, 'wb') as out:
        m_pdf.write(out)
if __name__ == '__main__':
    # encrypt_pdf('ex1.pdf', 'leafage')
    # decrypt_pdf('ex1123_encrypted.pdf', 'leafage')
    # split_by_pages('ex1.pdf', 5)
    split_by_num('ex2.pdf', 3)
    # merger_pdf(['ex1.pdf', 'ex2.pdf'], 'merger.pdf')
    # insert_pdf('ex1.pdf', 'ex2.pdf', 10, 'pdf12.pdf')
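The page arithmetic in split_by_pages can be checked without any PDF at hand. The sketch below is a standalone illustration using only the standard library; plan_split is a hypothetical helper that returns the output names and half-open page ranges for each part, assuming output files are named <base>_<n>.pdf.

```python
import os


def plan_split(filename, total_pages, pages_per_file):
    """Return (output_name, start, end) for each part; end is exclusive.

    Hypothetical helper: it mirrors the arithmetic of split_by_pages
    but touches no files.
    """
    if pages_per_file < 1 or total_pages < 1:
        raise ValueError('page counts must be positive')
    base, _ext = os.path.splitext(filename)
    # Ceiling division: a partial last chunk still needs its own file
    parts = (total_pages + pages_per_file - 1) // pages_per_file
    plan = []
    for n in range(1, parts + 1):
        start = pages_per_file * (n - 1)
        end = min(pages_per_file * n, total_pages)
        plan.append(('%s_%d.pdf' % (base, n), start, end))
    return plan


for name, start, end in plan_split('ex1.pdf', 12, 5):
    print(name, start, end)
```

Printing the plan before writing any files is a cheap way to confirm that the last part ends exactly at the final page.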