Learn more about parsing and reading PDF file content with Python

Source: Internet
Author: User



This article covers parsing and reading the content of PDF files with Python. It walks through the libraries involved, the library changes between python2.7 and python3.6, and a detailed explanation and application of the Pdfminer library. It draws mainly on existing blog posts and code.



The article is organized as a small project: it describes the problem, the operating environment, and the libraries to install, and then gives the code. The code is first written for python2.7 and then rewritten for python3.6, with an explanation of how the libraries differ between the two versions. Finally, it explains what the code means and how the library is designed, so that we understand and can apply Python's methods for parsing and reading the contents of a PDF file.


One, the problem description


Read PDF text content with Python


Two, the operating environment


Python 3.6


Three, libraries that need to be installed
pip install pdfminer




Four, the source code (code 1 and code 2 are both written for python2.7)


Code 1 (Win64)


# coding=utf-8
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import time
time1 = time.time()
import os.path
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTTextBoxHorizontal, LAParams
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed
result = []
class CPdf2TxtManager():
  def __init__(self):
    '''
    Constructor
    '''
  def changePdfToText(self, filePath):
    file = open(filePath, 'rb') # Open in binary read mode
    # Create a PDF document parser from the file object
    parser = PDFParser(file)
    # Create a PDF document
    doc = PDFDocument()
    # Connect the parser to the document object
    parser.set_document(doc)
    doc.set_parser(parser)
    # Provide the initialization password
    # If there is no password, an empty string is used
    doc.initialize()
    # Check whether the document allows text extraction; raise an error if it does not
    if not doc.is_extractable:
      raise PDFTextExtractionNotAllowed
    # Create a PDF resource manager to manage shared resources
    rsrcmgr = PDFResourceManager()
    # Create a PDF device object
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    # Create a PDF interpreter object
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    fileNames = os.path.splitext(filePath)
    # Open the output file once so that every text block is kept
    with open(fileNames[0] + '.txt', 'wb') as f:
      # Loop through the pages, processing one page at a time
      for page in doc.get_pages(): # doc.get_pages() gets the page list
        interpreter.process_page(page)
        # Receive the LTPage object of this page
        layout = device.get_result()
        for x in layout:
          if hasattr(x, "get_text"):
            results = x.get_text()
            result.append(results)
            print(results)
            f.write(results.encode('utf-8') + '\n')
    file.close()
if __name__ == '__main__':
  '''
  Parse pdf text and save to txt file
  '''
  path = u'C:/data3.pdf'
  pdf2TxtManager = CPdf2TxtManager()
  pdf2TxtManager.changePdfToText(path)
  # print result[0]
  time2 = time.time()
  print u'ok, end of parsing pdf!'
  print u'total time: ' + str(time2 - time1) + 's'





Code 2 (WIN32)


# coding=utf-8
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import time
time1 = time.time()
import os.path
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
result = []
class CPdf2TxtManager():
  def __init__(self):
    '''
    Constructor
    '''
  def changePdfToText(self, filePath):
    file = open(filePath, 'rb') # Open in binary read mode
    # Create a PDF document parser from the file object
    parser = PDFParser(file)
    # Create a PDF document from the parser
    doc = PDFDocument(parser)
    # Check whether the document allows text extraction; raise an error if it does not
    if not doc.is_extractable:
      raise PDFTextExtractionNotAllowed
    # Create a PDF resource manager to manage shared resources
    rsrcmgr = PDFResourceManager()
    # Create a PDF device object
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    # Create a PDF interpreter object
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    fileNames = os.path.splitext(filePath)
    # Open the output file once so that every text block is kept
    with open(fileNames[0] + '.txt', 'wb') as f:
      # Loop through the pages, processing one page at a time
      for page in PDFPage.create_pages(doc): # replaces doc.get_pages()
        interpreter.process_page(page)
        # Receive the LTPage object of this page
        layout = device.get_result()
        for x in layout:
          if hasattr(x, "get_text"):
            results = x.get_text()
            result.append(results)
            print(results)
            f.write(results.encode('utf-8') + '\n')
    file.close()
if __name__ == '__main__':
  '''
  Parse pdf text and save to txt file
  '''
  path = u'C:/36.pdf'
  pdf2TxtManager = CPdf2TxtManager()
  pdf2TxtManager.changePdfToText(path)
  # print result[0]
  time2 = time.time()
  print u'ok, end of parsing pdf!'
  print u'total time: ' + str(time2 - time1) + 's'




Five, issues when porting the python2.7 implementation to python3.6

Issue one, the reload workaround


The reload(sys) and sys.setdefaultencoding('utf-8') lines in the code above are a python2 workaround. In Python3 the need for them no longer exists, so they have no practical significance there.



In python2.x there is no clear distinction between str and bytes, so code often relies on defaultencoding to do the conversion.
In Python3, str and bytes are clearly distinct types, and converting from one to the other requires explicitly specifying the encoding.



However, reload itself is still available in Python3 through importlib (sys.setdefaultencoding is not):


import importlib, sys
importlib.reload(sys)
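
The str and bytes distinction described above can be sketched in a few lines. This is only an illustration for Python3; the variable names are made up for the example.

```python
# A text string (str) and its UTF-8 encoded form (bytes) are distinct types in Python 3.
text = "PDF 文档"

# str -> bytes requires naming the encoding explicitly.
data = text.encode("utf-8")

# bytes -> str likewise requires naming the encoding.
round_trip = data.decode("utf-8")

# Mixing the two without converting raises TypeError instead of guessing an encoding.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(type(text).__name__, type(data).__name__, round_trip == text, mixed_ok)
```

Because every conversion names its encoding, there is nothing left for a process-wide default encoding to do, which is why the python2 workaround is unnecessary.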




Issue two, installation of the Pdfminer module


In python2.7 it can be installed directly:


pip install pdfminer


In python3.6 a different package is required:


pip install pdfminer3k
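
Since the two packages expose PDFDocument from different modules (pdfminer.pdfparser in code 1 and in the python3.6 code, pdfminer.pdfdocument in code 2), it can help to check which layout is installed before choosing a listing. The following is a minimal sketch; the function name detect_pdfminer_api and the labels it returns are hypothetical and only used for this illustration.

```python
def detect_pdfminer_api():
    """Report which import layout the installed pdfminer package uses.

    Returns "pdfdocument" if PDFDocument lives in pdfminer.pdfdocument
    (as in code 2), "pdfparser" if it lives in pdfminer.pdfparser
    (as in code 1 and the python3.6 code), or None if neither import works.
    """
    try:
        from pdfminer.pdfdocument import PDFDocument  # layout used by code 2
        return "pdfdocument"
    except ImportError:
        pass
    try:
        from pdfminer.pdfparser import PDFDocument  # layout used by code 1
        return "pdfparser"
    except ImportError:
        return None


print(detect_pdfminer_api())
```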




Six, source code for python3.6
import importlib
import sys
import time

# Not needed in Python3; kept only to mirror the python2.7 code
importlib.reload(sys)
time1 = time.time()
# print("Initial time is:", time1)

import os.path
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTTextBoxHorizontal, LAParams
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed

text_path = r'words-words.pdf'
# text_path = r'photo-words.pdf'

def parse():
    '''Parse PDF text and save it to a TXT file'''
    fp = open(text_path, 'rb')
    # Create a PDF document parser from the file object
    parser = PDFParser(fp)
    # Create a PDF document
    doc = PDFDocument()
    # Connect the parser with the document object
    parser.set_document(doc)
    doc.set_parser(parser)

    # Provide the initialization password; if there is no password, an empty string is used
    doc.initialize()

    # Check whether the document allows text extraction; raise an error if it does not
    if not doc.is_extractable:
        raise PDFTextExtractionNotAllowed
    else:
        # Create a PDF resource manager to manage shared resources
        rsrcmgr = PDFResourceManager()
        # Create a PDF device object
        laparams = LAParams()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        # Create a PDF interpreter object
        interpreter = PDFPageInterpreter(rsrcmgr, device)

        # Loop through the pages, processing one page at a time
        # doc.get_pages() gets the page list
        for page in doc.get_pages():
            interpreter.process_page(page)
            # Receive the LTPage object of this page
            layout = device.get_result()
            # Here layout is an LTPage object which stores the various objects parsed from this page
            # These generally include LTTextBox, LTFigure, LTImage, LTTextBoxHorizontal, etc.
            # To get the text, call the get_text() method of the object
            for x in layout:
                if isinstance(x, LTTextBoxHorizontal):
                    with open(r'2.txt', 'a', encoding='utf-8') as f:
                        results = x.get_text()
                        print(results)
                        f.write(results + "\n")

if __name__ == '__main__':
    parse()
    time2 = time.time()
    print("Total elapsed time:", time2 - time1)




Seven, analysis of the Python code for reading pdf documents


PDF is not a structured document format. Although we speak of a "PDF document", it is unlike a Word or HTML document and behaves more like a picture: a PDF places content at exact positions on a sheet of paper. In most cases there is no logical structure such as sentences or paragraphs, and the content cannot be adjusted to fit a different page size. Pdfminer tries to reconstruct that structure by guessing the layout, but there is no guarantee that this will work. It is not pretty, but PDF really does not prescribe any such structure.
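
To make the idea of "guessing the layout" concrete, here is a toy sketch that rebuilds lines from text fragments that carry only page coordinates. It is not Pdfminer's actual algorithm; the function name group_into_lines, the (x, y, text) tuple format, and the tolerance value are all assumptions made for this illustration.

```python
def group_into_lines(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into lines of text.

    Fragments whose y coordinates differ by at most y_tolerance are treated
    as one line; lines are returned top to bottom (larger y is higher on the
    page), and fragments within a line are ordered left to right.
    """
    lines = []
    for x, y, text in sorted(fragments, key=lambda f: (-f[1], f[0])):
        if lines and abs(lines[-1]["y"] - y) <= y_tolerance:
            lines[-1]["parts"].append((x, text))
        else:
            lines.append({"y": y, "parts": [(x, text)]})
    return [" ".join(t for _, t in sorted(line["parts"])) for line in lines]


fragments = [
    (110, 700.5, "world"),
    (72, 680, "Second"),
    (72, 700, "Hello"),
    (120, 680, "line"),
]
print(group_into_lines(fragments))  # ['Hello world', 'Second line']
```

A heuristic like this breaks easily, for example with multi-column pages or rotated text, which is why the reconstruction cannot be guaranteed to work.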



The usage process is described below, and we will break it down step by step.








Because a PDF file has such a large and complex structure, fully parsing it is time-consuming and laborious, and for most PDF tasks many parts of the file are never needed. PDFMiner therefore adopts a lazy parsing strategy: it parses only what is needed. Parsing requires at least two core classes, PDFParser and PDFDocument, which work together with the other classes.



PDFParser gets data from a file

PDFDocument stores document data structures in memory

PDFPageInterpreter parses page content

PDFDevice turns the parsed content into what you need

PDFResourceManager stores shared resources, such as fonts or pictures
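
The lazy strategy mentioned above can be illustrated with a small stand-in class. This is only a sketch of the general idea of parsing on demand and caching the result; LazyPages and its methods are hypothetical and do not reflect how PDFMiner is implemented internally.

```python
class LazyPages:
    """Hold raw page data and parse a page only when it is first requested."""

    def __init__(self, raw_pages):
        self._raw_pages = list(raw_pages)
        self._parsed = {}      # cache: page index -> parsed result
        self.parse_count = 0   # how many pages were actually parsed

    def _parse(self, raw):
        # Stand-in for expensive parsing work: split the raw page into words.
        self.parse_count += 1
        return raw.split()

    def page(self, index):
        # Parse on first access only; later requests reuse the cached result.
        if index not in self._parsed:
            self._parsed[index] = self._parse(self._raw_pages[index])
        return self._parsed[index]


doc = LazyPages(["first page text", "second page", "third"])
words = doc.page(1)
words_again = doc.page(1)
print(words, doc.parse_count)  # ['second', 'page'] 1
```

Only the requested page is parsed, and it is parsed once, no matter how many pages the document holds.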











First, use the open method (for a local file) or urlopen (for a document on the network) to open the document and obtain a file object. Opening a document over the network places a heavy burden on the web server when the document is large. The example below opens a local file.





# Get the document object
pdf0 = open('samplefortest.pdf', 'rb')
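
For the network case, a small helper can hide the difference between a local path and a URL. This is a minimal sketch using only the standard library; the helper name open_pdf_source is hypothetical, and it assumes the parser needs a seekable file object, which a raw network response is not.

```python
import io
from urllib.request import urlopen


def open_pdf_source(source):
    """Return a seekable binary file object for a local path or an http(s) URL."""
    if source.startswith(("http://", "https://")):
        # A network response is a forward-only stream, so read it fully
        # into memory to get an object that supports seek().
        with urlopen(source) as response:
            return io.BytesIO(response.read())
    # Local file: open in binary read mode, as in the example above.
    return open(source, "rb")
```

The returned object can be passed to the parser in place of pdf0. Reading the whole download into memory is only reasonable for documents of modest size.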


Then create the document parser and the PDF document object, and associate them with each other


# Create a parser associated with the document
parser = PDFParser(pdf0)

# Create a PDF document object
doc = PDFDocument()

# Connect the two
parser.set_document(doc)
doc.set_parser(parser)


Next, initialize the PDF document object; if the document itself is encrypted, you need to pass the password parameter


# Initialize the document (an empty password here)
doc.initialize('')





Next, create the PDF resource manager and the parameter analyzer


# Create the PDF resource manager
resources = PDFResourceManager()

# Create the parameter analyzer
laparam = LAParams()


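Then create an aggregator that receives the PDF resource manager and the parameter analyzer as parameters


# Create an aggregator that takes the resource manager and the parameter analyzer as parameters
device = PDFPageAggregator(resources, laparams=laparam)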


Finally, create a page interpreter with the PDF resource manager and the aggregator as parameters

# Create a page interpreter
interpreter = PDFPageInterpreter(resources, device)

In this way, the page interpreter can decode the PDF document and interpret it into a form that Python can work with.

 

Finally, use the get_pages() method of the PDF document object to read the set of pages from the PDF document, use the page interpreter to process the pages one by one, and call the aggregator's get_result() method after each page to obtain that page's layout. The get_text() method of the objects in the layout then gives the text of each page.



for page in doc.get_pages():
    # Process the page with the page interpreter
    interpreter.process_page(page)
    # Get the page content from the aggregator
    layout = device.get_result()

    for out in layout:
        if hasattr(out, 'get_text'):  # the layout holds more than text objects
            print(out.get_text())


Note that a PDF document contains not only text but also pictures and other objects, so to avoid errors, first check whether the object has a get_text() method.
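
The effect of that check can be tried without a PDF file by using stand-in objects. In this sketch FakeTextBox, FakeImage, and collect_text are hypothetical names invented for the illustration; they only mimic layout objects with and without a get_text() method.

```python
class FakeTextBox:
    """Stand-in for a layout object that carries text."""

    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


class FakeImage:
    """Stand-in for a layout object without text, such as a picture."""


def collect_text(layout):
    # Keep only the objects that offer get_text(); skip everything else.
    return [obj.get_text() for obj in layout if hasattr(obj, "get_text")]


layout = [FakeTextBox("Title\n"), FakeImage(), FakeTextBox("Body\n")]
print(collect_text(layout))  # ['Title\n', 'Body\n']
```

Calling get_text() on the picture stand-in without the check would raise AttributeError, which is the error the check avoids.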


Eight, the result analysis


If the PDF file contains only text, it is parsed completely: the text is read and saved into a TXT document. If the file contains pictures or other objects, those are not read.



Three experiments were run for this article, with PDF documents containing only text, only pictures, and both text and pictures.



The results show:


PDF with only text: this program reads all the text
PDF with only pictures: this program reads nothing
PDF with pictures and text: this program reads only the text and does not recognize the pictures


Therefore, recognizing text inside images requires not only the Pdfminer library but also image processing and related technologies.
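
One practical consequence is that pages which produce no text are candidates for image-based recognition. The sketch below assumes the extracted text has already been collected as one string per page; the function name pages_without_text and the sample data are hypothetical.

```python
def pages_without_text(page_texts):
    """Return the 1-based numbers of pages whose extracted text is empty or whitespace."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not text.strip()
    ]


extracted = ["Chapter 1\nSome text\n", "", "  \n", "More text"]
print(pages_without_text(extracted))  # [2, 3]
```

Pages reported this way are only candidates: a page may simply be blank rather than contain a scanned image.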


