Extract information from PDF ---- PDFMiner

Source: Internet
Author: User
Tags: parsing pdf files, pdf parser


Today, for various reasons, I needed to extract the text from a PDF file. After some searching I found PDFMiner, which is designed for content extraction. In the end it turned out that all the text in my PDF was actually images, so I could not finish that job, but I tried PDFMiner on a PDF that contains real text and it worked quite well.

PDFMiner ---- a PDF parser and analyzer for Python

1. Official documentation: http://www.unixuser.org/~euske/python/pdfminer/index.html

2. Features

  • Written entirely in Python. (For version 2.4 or later)
  • Parses, analyzes, and converts PDF documents.
  • Supports the PDF-1.7 specification. (Almost)
  • Supports CJK (Chinese, Japanese, and Korean) languages and vertical writing scripts.
  • Supports various font types (Type1, TrueType, Type3, and CID).
  • Supports basic encryption (RC4).
  • PDF to HTML conversion.
  • Outline (TOC) extraction.
  • Tagged contents extraction.
  • Reconstructs the original layout by grouping text chunks.

3. Install

Note: When installing from source, an additional installation step is required in order to process CJK (Chinese, Japanese, and Korean) languages.

4. Usage

4.1 Classes used for parsing PDF files:

  • PDFParser: fetches data from a file.
  • PDFDocument: stores the fetched data; it is associated with a PDFParser.
  • PDFPageInterpreter: processes the page contents.
  • PDFDevice: translates the processed contents into whatever format you need.
  • PDFResourceManager: stores shared resources, such as fonts or images.

PDFMiner class relationships:

 

4.2 basic usage

4.2.1 parse PDF files

 

    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfpage import PDFTextExtractionNotAllowed
    from pdfminer.pdfinterp import PDFResourceManager
    from pdfminer.pdfinterp import PDFPageInterpreter
    from pdfminer.pdfdevice import PDFDevice

    password = ''  # leave empty if the PDF is not password protected

    fp = open('mypdf.pdf', 'rb')
    # Create a PDF parser object associated with the file.
    parser = PDFParser(fp)
    # Create a PDF document object that stores the document structure.
    # The password is used for initialization; if there is none,
    # this argument can be omitted.
    document = PDFDocument(parser, password)
    # Check whether the document allows text extraction.
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed
    # Create a PDF resource manager object that stores shared resources.
    rsrcmgr = PDFResourceManager()
    # Create a PDF device object.
    device = PDFDevice(rsrcmgr)
    # Create a PDF interpreter object.
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Process each page in the document.
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)

 

Of course, this only parses the document; it performs no layout analysis. Actually getting the data out requires the next step.

4.2.2 layout analysis

First, extend the code from 4.2.1 as follows: add the two imports, and replace the device, the interpreter, and the page loop.

    from pdfminer.layout import LAParams
    from pdfminer.converter import PDFPageAggregator

    # Set parameters for analysis.
    laparams = LAParams()
    # Create a PDF page aggregator object.
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
        # Receive the LTPage object for this page.
        layout = device.get_result()

 

Layout analysis returns an LTPage object for each page in the PDF document. This object and the child objects contained in the page form a tree structure. The object types are:

  • LTPage: represents an entire page. It may contain LTTextBox, LTFigure, LTImage, LTRect, LTCurve, and LTLine child objects.
  • LTTextBox: represents a group of text chunks that can be contained in a rectangular area. Note that this box is created by geometric analysis and does not
    necessarily represent a logical boundary of the text. It contains a list of LTTextLine objects. The get_text() method returns the text content.
  • LTTextLine: contains a list of LTChar objects that represent a single text line. The characters are aligned either horizontally or vertically, depending on the writing mode of the text.
    The get_text() method returns the text content.
  • LTChar / LTAnno: represent an actual letter in the text as a Unicode string. Note that while an LTChar object has actual boundaries,
    an LTAnno object does not, because it is a "virtual" character inserted by the layout analyzer according to the relationship between two characters (for example, a space).
  • LTImage: represents an image object. Embedded images can be in JPEG or other formats, but currently PDFMiner does not pay much attention to graphical objects.
  • LTLine: represents a single straight line. It can be used to separate text or figures.
  • LTRect: represents a rectangle. It can be used to frame other pictures or figures.
  • LTCurve: represents a generic Bezier curve.
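
Because the layout objects form a tree, a short recursive walk is enough to see what a page contains. The following is only a sketch, not part of PDFMiner: `dump_layout` is a hypothetical helper name, and the code assumes nothing beyond the fact that container objects (LTPage, LTTextBox, LTTextLine) can be iterated over while leaf objects (LTChar, LTImage, and so on) cannot.

```python
def dump_layout(node, depth=0):
    """Return one indented line per layout object, depth first.

    Hypothetical helper: it relies on duck typing only. Any object
    that can be iterated is treated as a container, anything else
    is treated as a leaf.
    """
    lines = ['  ' * depth + type(node).__name__]
    try:
        children = iter(node)
    except TypeError:
        # Leaf object: nothing to descend into.
        return lines
    for child in children:
        lines.extend(dump_layout(child, depth + 1))
    return lines
```

Inside the page loop from 4.2.2, `print('\n'.join(dump_layout(layout)))` would then list the class name of every object on the current page. Since the walk goes all the way down to single characters, the output for a text-heavy page can be long.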

4.2.3 obtain the directory (outline)

    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument

    password = ''  # leave empty if the PDF is not password protected

    # Open a PDF document.
    fp = open('mypdf.pdf', 'rb')
    parser = PDFParser(fp)
    document = PDFDocument(parser, password)

    # Get the outlines of the document.
    outlines = document.get_outlines()
    for (level, title, dest, a, se) in outlines:
        print(level, title)
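
The level value makes it easy to print the outline as an indented table of contents. The sketch below is an illustration under assumptions: `format_outline` is a hypothetical helper, it takes the same five-element tuples that the loop above unpacks, and it assumes that top-level entries are reported as level 1.

```python
def format_outline(outlines, indent='  '):
    """Turn (level, title, dest, a, se) tuples into indented lines.

    Hypothetical helper; only level and title are used.
    """
    lines = []
    for level, title, dest, a, se in outlines:
        # Assumption: top-level entries have level 1, so they get no
        # indentation. max() guards against a level below 1.
        lines.append(indent * max(level - 1, 0) + title)
    return lines
```

With the code above, `print('\n'.join(format_outline(document.get_outlines())))` would print one title per line, nested by depth. A document without an outline has nothing to list, so the call to get_outlines() should be guarded in that case.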

 

5. Personal use

    # -*- coding: utf-8 -*-
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfpage import PDFTextExtractionNotAllowed
    from pdfminer.pdfinterp import PDFResourceManager
    from pdfminer.pdfinterp import PDFPageInterpreter
    from pdfminer.pdfdevice import PDFDevice
    from pdfminer.layout import LAParams, LTTextBoxHorizontal
    from pdfminer.converter import PDFPageAggregator
    import os

    os.chdir(r'F:\test')
    fp = open('pythonout', 'rb')
    # Create a PDF parser object.
    parser = PDFParser(fp)
    # Create a PDF document object that stores the document structure.
    document = PDFDocument(parser)
    # Check whether the document allows text extraction.
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed
    else:
        # Create a PDF resource manager object that stores shared resources.
        rsrcmgr = PDFResourceManager()
        # Set parameters for analysis.
        laparams = LAParams()
        # Create a PDF device object.
        # device = PDFDevice(rsrcmgr)
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        # Create a PDF interpreter object.
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Process each page.
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            # Receive the LTPage object for this page.
            layout = device.get_result()
            for x in layout:
                if isinstance(x, LTTextBoxHorizontal):
                    # Append in binary mode so the UTF-8 bytes are written as-is.
                    with open('a.txt', 'ab') as f:
                        f.write(x.get_text().encode('utf-8') + b'\n')
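
The script above reopens a.txt once for every text box. One possible alternative, sketched here under assumptions, is to collect the strings first and write them in a single pass. `save_texts` is a hypothetical helper, not a PDFMiner function; it uses `io.open` so that the UTF-8 encoding is handled by the file object, and it overwrites the target file instead of appending to it.

```python
import io


def save_texts(texts, path):
    """Write each text chunk to `path` as UTF-8.

    Hypothetical helper: `texts` is any iterable of Unicode strings,
    for example the get_text() results gathered from the text boxes.
    """
    with io.open(path, 'w', encoding='utf-8') as f:
        for text in texts:
            # Make sure every chunk ends with exactly what it had plus,
            # if missing, one trailing newline.
            if not text.endswith(u'\n'):
                text += u'\n'
            f.write(text)
```

To use it with the script above, append each `x.get_text()` to a list inside the page loop and call `save_texts(the_list, 'a.txt')` once after the loop has finished.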

 

With this I obtained the text of the book. This is only a simple use of the library; the official documentation explains everything in full, and this post is just a short summary.

Note: Please indicate the source for reprinting.

 
