PDF data extraction ------ 1. Introduction

Source: Internet
Author: User
Tags parsing pdf files

1. About PDF files

 

PDF (Portable Document Format) is a file format developed by Adobe Systems in 1993 for document exchange. Its advantages are that it is cross-platform, preserves the original layout of the document, is an open standard, and can be implemented royalty-free, so anyone may develop PDF-compatible software. (PDF, Wikipedia)

 

2. About parsing PDF

As described in the blog post that gave me the idea for this PDF code, there is a lot of work ..., since the company where I am interning needs to extract financial data from a large number of documents. For web page parsing we have the powerful HtmlAgilityPack and ScrapySharp, and Excel and Word files can be handled directly with the Office family's .NET libraries. For PDF files, however, there is no uniform solution. Admittedly I have not studied every PDF tool, but most of the ones I have seen fall short when it comes to parsing. (My knowledge may be limited, but I have not found a tool that parses PDF files the way web pages can be parsed.) The company currently has a system with a PDF data parsing module, which first converts PDF files to HTML and then parses the HTML. There are many mature solutions for parsing HTML markup, but because of the nature of the PDF format, parsing a PDF is inherently not guaranteed to be correct, and parsing the HTML converted from it is even harder. After asking around, I learned that this approach has not worked well for parsing PDFs.

 

3. My Solutions

The company has many types of PDF to parse, in large quantities. For analysts, manually extracting information from a large number of PDF files every day is extremely painful and hard on the eyes, body, and mind. Different kinds of PDF also call for different parsing strategies, so I plan to introduce my solution one PDF type at a time. The solution is a parsing layer for the company's business, built on top of javasnet.dll. This DLL is not open source, but it works fine in a debug build. I once forgot to import the license and shipped a release build internally, and it turned out to be unusable on only one computer. The company does not mind paying this small amount, and we do have a license. I hope the experts in this community can give me suggestions or propose better solutions!

 

4. Category Introduction

I plan to introduce this PDF parsing solution as a series, organized by requirement.

1. Capturing key-value information stored as text strings in a PDF (completed)

Introduction: This is the simplest and most traditional approach. It mainly uses regular expressions for semantic recognition and verification.
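To make the idea concrete, here is a minimal Python sketch of regex-based key-value capture with a simple verification step. It is not the author's implementation (which is built on a .NET DLL); the sample text, the field labels, and the names FIELD_PATTERNS, extract_fields, and parse_amount are all hypothetical, and the text is assumed to have been extracted from the PDF already.

```python
import re

# Hypothetical text, assumed to have been extracted from a PDF page already.
SAMPLE_TEXT = """
Fund Name: Example Growth Fund
Report Date: 2015-06-30
Net Asset Value: 1,234,567.89
"""

# Hypothetical field patterns: a key label, a colon, then the captured value.
# The date and amount patterns also act as a format check on the value.
FIELD_PATTERNS = {
    "fund_name": re.compile(r"Fund Name\s*:\s*(.+)"),
    "report_date": re.compile(r"Report Date\s*:\s*(\d{4}-\d{2}-\d{2})"),
    "net_asset_value": re.compile(r"Net Asset Value\s*:\s*([\d,]+\.\d{2})"),
}


def extract_fields(text):
    """Return {field name: captured string} for every pattern that matches."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            result[name] = match.group(1).strip()
    return result


def parse_amount(value):
    """Convert a string such as '1,234,567.89' to a float (raises ValueError if malformed)."""
    return float(value.replace(",", ""))


fields = extract_fields(SAMPLE_TEXT)
print(fields)
print(parse_amount(fields["net_asset_value"]))
```

A field whose value does not match the expected format is simply absent from the result, which gives the caller a cheap way to flag documents for manual review.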

2. Capturing key-value data from table-like layouts in a PDF (completed)

Description: This approach encapsulates the data structures of a consumer string class and a consumer analyzer class, and extracts the data that lies within a specified range of the given keywords.
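One possible reading of "data within a specified range of a keyword" is positional: find the keyword on the page, then collect the words on the same row that lie within a fixed distance to its right. The Python sketch below illustrates that reading only. The Word type, the value_right_of function, and the tolerance values are hypothetical and are not the classes the author describes.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A piece of text with its position on the page (hypothetical structure)."""
    text: str
    x: float  # left edge
    y: float  # baseline


def value_right_of(words, keyword, max_dx=200.0, y_tol=2.0):
    """Join the words on the keyword's row that start within max_dx to its right.

    Returns None when the keyword is not present.
    """
    anchor = next((w for w in words if w.text == keyword), None)
    if anchor is None:
        return None
    hits = [
        w for w in words
        if w is not anchor
        and abs(w.y - anchor.y) <= y_tol      # same row, within tolerance
        and 0 < w.x - anchor.x <= max_dx      # to the right, inside the range
    ]
    hits.sort(key=lambda w: w.x)              # restore reading order
    return " ".join(w.text for w in hits)


# Hypothetical positioned words, as a PDF text extractor might report them.
words = [
    Word("Revenue", 50, 700), Word("1,200", 180, 700.5), Word("USD", 230, 700),
    Word("Note", 400, 700),   # same row, but outside the 200-unit range
    Word("Cost", 50, 680), Word("800", 180, 680),
]

print(value_right_of(words, "Revenue"))
print(value_right_of(words, "Cost"))
```

The vertical tolerance matters because words on one visual row rarely share exactly the same baseline coordinate.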

3. Converting large amounts of data in a PDF to Excel (completed)

Introduction: An extension built on 1 and 2. It adds automatic fuzzy matching of row and column boundary ranges, and extracts the correct data by sorting on position coordinates.
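The fuzzy row and column matching can be sketched as a clustering of coordinates: values that lie within a tolerance of each other are treated as the same row or column, and each word is then placed in the nearest row and column. The Python sketch below makes several assumptions that are not in the original: the function names, the tolerances, the sample data, a y axis that grows upward (as in default PDF user space), and CSV output standing in for Excel because it needs only the standard library.

```python
import csv
import io


def cluster(values, tol):
    """Group sorted values whose neighbours differ by at most tol; return group centres."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= tol:
            groups[-1].append(v)
        else:
            groups.append([v])
    return [sum(g) / len(g) for g in groups]


def nearest(centers, v):
    """Index of the centre closest to v."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - v))


def words_to_grid(words, x_tol=5.0, y_tol=3.0):
    """Turn (text, x, y) tuples into a list of rows, ordered top to bottom."""
    words = list(words)
    col_centers = cluster([x for _, x, _ in words], x_tol)
    row_centers = cluster([y for _, _, y in words], y_tol)
    grid = [["" for _ in col_centers] for _ in row_centers]
    for text, x, y in words:
        r, c = nearest(row_centers, y), nearest(col_centers, x)
        grid[r][c] = (grid[r][c] + " " + text).strip()
    grid.reverse()  # assumption: y grows upward, so the largest y is the top row
    return grid


def grid_to_csv(grid):
    """Serialize the grid as CSV text, which Excel can open."""
    buf = io.StringIO()
    csv.writer(buf).writerows(grid)
    return buf.getvalue()


# Hypothetical words with slightly jittered coordinates.
words = [
    ("Item", 50, 700), ("Amount", 200.5, 700.8),
    ("Revenue", 50.4, 680), ("1200", 201, 679.2),
    ("Cost", 49.8, 660.3), ("800", 200, 660),
]
grid = words_to_grid(words)
print(grid_to_csv(grid))
```

Tolerances that are too large merge neighbouring columns, and tolerances that are too small split one column in two, so in practice they would need tuning per document type.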

4. PDFs saved in image format (unfinished)

Idea: I do not have a good solution for this kind of PDF yet. It will require image recognition algorithms.
