PDF data extraction ------ 3. Parsing demo

Source: Internet
Author: User

1. Capture key-value information stored as text strings in a PDF (completed)

Introduction: This type of parsing is the most traditional and the simplest. It mainly uses regular expressions for semantic recognition and verification. For example, the code below captures the meeting date from the document text.

string meetingData = GetMeetingData();

public string GetMeetingData()
{
    // Pass 1: a broad pattern that finds every candidate date string in the PDF.
    string patternAll = @"(?<NDAandCAMDate>\s*Discussion\s*.{2,15}\d{2,4}\s*year\s*\d{1,2}\s*month\s*\d{1,2}\s*day.{0,15})";

    PdfAnalyzer pa = new PdfAnalyzer();
    PDFNet.Initialize();
    PDFDoc doc = new PDFDoc(item);
    doc.InitSecurityHandler();
    List<PdfString> foundAll = pa.RegexSearchAllPages(doc, patternAll);

    // Pass 2: stricter patterns, tried from most to least specific.
    // Shared parts: the date, then an optional weekday in parentheses.
    string date = @"(?<Year>\d{2,4})year(?<Month>\d{1,2})month(?<Day>\d{1,2})day";
    string weekday = @"((\(|\()(week|week)(1|2|3|4|5|6|7)(\)|\)))?";
    string hour = @"(?<Hour>\d{1,2})";
    string minute = @"(?<Minute>\d{1,2})";

    List<string> patternFilter = new List<string>();
    patternFilter.Add(date + weekday + "(AM)?" + hour + @"(\:|point|hour)" + minute);
    patternFilter.Add(date + weekday + "Afternoon" + hour + @"(\:|point|hour)" + minute);
    patternFilter.Add(date + weekday + "(AM)?" + hour + "pointhalf");
    patternFilter.Add(date + weekday + "Afternoon" + hour + "pointhalf");
    patternFilter.Add(date + weekday + "(AM)?" + hour + "(point|hour)");
    patternFilter.Add(date + weekday + "Afternoon" + hour + "(point|hour)");
    patternFilter.Add(date);

    // Pass the filter list here, not the broad pattern.
    return GetMeetingDateFilter(foundAll, patternFilter);
}

private string GetMeetingDateFilter(List<PdfString> foundAll, List<string> patternFilter)
{
    foreach (PdfString pdfString in foundAll)
    {
        // Strip spaces so the filter patterns can match the text contiguously.
        string result = pdfString.ToString().Replace(" ", "");
        foreach (string pattern in patternFilter)
        {
            Match ma = new Regex(pattern).Match(result);
            // IsValid is not shown here; it checks the captured date components.
            if (ma.Success && IsValid(ma))
                return ma.Value;
        }
    }
    return string.Empty;
}

Note:

A. The first pass searches all pages for candidate date strings with pa.RegexSearchAllPages(doc, patternAll).

B. The second pass matches each candidate against the stricter regular expressions to obtain the key information, the meeting date.
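
The two passes in the notes above can be sketched without any PDF library. The following is a minimal Java illustration, not the author's implementation: it works on plain page text, uses English-style dates such as "5 March 2014, 14:30" instead of the patterns above, and the class and method names (MeetingDateSketch, findMeetingDate) are hypothetical. The validation step stands in for the IsValid check, which the article does not show.

```java
import java.time.DateTimeException;
import java.time.LocalDateTime;
import java.time.Month;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MeetingDateSketch {
    // Pass 1: a broad pattern that collects date-like snippets plus some trailing context.
    private static final Pattern CANDIDATE =
        Pattern.compile("\\d{1,2}\\s*[A-Za-z]+\\s*\\d{4}.{0,15}");

    // Pass 2: stricter patterns, ordered from most to least specific.
    private static final List<Pattern> FILTERS = Arrays.asList(
        Pattern.compile("(?<day>\\d{1,2})(?<month>[A-Za-z]+)(?<year>\\d{4}),"
            + "(?<hour>\\d{1,2}):(?<minute>\\d{2})"),
        Pattern.compile("(?<day>\\d{1,2})(?<month>[A-Za-z]+)(?<year>\\d{4})"));

    static Optional<LocalDateTime> findMeetingDate(String pageText) {
        Matcher candidates = CANDIDATE.matcher(pageText);
        while (candidates.find()) {
            // Remove whitespace so the strict patterns can match contiguously.
            String compact = candidates.group().replaceAll("\\s+", "");
            for (Pattern filter : FILTERS) {
                Matcher m = filter.matcher(compact);
                if (!m.find()) {
                    continue;
                }
                boolean hasTime = filter.pattern().contains("<hour>");
                Optional<LocalDateTime> parsed = toDateTime(m, hasTime);
                if (parsed.isPresent()) {
                    return parsed;
                }
            }
        }
        return Optional.empty();
    }

    // Validation: reject matches whose components do not form a real date.
    private static Optional<LocalDateTime> toDateTime(Matcher m, boolean hasTime) {
        try {
            Month month = Month.valueOf(m.group("month").toUpperCase(Locale.ROOT));
            int hour = hasTime ? Integer.parseInt(m.group("hour")) : 0;
            int minute = hasTime ? Integer.parseInt(m.group("minute")) : 0;
            return Optional.of(LocalDateTime.of(
                Integer.parseInt(m.group("year")), month,
                Integer.parseInt(m.group("day")), hour, minute));
        } catch (IllegalArgumentException | DateTimeException e) {
            return Optional.empty();
        }
    }
}
```

For a page containing "The board meeting is held on 5 March 2014, 14:30 in room 2.", findMeetingDate returns 2014-03-05T14:30, while a snippet such as "31 February 2014" passes the first pass but is rejected by the validation step.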

 

2. Capture table-like key-value data in a PDF (completed)

Summary: This format requires the encapsulated PdfString data structure and PdfAnalyzer class to extract data within a specified range based on a given keyword. The example below extracts the premium value for a given RIC code.

private string GetPremium(string path, string ricCode)
{
    string result = string.Empty;
    PDFDoc doc = null;
    try
    {
        PDFNet.Initialize();
        doc = new PDFDoc(path);
        doc.InitSecurityHandler();
        if (doc == null)
        {
            string msg = string.Format("can't load pdf to doc = new PDFDoc({0});", path);
            Logger.Log(msg, Logger.LogType.Error);
            return result;
        }

        PdfAnalyzer pa = new PdfAnalyzer();
        // The RIC code gives the x anchor; the "Premium" label gives the y anchor.
        List<PdfString> listX1 = pa.RegexSearchAllPages(doc, ricCode);
        List<PdfString> listY1 = pa.RegexSearchAllPages(doc, @"[Pp]remium");
        List<PdfString> listResult = pa.RegexSearchAllPages(doc, @"(?<Result>\d+\.\d+\%)");
        if (listX1.Count == 0 || listY1.Count == 0 || listResult.Count == 0)
        {
            string msg = string.Format("({0}) or ([Pp]remium) has no match, so premium is empty.", ricCode);
            Logger.Log(msg, Logger.LogType.Warning);
            return result;
        }

        int x1 = Convert.ToInt32(listX1[0].Position.x1);
        int y1 = Convert.ToInt32(listY1[0].Position.y1);

        // Use the anchor position (x1, y1) to pick the right result value:
        // accept the first value within 10 units on both axes.
        foreach (var item in listResult)
        {
            int subX1 = Math.Abs(x1 - Convert.ToInt32(item.Position.x1));
            int subY1 = Math.Abs(y1 - Convert.ToInt32(item.Position.y1));
            if (subX1 <= 10 && subY1 <= 10)
            {
                result = item.ToString().Replace("%", "");
                return result;
            }
        }

        Logger.Log(string.Format("stock code: {0}, extract premium failed.", ricCode), Logger.LogType.Error);
        return result;
    }
    catch (Exception ex)
    {
        string msg = string.Format("PDF analysis failed for {0}! Action: Need manually input gearing and premium\r\n error msg: {1}", ricCode, ex.Message);
        Logger.Log(msg, Logger.LogType.Warning);
        return result;
    }
}
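
The core of this lookup is a coordinate comparison: one match supplies the x anchor, another supplies the y anchor, and the wanted value is the one lying close to both. A minimal Java sketch of just that step follows, assuming the positions have already been extracted from the PDF. The names PremiumLookupSketch, Token and valueAt are hypothetical, and the tolerance of 10 units mirrors the threshold used in GetPremium.

```java
import java.util.List;
import java.util.Optional;

final class PremiumLookupSketch {
    /** A piece of text with the x/y position of its bounding box corner. */
    static final class Token {
        final String text;
        final double x;
        final double y;

        Token(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Returns the first value that is within `tolerance` of the x anchor
    // horizontally and of the y anchor vertically, with the percent sign removed.
    static Optional<String> valueAt(Token xAnchor, Token yAnchor,
                                    List<Token> values, double tolerance) {
        for (Token value : values) {
            boolean xMatches = Math.abs(value.x - xAnchor.x) <= tolerance;
            boolean yMatches = Math.abs(value.y - yAnchor.y) <= tolerance;
            if (xMatches && yMatches) {
                return Optional.of(value.text.replace("%", ""));
            }
        }
        return Optional.empty();
    }
}
```

Returning an Optional makes the "no value near the anchors" case explicit, which corresponds to the branch in GetPremium that logs a failure and returns an empty string.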

 

3. Convert a large amount of data in a PDF to Excel (completed)

Introduction: This extends approaches 1 and 2. It adds automatic fuzzy matching of row and column boundaries, and extracts the correct data by sorting on position coordinates.

private void StartExtractFile()
{
    PDFNet.Initialize();
    PDFDoc doc = new PDFDoc(config.FilePath1);
    doc.InitSecurityHandler();

    // Pattern that matches the title of the RIC column.
    string patternTitle = @"your desired handler";
    int page = 3;
    PdfString ricPosition = GetRicPosition(doc, patternTitle, page);
    if (ricPosition == null) return;

    string patternRic = @"\d{4}";
    string patternValue = @"(\-|\+)?\d+(\,|\.|\d)+";
    List<LineFound> bulkFile = GetValue(doc, ricPosition, patternRic, patternValue);

    int indexOK = 0;
    List<List<string>> bulkFileFilter = FilterBulkFile(bulkFile, indexOK);

    string filePath = Path.Combine(config.OutputFolder,
        string.Format("type1extractedfrompdf1_{0}.csv", DateTime.Now.ToString("dd-MM-yyyy")));
    if (File.Exists(filePath)) File.Delete(filePath);
    XlsOrCsvUtil.GenerateStringCsv(filePath, bulkFileFilter);
    AddResult(Path.GetFileNameWithoutExtension(filePath), filePath, "type1");
}

private List<List<string>> FilterBulkFile(List<LineFound> bulkFile, int indexOK)
{
    List<List<string>> result = new List<List<string>>();
    if (bulkFile == null || bulkFile.Count == 0)
    {
        Logger.Log("no value data extract from pdf");
        return null;
    }

    // The row at indexOK defines the expected number of columns.
    int count = bulkFile[indexOK].LineData.Count;
    foreach (var item in bulkFile)
    {
        if (item.LineData == null || item.LineData.Count <= 0) continue;

        List<string> line = new List<string>();
        if (item.LineData.Count == count)
        {
            foreach (var value in item.LineData)
                line.Add(value.Words.ToString());
        }
        else
        {
            // Incomplete row: keep the first value and pad the rest with empty cells.
            line.Add(item.LineData[0].Words.ToString());
            for (int i = 1; i < count; i++)
                line.Add(string.Empty);
        }
        result.Add(line);
    }
    return result;
}

private List<LineFound> GetValue(PDFDoc doc, PdfString ricPosition, string patternRic, string patternValue)
{
    List<LineFound> bulkFile = new List<LineFound>();
    try
    {
        // Pages are 1-based; note that this loop stops before the last page.
        for (int i = 1; i < doc.GetPageCount(); i++)
        {
            // Column: every RIC on this page, located below the RIC title position.
            List<PdfString> ric = pa.RegexExtractByPositionWithPage(doc, patternRic, i, ricPosition.Position);
            foreach (var item in ric)
            {
                LineFound lineFound = new LineFound();
                lineFound.Ric = item.Words.ToString();
                lineFound.Position = item.Position;
                lineFound.PageNumber = i;
                // Row: every value on the same line as this RIC.
                lineFound.LineData = pa.RegexExtractByPositionWithPage(doc, patternValue, i, item.Position, PositionRect.x2);
                bulkFile.Add(lineFound);
            }
        }
    }
    catch (Exception ex)
    {
        string msg = string.Format("\r\n ClassName: {0}\r\n MethodName: {1}\r\n Message: {2}",
            System.Reflection.MethodBase.GetCurrentMethod().DeclaringType.ToString(),
            System.Reflection.MethodBase.GetCurrentMethod().Name, ex.Message);
        Logger.Log(msg, Logger.LogType.Error);
    }
    return bulkFile;
}

private PdfString GetRicPosition(PDFDoc doc, string pattern, int page)
{
    try
    {
        // Use the pattern argument rather than a hard-coded literal.
        List<PdfString> ricPosition = pa.RegexSearchByPage(doc, pattern, page);
        if (ricPosition == null || ricPosition.Count == 0)
        {
            Logger.Log(string.Format("there is no ric title found by using pattern: {0} to find the ric title, in the page: {1} of the pdf", pattern, page));
            return null;
        }
        return ricPosition[0];
    }
    catch (Exception ex)
    {
        string msg = string.Format("\r\n ClassName: {0}\r\n MethodName: {1}\r\n Message: {2}",
            System.Reflection.MethodBase.GetCurrentMethod().DeclaringType.ToString(),
            System.Reflection.MethodBase.GetCurrentMethod().Name, ex.Message);
        Logger.Log(msg, Logger.LogType.Error);
        throw;
    }
}

struct LineFound
{
    public string Ric { get; set; }
    public Rect Position { get; set; }
    public int PageNumber { get; set; }
    public List<PdfString> LineData { get; set; }
}

Note:

A. Because the coordinates of data in a PDF file are page-based, you must parse and capture the data page by page.

B. The general idea is to first locate the title of the RIC column, and then use its position to collect the list of RICs on each page (that is, extract and sort the column).

C. Based on the position of each entry in that column, extract and sort the values of each row, and combine the rows into table information.
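
Step C can be illustrated on its own once the text fragments of one page and their coordinates are known. The Java sketch below is an assumption-laden illustration rather than the code used in this article: Cell, toRows and the y tolerance are hypothetical, and it assumes the default PDF coordinate system, where the origin is at the bottom-left of the page so that a larger y means higher on the page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TableRowsSketch {
    /** A text fragment from a single page with its position. */
    static final class Cell {
        final String text;
        final double x;
        final double y;

        Cell(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Groups the cells of one page into rows (similar y) and sorts each row left to right.
    static List<List<String>> toRows(List<Cell> cells, double yTolerance) {
        List<Cell> sorted = new ArrayList<>(cells);
        // Top of the page first: larger y comes earlier.
        sorted.sort(Comparator.comparingDouble((Cell c) -> c.y).reversed());

        List<List<Cell>> rows = new ArrayList<>();
        double rowY = 0;
        for (Cell cell : sorted) {
            // Start a new row when the cell is too far below the row's first cell.
            if (rows.isEmpty() || Math.abs(rowY - cell.y) > yTolerance) {
                rows.add(new ArrayList<>());
                rowY = cell.y;
            }
            rows.get(rows.size() - 1).add(cell);
        }

        List<List<String>> result = new ArrayList<>();
        for (List<Cell> row : rows) {
            row.sort(Comparator.comparingDouble((Cell c) -> c.x));
            List<String> texts = new ArrayList<>();
            for (Cell cell : row) {
                texts.add(cell.text);
            }
            result.add(texts);
        }
        return result;
    }
}
```

Because coordinates are only comparable within a page (note A), a caller would run toRows once per page and append the resulting rows in page order.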

Improvement:

This part still requires manual adjustments in the code. The next step is to add automatic recognition, so that when a large amount of PDF data is extracted, the table information is assembled automatically by grouping the data on its position information.
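
One possible starting point for such automatic recognition is to infer the column boundaries from the x coordinates alone. The Java sketch below shows a simple gap-based grouping; it is only an idea under stated assumptions, not something implemented in this article. ColumnInferenceSketch, inferColumns, columnOf and the maxGap threshold are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ColumnInferenceSketch {
    // Groups x positions into columns: after sorting, a gap larger than maxGap
    // starts a new column. Returns the smallest x of each detected column.
    static List<Double> inferColumns(List<Double> xs, double maxGap) {
        List<Double> sorted = new ArrayList<>(xs);
        Collections.sort(sorted);

        List<Double> columns = new ArrayList<>();
        double previous = 0;
        for (double x : sorted) {
            if (columns.isEmpty() || x - previous > maxGap) {
                columns.add(x);
            }
            previous = x;
        }
        return columns;
    }

    // Index of the column whose starting x is closest to the given x.
    static int columnOf(double x, List<Double> columns) {
        int best = 0;
        for (int i = 1; i < columns.size(); i++) {
            if (Math.abs(columns.get(i) - x) < Math.abs(columns.get(best) - x)) {
                best = i;
            }
        }
        return best;
    }
}
```

With the columns known, each extracted value can be placed by columnOf, and a row that lacks a value for some column can be padded with an empty cell instead of being discarded.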

 

4. PDFs stored as images (unfinished)

Idea: I don't have a good solution for this kind of PDF file yet. It requires image recognition algorithms, and for now I cannot handle this format. I hope an expert can offer some good suggestions.


How to extract images from PDF files?

Exporting a PDF as images:
1. Open your PDF file with Adobe Acrobat.
2. From the menu bar, choose File > Export > Image > JPEG.
3. The Export dialog box is displayed. You can click the Settings button in the lower right corner of the dialog box to configure and save the settings.

Reading PDF content with Java

Use Java to read data in PDF files:
Step 1: Download PDFBox-0.7.2.jar. (I put the source code and jar package in the attachment below for your use.)
Step 2: Write a simple program for reading PDF files. (PdfReader.java)
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.MalformedURLException;
import java.net.URL;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class PdfReader {
    public void readFdf(String file) throws Exception {
        // Whether to sort the extracted text
        boolean sort = false;
        // PDF file name
        String pdfFile = file;
        // Output text file name
        String textFile = null;
        // Encoding
        String encoding = "UTF-8";
        // First page to extract
        int startPage = 1;
        // Last page to extract
        int endPage = Integer.MAX_VALUE;
        // Writer used to generate the text file
        Writer output = null;
        // PDF document held in memory
        PDDocument document = null;
        try {
            try {
                // First try to load the file as a URL; if that throws, load it from the local file system.
                URL url = new URL(pdfFile);
                // Note: unlike earlier versions, the parameter here is a file, not a URL.
                document = PDDocument.load(pdfFile);
                // Get the PDF file name
                String fileName = url.getFile();
                // Name the generated txt file after the original PDF
                if (fileName.length() > 4) {
                    File outputFile = new File(fileN ......
 
