Extracting recognizable images from PDF files

Source: Internet
Author: User

PDF (Portable Document Format) has become an open standard for international electronic document delivery. It is an output file format that followed the PostScript file format. PDF overcomes the compatibility problems frequently encountered when sharing electronic files, letting users browse and exchange files freely on the Internet, which makes it an ideal format for modern electronic document delivery. PDF files contain a large number of images. Beyond ordinary image recognition, recognizing the images in PDF files is significant in its own right because PDF is so well suited to the network: searches for PDF files make up a large proportion of Internet searches. Searching PDF files by image content requires recognizing their images, and image recognition software can only process those images once they have been extracted from the PDF file. Extracting the images in PDF files and converting them into recognizable formats will therefore promote the development of image recognition and information processing.

1. PDF file structure

1.1 The organizational structure of a PDF file

A standard PDF file consists of four parts:

(1) Header: specifies the version of the PDF specification that the file complies with.

(2) Body: contains the series of indirect objects that make up the PDF document.

(3) Cross-reference table: records information about the file's indirect objects. Each indirect object corresponds to one entry in the table, which gives the location of the object in the file body and indicates whether the object is in use. The table consists of one or more cross-reference sections. Initially the entire table contains one cross-reference section (two if the file is linearized), and one more section is added each time the file is updated. Each cross-reference section begins with the keyword xref on a line by itself, followed by one or more cross-reference subsections, which can appear in any order.

(4) Trailer: gives the position of the cross-reference table in the file and the positions of certain special objects in the file body. The trailer ends with the end-of-file marker %%EOF. Immediately before the marker come the keyword startxref and the byte offset, counted from the beginning of the file, of the keyword xref in the last cross-reference section. Before the keyword startxref is the trailer dictionary, which consists of the keyword trailer followed by a series of key-value pairs enclosed in << >>, as shown in Figure 2.

1.2 PDF file image classification

PDF files contain two types of images:

① Image XObject (external object): a referenced object with a name, defined outside the content stream. The internal description of an XObject depends on its type.

② Inline image: a small image whose attributes and data are embedded directly in a content stream. The kinds of images that can be represented this way are limited, and the image data is generally no larger than 4 KB.
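The layout described in 1.1 can be illustrated with a minimal Python sketch that reads the header version, follows startxref to the last cross-reference section, and parses its entries. The function names and the returned dictionary are hypothetical, chosen only for this illustration. The sketch assumes a classic cross-reference table whose entries each sit on their own line; it does not handle cross-reference streams, and it reads only the last section, not the chain of earlier ones.

```python
import re


def parse_xref_section(data, offset):
    """Parse one classic cross-reference section starting at `offset`.

    Returns {object number: (byte offset, generation, in_use)}.
    """
    if not data.startswith(b"xref", offset):
        raise ValueError("no xref keyword at the given offset")
    lines = data[offset:].splitlines()
    entries = {}
    i = 1  # lines[0] is the xref keyword itself
    while i < len(lines) and not lines[i].startswith(b"trailer"):
        parts = lines[i].split()
        if not parts:  # tolerate blank lines
            i += 1
            continue
        # Subsection header: first object number and entry count
        first, count = int(parts[0]), int(parts[1])
        for k in range(count):
            off, gen, kind = lines[i + 1 + k].split()
            entries[first + k] = (int(off), int(gen), kind == b"n")
        i += 1 + count
    return entries


def read_pdf_skeleton(data):
    """Return the header version, the startxref offset, and the xref entries."""
    header = re.match(rb"%PDF-(\d+\.\d+)", data)
    if header is None:
        raise ValueError("missing %PDF- header")
    version = header.group(1).decode("ascii")

    # The last startxref keyword is followed by the offset of the last
    # cross-reference section and then by the %%EOF marker.
    pos = data.rfind(b"startxref")
    if pos < 0:
        raise ValueError("missing startxref keyword")
    m = re.match(rb"startxref\s+(\d+)\s+%%EOF", data[pos:])
    if m is None:
        raise ValueError("malformed trailer")
    xref_offset = int(m.group(1))

    return {
        "version": version,
        "xref_offset": xref_offset,
        "entries": parse_xref_section(data, xref_offset),
    }
```

Reading from the end of the file first mirrors the way the trailer points back at the cross-reference table, so the body never has to be scanned sequentially.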

2. Extracting recognizable images from the PDF file

2.1 Processing flow

The key to recognizing the images in a PDF file is locating exactly where the images are stored in the file. Locating them is divided into six layers that are independent of one another yet linked in sequence.

Figure 3. Process of finding the image information to be processed in the PDF file

2.2 Processing the PDF file

The PDF file is processed as follows:

Step 1: Process the file header. Obtain the version information and determine whether the program supports that version.

Step 2: Process the cross-reference table and the trailer.

Step 3: Process the catalog object.

Step 4: Process the page tree. Look up the object identifier of the page tree root in the object information record table to get the offset of the page root object, and use the offset to find the page root object in the file. Obtain the total number of pages in the document from the Count entry of the page root object and allocate space for the page record table. The program uses the page record table to record, in order, the object identifier of the page object for each page of the document. Traverse the page tree recursively from the page root and fill in the page record table.

Step 5: Create the image object record table. This table records the object identifiers of the image XObjects (excluding thumbnails, alternate images, and image masks, which are not recognized) in the order in which the images appear when the file is browsed. For each page object identifier in the page record table, proceed as follows:

① Look up the object identifier in the object information record table to get the offset of the page object, and use the offset to find the page object in the file;

② Check whether a Resources entry is written in the page object;

③ If a Resources entry is written, check whether the resource dictionary contains an XObject entry. If so, check whether any item in the XObject entry refers to an image object, and record the object identifier from each such indirect reference in the image object record table.

The Resources entry is required for a page object and can be inherited. Here, however, only the Resources entry written in the page object itself is searched. The only purpose of the search is to find the optional XObject entry of a page, and if a page's resources contain an XObject entry, those resources should be specific to that page rather than inherited.

Thumbnails, alternate images, and image masks in a PDF file can also be described as image XObjects. These kinds of images are not processed during image recognition, so the object identifiers of the corresponding image XObjects should not be stored in the image object record table. The XObject entry in a page's resource dictionary does not list indirect references to them, so filling in the image object record table as described above never records their identifiers. The identifiers in the image object record table follow the order in which the images appear in the document when the file is opened in a viewer.

Step 6: Locate each image by performing the following steps:

① Read the image XObject;

② Read the image content and record the image attributes (including the length of the stream data), the offset of the image data in the PDF file, and the filter used. If the filter has parameters, record them as well. The stream data generally does not reside in an external file in this case;

③ Image recognition usually requires images of a certain size, so the image width and height must fall within a given range. Images that do not meet the requirement are not processed. Some filters used for image stream data correspond to compression and decompression algorithms that are subject to copyright, and images that use them are not processed any further.

The PDF file processing flow is shown in Table 1.
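Steps 4 to 6 can be sketched in Python under a strong simplifying assumption: the objects have already been parsed into plain dictionaries keyed by object number. The Ref class, the dictionary layout, and the min_size and max_size limits are all hypothetical stand-ins for this illustration (the text only says the size must be "within a certain range"). The sketch shows the recursive page tree walk and the filtering of image XObjects, not real PDF parsing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for an indirect reference such as `12 0 R`."""
    num: int


def collect_pages(objects, node_ref, pages=None):
    """Step 4: walk the page tree depth-first, listing page objects in order."""
    if pages is None:
        pages = []
    node = objects[node_ref.num]
    if node["Type"] == "Pages":      # intermediate node: descend into Kids
        for kid in node["Kids"]:
            collect_pages(objects, kid, pages)
    else:                            # leaf node: an actual page
        pages.append(node_ref)
    return pages


def collect_images(objects, pages, min_size=16, max_size=10000):
    """Steps 5 and 6: record image XObjects page by page.

    Only the Resources written in the page object itself are searched.
    Image masks and images outside the assumed size range are skipped.
    """
    images = []
    for page_ref in pages:
        page = objects[page_ref.num]
        xobjects = page.get("Resources", {}).get("XObject", {})
        for ref in xobjects.values():
            obj = objects[ref.num]
            if obj.get("Subtype") != "Image" or obj.get("ImageMask", False):
                continue
            width, height = obj["Width"], obj["Height"]
            if not (min_size <= width <= max_size
                    and min_size <= height <= max_size):
                continue
            images.append(ref)
    return images
```

With an object table rooted at object 1, `collect_images(objects, collect_pages(objects, Ref(1)))` yields the references in page order. The order within a page follows the resource dictionary, which may differ from the order in which the content stream actually paints the images.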

3. Post-processing the images for recognition

3.1 Converting the processed images into recognizable image files

Currently, most image recognition software can read images in the bitmap, TIFF, and JPEG formats.

(1) If an image XObject does not use a filter, fill in the header of the bitmap file to be output from the recorded image object attributes, then copy the raw image stream data into the image data area of that bitmap file.

(2) If the image XObject uses a filter, there are two ways to output a recognizable image file:

① Full decoding method: fill in the header of the bitmap file to be output from the recorded image object attributes, decode the image data stream with the decoding algorithm specified by the filter, and fill the image data area of the output file with the decoded image stream data;

② Selective decoding method: some filters correspond to a compression algorithm used by an image format that the recognition software can read. If the image object uses such a filter, the data does not need to be decoded. The output file type is determined by the type of the filter, the recorded image information is used to fill in the header and the other information tables of an output file in that format, and the still-encoded image stream data is copied directly into the data area of the output image file. If the filter does not correspond to a compression algorithm that the recognition software can read, the data is decoded and output as a bitmap.

3.2 Process of producing recognizable image files

This example describes how to create a recognizable output file when the image XObject to be processed uses CCITTFaxDecode as its filter. The CCITTFaxDecode filter used in PDF corresponds to data encoded with CCITT Group 3 or Group 4, and these standards are among the compression algorithms used by TIFF image files. In this case both the full decoding method and the selective decoding method can be used. This example uses the selective decoding method to produce the output image file, which involves the TIFF 6.0 image file format.

3.3 Creating the recognizable output file

The output file is generated in the following steps:

Step 1: Write the file header.

① Write 49h 49h 2Ah 00h to bytes 0 to 3. The bytes 49h 49h indicate that the TIFF file stores the low-order byte first, and 2Ah 00h is the value 42 that identifies the file as TIFF.

② Write the offset of the IFD in the next four bytes. The value can be computed with the following formula: IFD offset = ((8 + image stream data length + 1) >> 1) << 1 (the shift operations ensure that the IFD starts on a word boundary).

Step 2: Write the image data. Write the image stream data from the PDF file to the output file, starting at byte 8.

Step 3: Fill in the IFD. Using the IFD offset, move the read/write pointer of the output file to the start of the IFD. Write the SHORT value 0Bh in the first two bytes, indicating that 11 tags follow. TIFF defines many tags, but the 11 tags listed here are enough to display the image correctly. Fill in the following tags based on the recorded image attributes and filter parameters: ImageWidth, ImageLength, Compression, PhotometricInterpretation, StripOffsets, RowsPerStrip, StripByteCounts, XResolution, YResolution, and T4Options or T6Options. Finally, write 0 in the next four bytes to indicate that no further IFD follows.
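The three steps above can be sketched in Python with the standard struct module. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name and its parameters are hypothetical, the resolution value dpi is an assumed placeholder, and the caller is expected to choose photometric from the filter parameters. The sketch writes only the ten tags named in the text and derives the tag count from that list, and it treats a negative K filter parameter as Group 4 and any other value as Group 3.

```python
import struct


def ccitt_to_tiff(stream, width, height, k=-1, photometric=0, dpi=200):
    """Wrap CCITT-encoded stream data in a little-endian TIFF container."""
    group4 = k < 0
    compression = 4 if group4 else 3
    options_tag = 293 if group4 else 292        # T6Options or T4Options
    options = 0 if group4 else (1 if k > 0 else 0)  # bit 0: 2-D coding

    # Step 1: header "II", the value 42, and the IFD offset rounded up
    # to an even (word-aligned) position.
    ifd_offset = ((8 + len(stream) + 1) >> 1) << 1
    out = bytearray(struct.pack("<2sHI", b"II", 42, ifd_offset))

    # Step 2: the encoded image data starts at byte 8; pad to the IFD.
    out += stream
    out += b"\x00" * (ifd_offset - len(out))

    # Step 3: IFD entries in ascending tag order: (tag, type, value).
    SHORT, LONG, RATIONAL = 3, 4, 5
    tags = [
        (256, LONG, width),            # ImageWidth
        (257, LONG, height),           # ImageLength
        (259, SHORT, compression),     # Compression
        (262, SHORT, photometric),     # PhotometricInterpretation
        (273, LONG, 8),                # StripOffsets
        (278, LONG, height),           # RowsPerStrip (single strip)
        (279, LONG, len(stream)),      # StripByteCounts
        (282, RATIONAL, None),         # XResolution
        (283, RATIONAL, None),         # YResolution
        (options_tag, LONG, options),  # T4Options or T6Options
    ]
    # A RATIONAL does not fit in the 4-byte value field, so both
    # resolution entries point at one value stored after the IFD.
    rational_offset = ifd_offset + 2 + 12 * len(tags) + 4
    out += struct.pack("<H", len(tags))
    for tag, typ, value in tags:
        if typ == RATIONAL:
            value = rational_offset
        out += struct.pack("<HHII", tag, typ, 1, value)
    out += struct.pack("<I", 0)        # no further IFD
    out += struct.pack("<II", dpi, 1)  # resolution as dpi/1
    return bytes(out)
```

Writing the result with `open("page.tif", "wb").write(ccitt_to_tiff(data, w, h))` gives a file whose image data is byte-for-byte the PDF stream data, which is the point of the selective decoding method: nothing is decompressed along the way.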

4. A PDF file can use many kinds of filters, but the encoding and decoding algorithms of some filters, such as LZWDecode, are subject to copyright restrictions, so images that use such filters cannot be decoded to extract recognizable images. Although PDF allows many filters, the filters an application uses are usually fixed, and some filters are rarely applied to image stream data. In general, image stream data is processed with the CCITTFaxDecode or DCTDecode filter.
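The choice between the two output methods can be summarized in a small Python sketch. The function name, the returned labels, and the two lookup tables are hypothetical. The sketch assumes that only a single CCITTFaxDecode or DCTDecode filter is copied through without decoding, and that any image whose filter list contains LZWDecode is skipped, as the text describes.

```python
# Filters whose encoded data can be copied straight into an output file
# of the matching format (selective decoding method).
PASSTHROUGH = {"CCITTFaxDecode": "tiff", "DCTDecode": "jpeg"}

# Filters the text says are not decoded.
SKIPPED = {"LZWDecode"}


def plan_output(filters):
    """Return (method, output format) for an image XObject's filter list."""
    if any(name in SKIPPED for name in filters):
        return ("skip", None)
    if not filters:
        return ("copy", "bmp")      # unfiltered data goes into a bitmap
    if len(filters) == 1 and filters[0] in PASSTHROUGH:
        return ("copy", PASSTHROUGH[filters[0]])
    return ("decode", "bmp")        # full decoding method
```

For example, `plan_output(["DCTDecode"])` returns `("copy", "jpeg")`, while a FlateDecode image falls through to full decoding and bitmap output.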

 

This should be from "Extraction of recognizable images from PDF files", but I cannot find the free e-book. If anyone finds it, please let me know!
