Crawler PDF Parsing with PDFMiner

Last Update:2016-04-29 Source: Internet

Author: User

Tags pdf parser


Install: pip install pdfminer

Crawling data is the first phase of a data analysis project. Some of the crawled files come in PDF format and need to be parsed after download, which can be done with the PDFMiner tool.

Let's start by introducing what PDFMiner is.

Here is an official English introduction:

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for purposes other than text analysis.

Learn its use through two main examples, the pdf2txt.py and dumppdf.py command-line tools.

Example 1:

$ pdf2txt.py -o output.html samples/naacl06-...

$ pdf2txt.py -V -c euc-jp -o output.html samples/...

$ pdf2txt.py -P mypassword ... (extract text from an encrypted PDF file)

Parameters:

-o filename: Specifies the output file name. By default, it prints the extracted contents to stdout in text format.

-p pageno[,pageno,...]: Specifies the comma-separated list of the page numbers to be extracted. Page numbers start at one. By default, it extracts text from all pages.

-c codec: Specifies the output codec.

-t type: Specifies the output format. The following formats are currently supported.
    text: TEXT format. (Default)
    html: HTML format. Not recommended for extraction purposes because the markup is messy.
    xml: XML format. Provides the most information.
    tag: "Tagged PDF" format. A tagged PDF has its own contents annotated with HTML-like tags. pdf2txt tries to extract its content streams rather than inferring its text locations. Tags used here are defined in the PDF specification (see §10.7 "Tagged PDF").

-I image_directory: Specifies the output directory for image extraction. Currently only JPEG images are supported.

-M char_margin
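
When a crawler has to convert many downloaded files, the options above can be assembled in code rather than typed by hand. The following is a minimal sketch, not part of PDFMiner itself: the helper name build_pdf2txt_command and its keyword arguments are hypothetical, and it assumes pdf2txt.py is available on the PATH. It only builds the argument list and does not run anything.

```python
def build_pdf2txt_command(pdf_path, output=None, pages=None,
                          codec=None, fmt=None, password=None):
    """Build an argument list for pdf2txt.py (hypothetical helper).

    pages is an iterable of one-based page numbers, matching the
    -p option, which counts pages from one.
    """
    cmd = ["pdf2txt.py"]
    if output is not None:
        cmd += ["-o", output]
    if pages:
        cmd += ["-p", ",".join(str(p) for p in pages)]
    if codec is not None:
        cmd += ["-c", codec]
    if fmt is not None:
        # Only the formats listed in the parameter description are accepted.
        if fmt not in ("text", "html", "xml", "tag"):
            raise ValueError("unsupported output format: %r" % (fmt,))
        cmd += ["-t", fmt]
    if password is not None:
        cmd += ["-P", password]
    cmd.append(pdf_path)
    return cmd
```

The returned list can be handed to subprocess.call. Passing a list instead of a single shell string avoids quoting problems with file names that contain spaces.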

Example 2:

$ dumppdf.py -r -i6 foo.pdf > pic.jpeg (extract a JPEG image)

Parameters:

-a: Instructs to dump all the objects. By default, it only prints the document trailer (like a header).

-i objno,objno,...: Specifies PDF object IDs to display. Comma-separated IDs, or multiple -i options are accepted.

-p pageno,pageno,...: Specifies the page number to be extracted. Comma-separated page numbers, or multiple -p options are accepted. Note that page numbers start at one, not zero.

-r (raw), -b (binary), -t (text): Specifies the output format of stream contents. Because the contents of stream objects can be very large, they are omitted when none of the options above is specified. With the -r option, the "raw" stream contents are dumped without decompression. With the -b option, the decompressed contents are dumped as a binary blob. With the -t option, the decompressed contents are dumped in a text format, similar to repr() manner. When the -r or -b option is given, no stream header is displayed, for the ease of saving it to a file.

-T: Shows the table of contents.

Write your own PDF parsing script:
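
Page numbers on the command line start at one, which is easy to get wrong when the same values are later used as indexes in Python code. The sketch below shows one way to convert a comma-separated page option into zero-based indexes. The function name parse_page_option is hypothetical. The lead assumption, worth checking against your installed version, is that the library-level page filter (the pagenos argument of PDFPage.get_pages) expects zero-based values.

```python
def parse_page_option(value):
    """Convert a one-based, comma-separated page list such as "1,3,5"
    into a set of zero-based page indexes (hypothetical helper)."""
    indexes = set()
    for part in value.split(","):
        part = part.strip()
        if not part:
            # Tolerate stray commas such as "1,,3".
            continue
        pageno = int(part)
        if pageno < 1:
            raise ValueError("page numbers start at one, got %d" % pageno)
        indexes.add(pageno - 1)
    return indexes
```

Using a set removes duplicates, so a page listed twice is only processed once.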

# -*- coding: utf-8 -*-
import io
import os

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams, LTTextBoxHorizontal
from pdfminer.converter import PDFPageAggregator

# os.chdir(r'F:\test')
fp = open('pdf/1202268749.pdf', 'rb')
# Create a PDF parser for the open file
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure
document = PDFDocument(parser)
# Check whether the file allows text extraction
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
else:
    # Create a PDF resource manager object to store shared resources
    rsrcmgr = PDFResourceManager()
    # Set parameters for layout analysis
    laparams = LAParams()
    # Create a PDF device object
    # device = PDFDevice(rsrcmgr)
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    # Create a PDF interpreter object
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Process each page
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
        # Receive the LTPage object for this page
        layout = device.get_result()
        for x in layout:
            if isinstance(x, LTTextBoxHorizontal):
                # io.open with an explicit encoding writes UTF-8 text
                # on both Python 2 and Python 3
                with io.open('a.html', 'a', encoding='utf-8') as f:
                    f.write(x.get_text() + u'\n')
fp.close()
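
The script above writes text boxes in whatever order the layout analyzer yields them. Because PDFMiner also reports where each piece of text sits on the page, the boxes can be sorted explicitly before writing. The sketch below uses plain tuples instead of real PDFMiner objects so that it runs without a PDF file: each tuple (x0, y0, x1, y1, text) stands in for what one might collect from a text box's bbox attribute and get_text(). The function name sort_reading_order and the sample coordinates are made up for illustration. It assumes PDF coordinates with the origin at the bottom-left corner, so a larger y1 means higher on the page.

```python
def sort_reading_order(boxes):
    """Sort (x0, y0, x1, y1, text) tuples top-to-bottom, then left-to-right.

    Assumes a bottom-left origin: the box with the largest top edge (y1)
    comes first, and ties are broken by the left edge (x0).
    """
    return sorted(boxes, key=lambda b: (-b[3], b[0]))


# Made-up sample data standing in for text boxes collected from one page.
boxes = [
    (72.0, 600.0, 300.0, 620.0, "second paragraph\n"),
    (72.0, 700.0, 300.0, 720.0, "title\n"),
    (320.0, 600.0, 540.0, 620.0, "right column\n"),
]
ordered = [text for _, _, _, _, text in sort_reading_order(boxes)]
```

This simple key treats boxes as being on the same line only when their top edges are exactly equal. Real pages usually need a tolerance, or the layout parameters tuned, before the order is reliable.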

Reference:

PDFMiner official website: http://www.unixuser.org/~euske/python/pdfminer/index.html

http://www.cnblogs.com/RoundGirl/p/4979267.html
