Crawler PDF Parsing with PDFMiner

Source: Internet
Author: User
Tags: pdf, parser

Install: pip install pdfminer

Crawling data is the first phase of a data analysis project. Some of the crawled files are stored in PDF format, sometimes encrypted, and need to be parsed after download. PDFMiner is the tool used here for that.

Let's start by introducing what PDFMiner is.

Here is the official English introduction:

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes than text analysis.

Its use is easiest to learn through two main examples, one for each command-line tool.

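Example 1: pdf2txt.py

$ pdf2txt.py -o output.html samples/naacl06-shinyama.pdf
(extract text as an HTML file whose filename is output.html)

$ pdf2txt.py -V -c euc-jp -o output.html samples/jo.pdf
(extract a Japanese HTML file in vertical writing, CMap is required)

$ pdf2txt.py -P mypassword -o output.txt secret.pdf
(extract text from an encrypted PDF file)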

Parameters:

-o filename
    Specifies the output file name. By default, it prints the extracted contents to stdout in text format.
-p pageno[,pageno,...]
    Specifies the comma-separated list of the page numbers to be extracted. Page numbers start at one. By default, it extracts text from all pages.
-c codec
    Specifies the output codec.
-t type
    Specifies the output format. The following formats are currently supported:
        text: text format. (Default)
        html: HTML format. Not recommended for extraction purposes because the markup is messy.
        xml: XML format. Provides the most information.
        tag: "Tagged PDF" format. A tagged PDF has its own contents annotated with HTML-like tags. pdf2txt tries to extract its content streams rather than inferring its text locations. Tags used here are defined in the PDF specification (see §10.7 "Tagged PDF").
-I image_directory
    Specifies the output directory for image extraction. Currently only JPEG images are supported.
-M char_margin
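
When a crawler has many downloaded PDFs to convert, the options above can be assembled programmatically. The following is a minimal sketch, assuming pdf2txt.py is installed and on PATH. The helper names build_pdf2txt_args and run_pdf2txt and their parameter names are hypothetical; they only map onto the flags described in this article (-o, -p for one-based page numbers, -c, -t, and -P for the password).

```python
import subprocess


def build_pdf2txt_args(pdf_path, output=None, pages=None, codec=None,
                       fmt=None, password=None):
    """Build an argument list for pdf2txt.py (hypothetical helper)."""
    args = ["pdf2txt.py"]
    if output:
        args += ["-o", output]
    if pages:
        # Page numbers are one-based and comma-separated.
        args += ["-p", ",".join(str(p) for p in pages)]
    if codec:
        args += ["-c", codec]
    if fmt:
        args += ["-t", fmt]
    if password:
        args += ["-P", password]
    args.append(pdf_path)
    return args


def run_pdf2txt(pdf_path, **options):
    """Run pdf2txt.py; assumes the script is installed and on PATH."""
    return subprocess.call(build_pdf2txt_args(pdf_path, **options))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when crawled file names contain spaces.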

Example 2: dumppdf.py

$ dumppdf.py -r -i6 foo.pdf > pic.jpeg
(extract a JPEG image)

Parameters:

-a
    Instructs to dump all the objects. By default, it only prints the document trailer (like a header).
-i objno,objno,...
    Specifies PDF object IDs to display. Comma-separated IDs, or multiple -i options are accepted.
-p pageno,pageno,...
    Specifies the page number to be extracted. Comma-separated page numbers, or multiple -p options are accepted. Note that page numbers start at one, not zero.
-r (raw) -b (binary) -t (text)
    Specifies the output format of stream contents. Because the contents of stream objects can be very large, they are omitted when none of the options above is specified. With the -r option, the "raw" stream contents are dumped without decompression. With the -b option, the decompressed contents are dumped as a binary blob. With the -t option, the decompressed contents are dumped in a text format, similar to repr(). When the -r or -b option is given, no stream header is displayed, for the ease of saving it to a file.
-T
    Shows the table of contents.
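
The difference between the three stream output formats can be illustrated without a PDF at hand. The sketch below is not dumppdf's code: it uses zlib from the standard library as a stand-in for a Flate-compressed stream, and the function name dump_stream, its mode names, and the sample content are made up for illustration.

```python
import zlib

# Stand-in for a Flate-compressed PDF stream (made-up content).
decoded = b"BT /F1 12 Tf (Hello) Tj ET"
raw = zlib.compress(decoded)


def dump_stream(raw_bytes, mode):
    """Mimic the three stream output modes (hypothetical helper)."""
    if mode == "raw":
        # Like -r: the stored bytes, without decompression.
        return raw_bytes
    data = zlib.decompress(raw_bytes)
    if mode == "binary":
        # Like -b: the decompressed bytes as a binary blob.
        return data
    if mode == "text":
        # Like -t: a repr()-style text rendering of the decompressed bytes.
        return repr(data)
    raise ValueError("unknown mode: %r" % mode)
```

This also suggests why the JPEG example uses -r: image data of that kind is typically stored already encoded as JPEG, so the raw stream bytes can be saved directly as an image file.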

Write your own PDF parsing script:

# -*- coding: utf-8 -*-
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams, LTTextBoxHorizontal
from pdfminer.converter import PDFPageAggregator
import os

# os.chdir(r'F:\test')
fp = open('pdf/1202268749.pdf', 'rb')
# Create a PDF document parser
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure
document = PDFDocument(parser)
# Check if the file allows text extraction
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
else:
    # Create a PDF resource manager object to store shared resources
    rsrcmgr = PDFResourceManager()
    # Set parameters for layout analysis
    laparams = LAParams()
    # Create a PDF device object
    # device = PDFDevice(rsrcmgr)
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    # Create a PDF interpreter object
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Process each page
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
        # Receive the LTPage object for this page
        layout = device.get_result()
        for x in layout:
            if isinstance(x, LTTextBoxHorizontal):
                # Append in binary mode so the UTF-8 bytes are written as-is
                with open('a.html', 'ab') as f:
                    f.write(x.get_text().encode('utf-8') + b'\n')
fp.close()
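
Because PDFMiner reports the location of each piece of text, the extracted boxes can be put into reading order before they are written out. The following is a minimal sketch, assuming each box is given as a (bbox, text) pair where bbox is (x0, y0, x1, y1) in PDF coordinates, with y growing upward from the bottom of the page. The function name reading_order and the sample boxes are made up so that the example runs without a PDF file.

```python
def reading_order(boxes):
    """Sort (bbox, text) pairs top-to-bottom, then left-to-right.

    bbox is (x0, y0, x1, y1); a larger y1 means higher on the page.
    """
    return sorted(boxes, key=lambda item: (-item[0][3], item[0][0]))


# Made-up boxes standing in for the text boxes of one page.
boxes = [
    ((72, 100, 300, 120), "footer"),
    ((72, 700, 300, 720), "title"),
    ((320, 400, 540, 420), "right column"),
    ((72, 400, 300, 420), "left column"),
]
ordered = [text for _, text in reading_order(boxes)]
print(ordered)
```

With the parsing script, such pairs could be built from each LTTextBoxHorizontal as (x.bbox, x.get_text()). Sorting on the top edge alone is a simplification: multi-column pages usually need the columns separated first.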

Reference:

Pdfminer Official website: http://www.unixuser.org/~euske/python/pdfminer/index.html

http://www.cnblogs.com/RoundGirl/p/4979267.html
