Install: pip install pdfminer
Crawling data is the first phase of a data analysis project. Some of the crawled files are stored in PDF format and have to be parsed after download; PDFMiner is the tool used for that here.
Let's start by introducing PDFMiner.
Here is the official English introduction:
PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for purposes other than text analysis.
Its use is best learned from two main examples.
Example 1:
$ pdf2txt.py -o output.html samples/naacl06-...
$ pdf2txt.py -V -c euc-jp -o output.html samples/...
$ pdf2txt.py -P mypassword ... (extract text from an encrypted PDF file)
Parameters:
-o filename
    Specifies the output file name. By default, it prints the extracted contents to stdout in text format.
-p pageno[,pageno,...]
    Specifies the comma-separated list of the page numbers to be extracted. Page numbers start at one. By default, it extracts text from all pages.
-c codec
    Specifies the output codec.
-t type
    Specifies the output format. The following formats are currently supported.
        text: text format. (Default)
        html: HTML format. Not recommended for extraction purposes because the markup is messy.
        xml: XML format. Provides the most information.
        tag: "Tagged PDF" format. A tagged PDF has its own contents annotated with HTML-like tags. pdf2txt tries to extract its content streams rather than inferring its text locations. Tags used here are defined in the PDF specification (see §10.7 "Tagged PDF").
-I image_directory
    Specifies the output directory for image extraction. Currently only JPEG images are supported.
-M char_margin
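When pdf2txt.py is called from a crawler instead of being typed by hand, the options above can be assembled in code. The following is only a minimal sketch and is not part of PDFMiner: the helper name build_pdf2txt_command and its keyword arguments are hypothetical, it only covers the -o, -p, -c, -t and -P options described in this article, and it assumes pdf2txt.py is available on the PATH.

```python
def build_pdf2txt_command(pdf_path, output=None, pages=None,
                          codec=None, fmt=None, password=None):
    """Return an argv list for pdf2txt.py (hypothetical helper)."""
    cmd = ["pdf2txt.py"]
    if output:
        cmd += ["-o", output]
    if pages:
        # Page numbers are 1-based and comma-separated, e.g. "1,3".
        cmd += ["-p", ",".join(str(p) for p in pages)]
    if codec:
        cmd += ["-c", codec]
    if fmt:
        # Only the formats listed in the parameter description above.
        if fmt not in ("text", "html", "xml", "tag"):
            raise ValueError("unsupported output format: %r" % fmt)
        cmd += ["-t", fmt]
    if password:
        cmd += ["-P", password]
    cmd.append(pdf_path)
    return cmd


if __name__ == "__main__":
    # foo.pdf is a placeholder file name.
    print(build_pdf2txt_command("foo.pdf", output="out.xml",
                                pages=[1, 3], fmt="xml"))
```

The returned list can be handed to subprocess.call or a similar function; building a list rather than a single string avoids shell quoting problems with file names that contain spaces.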
Example 2:
$ dumppdf.py -r -i6 foo.pdf > pic.jpeg (extract a JPEG image)
Parameters:
-a
    Instructs to dump all the objects. By default, it only prints the document trailer (like a header).
-i objno,objno,...
    Specifies PDF object IDs to display. Comma-separated IDs, or multiple -i options are accepted.
-p pageno,pageno,...
    Specifies the page number to be extracted. Comma-separated page numbers, or multiple -p options are accepted. Note that page numbers start at one, not zero.
-r (raw), -b (binary), -t (text)
    Specifies the output format of stream contents. Because the contents of stream objects can be very large, they are omitted when none of the options above is specified. With the -r option, the "raw" stream contents are dumped without decompression. With the -b option, the decompressed contents are dumped as a binary blob. With the -t option, the decompressed contents are dumped in a text format, similar to repr(). When the -r or -b option is given, no stream header is displayed, for ease of saving it to a file.
-T
    Shows the table of contents.
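The shell redirect in Example 2 (> pic.jpeg) can also be done from Python, which is handy inside a crawler. This is a rough sketch under a few assumptions: the helper names build_dumppdf_command and run_to_file are made up for illustration, only the -r/-b/-t and -i options described above are covered, and dumppdf.py is assumed to be on the PATH.

```python
import subprocess


def build_dumppdf_command(pdf_path, objids=(), stream_mode=None):
    """Return an argv list for dumppdf.py (hypothetical helper)."""
    cmd = ["dumppdf.py"]
    if stream_mode is not None:
        # "r" (raw), "b" (binary) or "t" (text), as described above.
        if stream_mode not in ("r", "b", "t"):
            raise ValueError("stream_mode must be 'r', 'b' or 't'")
        cmd.append("-" + stream_mode)
    if objids:
        # Comma-separated object IDs, e.g. "6" or "6,7".
        cmd += ["-i", ",".join(str(i) for i in objids)]
    cmd.append(pdf_path)
    return cmd


def run_to_file(cmd, out_path):
    """Run cmd and write its stdout to out_path, like `cmd > out_path`."""
    with open(out_path, "wb") as out:
        # Returns the exit code of the command.
        return subprocess.call(cmd, stdout=out)
```

With these two helpers, the command from Example 2 would correspond to run_to_file(build_dumppdf_command("foo.pdf", objids=[6], stream_mode="r"), "pic.jpeg"). The output file is opened in binary mode because a raw stream such as a JPEG is not text.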
Write your own PDF parsing script:
    # -*- coding: utf-8 -*-
    import io
    import os

    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfpage import PDFTextExtractionNotAllowed
    from pdfminer.pdfinterp import PDFResourceManager
    from pdfminer.pdfinterp import PDFPageInterpreter
    from pdfminer.pdfdevice import PDFDevice
    from pdfminer.layout import LAParams, LTTextBoxHorizontal
    from pdfminer.converter import PDFPageAggregator

    # os.chdir(r'F:\test')
    fp = open('pdf/1202268749.pdf', 'rb')
    # Create a PDF document parser for the file
    parser = PDFParser(fp)
    # Create a PDF document object that stores the document structure
    document = PDFDocument(parser)
    # Check whether the file allows text extraction
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed
    else:
        # Create a PDF resource manager object to store shared resources
        rsrcmgr = PDFResourceManager()
        # Set parameters for layout analysis
        laparams = LAParams()
        # Create a PDF device object
        # device = PDFDevice(rsrcmgr)
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        # Create a PDF interpreter object
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Process each page
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            # Receive the LTPage object (the layout) for this page
            layout = device.get_result()
            for x in layout:
                if isinstance(x, LTTextBoxHorizontal):
                    # io.open with an explicit encoding takes unicode text
                    # on both Python 2 and Python 3
                    with io.open('a.html', 'a', encoding='utf-8') as f:
                        f.write(x.get_text() + u'\n')
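The introduction mentions that PDFMiner can report the exact location of text on a page. One way to use that is to sort the text boxes of a page into reading order before writing them out. The sketch below is an illustration only: the function name text_boxes_in_reading_order is hypothetical, and it assumes every text item exposes a bbox tuple (x0, y0, x1, y1) and a get_text() method, as LTTextBoxHorizontal does. It is duck-typed, so it can be tried without a PDF at hand; in a real script an isinstance check against LTTextBoxHorizontal is the stricter filter.

```python
def text_boxes_in_reading_order(layout):
    """Return (bbox, text) pairs sorted top-to-bottom, then left-to-right.

    Assumes each text item has `bbox` = (x0, y0, x1, y1) and `get_text()`.
    Items without `get_text()` (lines, images, ...) are skipped.
    """
    boxes = [item for item in layout if hasattr(item, "get_text")]
    # PDF coordinates start at the bottom-left corner of the page, so a
    # larger y1 means the box sits higher on the page.
    boxes.sort(key=lambda b: (-b.bbox[3], b.bbox[0]))
    return [(b.bbox, b.get_text()) for b in boxes]
```

In the script above, the inner loop could iterate over text_boxes_in_reading_order(layout) instead of layout, writing each returned text to the output file.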
Reference:
PDFMiner official website: http://www.unixuser.org/~euske/python/pdfminer/index.html
http://www.cnblogs.com/RoundGirl/p/4979267.html