Reading PDF content with Python

Source: Internet
Author: User

One evening, while reading the book "Python Network Data Collection", I came across code that reads PDF content. It reminded me that a few days earlier GooSeeker had released a crawl rule for scraping the PDF content of a web page; that rule applies to the case where the PDF content is already rendered as HTML.

The Python code below instead reads the contents of the PDF file itself, whether the file is on the Internet or on the local disk. I think it is a valuable reference, so I am posting it as a note.
The code relies on the third-party library pdfminer3k: it parses the PDF and writes the extracted text to a StringIO object (an in-memory file object), from which the text is then retrieved as a single string.

from urllib.request import urlopen
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO
from io import open

def readPDF(pdfFile):
    # Shared resource manager used by the interpreter and the converter
    rsrcmgr = PDFResourceManager()
    # In-memory text buffer that receives the extracted text
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)

    # Parse the PDF and send its text through the converter into retstr
    process_pdf(rsrcmgr, device, pdfFile)
    device.close()

    # Collect everything written to the buffer as one string
    content = retstr.getvalue()
    retstr.close()
    return content

pdfFile = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf")
outputString = readPDF(pdfFile)
print(outputString)
pdfFile.close()
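
The role StringIO plays in the listing can be seen in isolation. The following is a minimal sketch using only the standard library; the function name collect_text and the sample strings are made up for illustration, and the loop merely stands in for a converter that writes text to a file object.

```python
from io import StringIO

def collect_text(chunks):
    """Write text chunks to an in-memory buffer and return the combined string."""
    buffer = StringIO()          # file-like object that lives in memory
    for chunk in chunks:
        buffer.write(chunk)      # anything expecting a text file object can write here
    content = buffer.getvalue()  # read back everything written so far
    buffer.close()
    return content

# Illustrative input only; a real converter would supply the extracted text.
print(collect_text(["first page\n", "second page\n"]))
```

Because the buffer behaves like a text file, a component that writes its output to a file object can be handed a StringIO instead, and no temporary file is needed.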

If the PDF file is on your computer, replace the pdfFile object returned by urlopen with an ordinary file object from open(), opened in binary mode ("rb").
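
One way to handle both cases is a small helper that returns a binary file object for either kind of source. This is only a sketch: open_pdf_source is a hypothetical name, not part of pdfminer3k, and the file name in the usage comment is assumed for illustration.

```python
from urllib.request import urlopen

def open_pdf_source(source):
    """Return a binary file object for a URL or a local path (hypothetical helper)."""
    if source.startswith(("http://", "https://")):
        return urlopen(source)   # remote PDF: the response object yields bytes
    return open(source, "rb")    # local PDF: open in binary mode

# Usage, assuming the readPDF function from the listing and a local
# file named chapter1.pdf (an assumed name):
# pdfFile = open_pdf_source("chapter1.pdf")
# try:
#     print(readPDF(pdfFile))
# finally:
#     pdfFile.close()
```

Either way the caller gets an object that can be read as bytes and should be closed when the text has been extracted.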

This article is from the "Fullerhua blog"; please do not reprint it.
