I was reading the book "Python Network Data Collection" at night and came across the code that reads PDF content. It reminded me that a few days ago GooSeeker released a crawl rule for scraping the PDF content of a web page; that rule applies to the case where the PDF content is already presented as HTML.
The Python code in the book, by contrast, reads the contents of the PDF file itself, whether the file is on the Internet or stored locally. I think it is a valuable reference, so I am posting it here as a note.
The code relies mainly on the third-party library pdfminer3k to read the PDF into a string, using StringIO as an in-memory file object that receives the extracted text.
from urllib.request import urlopen
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO
from io import open

def readPDF(pdfFile):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # in-memory buffer that receives the extracted text
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)

    # Parse the PDF and write its text into retstr through the converter
    process_pdf(rsrcmgr, device, pdfFile)
    device.close()

    content = retstr.getvalue()
    retstr.close()
    return content

pdfFile = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf")
outputString = readPDF(pdfFile)
print(outputString)
pdfFile.close()
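To see what the StringIO buffer contributes, it can help to look at it in isolation. The sketch below uses only the standard library. The function write_text_chunks is a hypothetical stand-in for a converter, which writes text to whatever file-like object it is given; it is not part of pdfminer3k.

```python
from io import StringIO

def write_text_chunks(outfp, chunks):
    """Hypothetical stand-in for a converter: write each text chunk to outfp."""
    for chunk in chunks:
        outfp.write(chunk)

def collect_text(chunks):
    """Gather everything written to an in-memory buffer and return it as a str."""
    buffer = StringIO()  # behaves like a text file, but lives in memory
    try:
        write_text_chunks(buffer, chunks)
        return buffer.getvalue()  # must be read before the buffer is closed
    finally:
        buffer.close()

text = collect_text(["first page\n", "second page\n"])
print(text)
```

The ordering matters: getvalue() has to be called before close(), because a closed StringIO can no longer be read.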
If the PDF file is on your computer, replace the pdfFile object returned by urlopen() with an ordinary file object from open(), opened in binary mode ("rb").
This article is from the "Fullerhua blog"; reprinting is declined.
Python reads PDF content