1. Introduction
This evening I was reading the book Web Scraping with Python and came across the code for reading PDF content. It reminded me that a few days ago GooSeeker published a crawling rule for PDF content on a webpage, a rule that lets PDF content be captured as if it were HTML. The trick is that Firefox's built-in PDF parsing converts the PDF into HTML tags, such as p tags, so the GooSeeker web page capture software can capture structured content just as it would from an ordinary web page.
This raises a question: how far can a Python crawler go in doing the same thing? The experiment and its source code are described below.
2. Python source code for converting a PDF file to text
The following Python source code reads the content of a PDF file (from the Internet or from a local disk), converts it to text, and prints it. The code relies on a third-party library named pdfminer3k, which writes the extracted text into a StringIO buffer (an in-memory file object); the buffer's contents are then returned as a string. (For the source code, see the GitHub source at the end of the article.)
from urllib.request import urlopen
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO
from io import open

def readPDF(pdfFile):
    rsrcmgr = PDFResourceManager()
    # In-memory text buffer that receives the extracted text
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)

    # Parse the PDF and write its text into retstr through the converter
    process_pdf(rsrcmgr, device, pdfFile)
    device.close()

    content = retstr.getvalue()
    retstr.close()
    return content

# Read a PDF from the web; the response object is passed in as a file object
pdfFile = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf")
outputString = readPDF(pdfFile)
print(outputString)
pdfFile.close()
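The buffer pattern that readPDF relies on can be tried with the standard library alone. The sketch below is only an illustration: write_pages is a hypothetical stand-in for the converter (it is not part of pdfminer3k), and collect_text is a made-up name. It shows that anything written to a StringIO object can later be retrieved as one string with getvalue().

```python
from io import StringIO


def write_pages(output, pages):
    """Hypothetical stand-in for a converter: writes each page's text to a file-like object."""
    for page in pages:
        output.write(page)


def collect_text(pages):
    """Gather everything written to an in-memory buffer into a single string."""
    buffer = StringIO()              # in-memory, file-like text buffer
    write_pages(buffer, pages)       # the writer only needs a .write() method
    content = buffer.getvalue()      # read back everything written so far
    buffer.close()
    return content


if __name__ == "__main__":
    print(collect_text(["CHAPTER I\n", "Well, Prince"]))
```

Because the writer only needs an object with a write method, the same function could just as well be handed a real file opened for writing.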
If the PDF file is on your own computer, replace the pdfFile object returned by urlopen with an ordinary file object returned by open(), opened in binary mode ("rb").
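One possible way to handle both cases in a single place is a small helper that picks urlopen or open depending on what it is given. This is only a sketch: open_pdf_source is a hypothetical name, not part of the original code, and it assumes that anything not starting with http:// or https:// is a local path.

```python
from urllib.request import urlopen


def open_pdf_source(location):
    """Return a binary file-like object for a URL or a local path.

    open_pdf_source is a hypothetical helper, not part of the article's code.
    """
    if location.startswith(("http://", "https://")):
        # Remote PDF: urlopen returns a response object that is read like a binary file.
        return urlopen(location)
    # Local PDF: open in binary mode, because PDF data is not plain text.
    return open(location, "rb")
```

The returned object can then be passed to readPDF in place of the urlopen result, and closed afterwards in the same way.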
3. Outlook
This experiment only converts a PDF file to text; it does not convert it to HTML tags as described at the beginning. Whether that capability is available in the Python programming environment is left for future exploration.
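As a very rough starting point, the extracted text could be wrapped in p tags by hand. The sketch below is not the Firefox-style conversion discussed at the beginning and does not recover any layout from the PDF. It is a naive stand-in that assumes paragraphs in the extracted text are separated by blank lines, and text_to_html_paragraphs is a hypothetical name.

```python
from html import escape


def text_to_html_paragraphs(text):
    """Wrap each blank-line-separated block of text in a <p> tag (naive sketch)."""
    paragraphs = []
    for block in text.split("\n\n"):
        # Collapse line breaks and repeated spaces inside a paragraph into single spaces.
        cleaned = " ".join(block.split())
        if cleaned:
            # Escape <, > and & so the text cannot be mistaken for markup.
            paragraphs.append("<p>" + escape(cleaned) + "</p>")
    return "\n".join(paragraphs)


if __name__ == "__main__":
    print(text_to_html_paragraphs("CHAPTER I\n\nWell, Prince,\nso Genoa and Lucca"))
```

Whether blank lines really mark paragraph boundaries depends on the PDF and on the extraction settings, so the output would need checking against real documents.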