Python reads PDF content

Source: Internet
Author: User
1. Introduction

In the evening, I was reading a book on web data collection with Python and came across code for reading PDF content. It reminded me that a few days ago souke published a crawling rule for capturing PDF content on a web page; the rule treats the PDF content as HTML so that it can be captured like a web page. The trick is that Firefox can parse a PDF and convert it into HTML tags, such as p tags, so the GooSeeker web page capture software can capture structured content from it just as it would from an ordinary web page.

This raises a question: how far can a Python crawler go with PDF content? The following describes the experiment and its source code.

2. Python source code for converting a PDF file to text

The following Python source code reads the content of a PDF file (from the Internet or from a local disk), converts it to text, and prints it. The code mainly uses a third-party library named pdfminer3k: a TextConverter writes the extracted text into a StringIO object, which is an in-memory file object, and the contents of that object are then returned as a string. (For the source code, see the GitHub source at the end of the article)

    from urllib.request import urlopen
    from pdfminer.pdfinterp import PDFResourceManager, process_pdf
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from io import StringIO
    from io import open

    def readPDF(pdfFile):
        # Shared resource manager and an in-memory buffer for the output text
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        laparams = LAParams()
        # The converter writes the extracted text into retstr
        device = TextConverter(rsrcmgr, retstr, laparams=laparams)

        process_pdf(rsrcmgr, device, pdfFile)
        device.close()

        # Copy the text out of the buffer before closing it
        content = retstr.getvalue()
        retstr.close()
        return content

    pdfFile = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf")
    outputString = readPDF(pdfFile)
    print(outputString)
    pdfFile.close()

If the PDF file is on your computer, replace the pdfFile object returned by urlopen with an ordinary file object returned by open(), opened in binary mode ('rb').
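The choice between a remote and a local PDF can be wrapped in a small helper. The following is only a minimal sketch: the name open_pdf_source is hypothetical and not part of the original code, and it assumes that anything without an http or https scheme is a local path. It returns a binary file object that could be passed to readPDF in place of the urlopen result.

```python
from urllib.parse import urlparse
from urllib.request import urlopen


def open_pdf_source(location):
    """Return a binary file-like object for a URL or a local path.

    Hypothetical helper: http/https locations are fetched with urlopen,
    everything else is treated as a local file opened in binary mode.
    """
    scheme = urlparse(location).scheme
    if scheme in ("http", "https"):
        return urlopen(location)
    # PDF data is binary, so the local file is opened with 'rb'
    return open(location, "rb")
```

With this helper, the last lines of the script would become pdfFile = open_pdf_source(path_or_url), followed by the same readPDF, print, and close calls as before.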

3. Outlook

This experiment only converts a PDF file to text; it does not convert it into HTML tags as described at the beginning. Whether that capability is available in the Python programming environment is left for future exploration.
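As a starting point for that exploration, the extracted text can at least be wrapped in p tags by hand. The sketch below is not the Firefox conversion mentioned in the introduction and recovers no layout information. It assumes that paragraphs in the extracted text are separated by blank lines, and the function name text_to_html_paragraphs is made up for illustration.

```python
import re
from html import escape


def text_to_html_paragraphs(text):
    """Wrap blank-line-separated blocks of text in <p> tags."""
    blocks = re.split(r"\n\s*\n", text.strip())
    paragraphs = []
    for block in blocks:
        # Collapse line breaks and runs of whitespace inside one paragraph
        cleaned = " ".join(block.split())
        if cleaned:
            # Escape &, < and > so the text is safe inside HTML
            paragraphs.append("<p>{}</p>".format(escape(cleaned)))
    return "\n".join(paragraphs)
```

Calling text_to_html_paragraphs(outputString) on the string returned by readPDF would give one p element per paragraph, which an HTML-oriented capture tool could then process.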
