Batch-extracting text from PDF files with a Python script
This article shows, by example, how to extract the text from a batch of PDF files in Python. The details are as follows:
First, run pip install pdfminer3k to install the extension library used to process PDF files.
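After installation it can be worth confirming where the command-line converter ended up before running a batch job. The sketch below assumes the package places a script named pdf2txt.py in the interpreter's scripts directory; find_pdf2txt is a hypothetical helper name, not part of the library.

```python
import sysconfig
from pathlib import Path


def find_pdf2txt():
    """Return the path of pdf2txt.py in the scripts directory, or None if absent."""
    # sysconfig reports the scripts directory for the running interpreter
    # ("Scripts" on Windows, "bin" on most other platforms).
    scripts_dir = Path(sysconfig.get_path("scripts"))
    # Assumption: the converter is installed under the name pdf2txt.py.
    candidate = scripts_dir / "pdf2txt.py"
    return candidate if candidate.is_file() else None


path = find_pdf2txt()
print(path if path is not None else "pdf2txt.py not found in the scripts directory")
```

Looking the directory up this way avoids hard-coding a Windows-style path, which may help when the same script is run on another platform.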
import os
import sys
import time

# Collect every PDF file in the current directory.
pdfs = (name for name in os.listdir('.') if name.endswith('.pdf'))
for pdf1 in pdfs:
    # Replace spaces, hyphens and ampersands so the name is safe on the command line.
    pdf = pdf1.replace(' ', '_').replace('-', '_').replace('&', '_')
    os.rename(pdf1, pdf)
    print('=' * 30)
    print(pdf)
    txt = pdf[:-4] + '.txt'
    exe = '"' + sys.executable + '" "'
    pdf2txt = os.path.dirname(sys.executable)
    pdf2txt = pdf2txt + '\\scripts\\pdf2txt.py" -o '
    try:
        # Call the command-line tool pdf2txt.py to do the conversion.
        # If the PDF is encrypted, rewrite the code below and
        # use -P before -o to specify the password.
        cmd = exe + pdf2txt + txt + ' ' + pdf
        os.popen(cmd)
        # The conversion takes some time; 2 seconds is usually enough for a small file.
        time.sleep(2)
        # Print the first 200 characters of the converted text.
        with open(txt, encoding='utf8') as fp:
            print(fp.read(200))
    except Exception:
        # Skip any file that could not be converted or read.
        pass
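The script above waits a fixed 2 seconds, which may be too short for large files. As a possible variation (not the article's method), the sketch below uses subprocess.run, which blocks until the converter exits. The helper names sanitize_name, build_command and convert_all are hypothetical, the default script location assumes the same Windows layout as the article, and the -o and -P options are used as the article describes them.

```python
import subprocess
import sys
from pathlib import Path


def sanitize_name(name):
    """Replace spaces, hyphens and ampersands with underscores, as the article does."""
    for ch in (" ", "-", "&"):
        name = name.replace(ch, "_")
    return name


def build_command(pdf2txt_script, pdf_path, txt_path, password=None):
    """Build the converter command as an argument list, so no manual quoting is needed."""
    cmd = [sys.executable, str(pdf2txt_script)]
    if password is not None:
        # -P goes before -o and supplies the password of an encrypted PDF.
        cmd += ["-P", password]
    cmd += ["-o", str(txt_path), str(pdf_path)]
    return cmd


def convert_all(folder=".", pdf2txt_script=None, timeout=60):
    """Convert every .pdf in folder; return {file name: first 200 characters}."""
    if pdf2txt_script is None:
        # Assumption: same location as in the article (Windows layout).
        pdf2txt_script = Path(sys.executable).parent / "Scripts" / "pdf2txt.py"
    previews = {}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        target = pdf.with_name(sanitize_name(pdf.name))
        txt = target.with_suffix(".txt")
        try:
            if target != pdf:
                pdf.rename(target)
            # run() returns only after the conversion has finished, so no sleep is needed.
            subprocess.run(build_command(pdf2txt_script, target, txt),
                           check=True, timeout=timeout)
            text = txt.read_text(encoding="utf8", errors="replace")
            previews[target.name] = text[:200]
        except (subprocess.SubprocessError, OSError) as exc:
            # Report the failure instead of silently ignoring it.
            print(f"skipped {pdf.name}: {exc}")
    return previews
```

Calling convert_all('.') in a folder of PDF files would return the previews keyed by the renamed file names. Files that fail to convert are reported and skipped.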
Source: python hut
That is all for this article. I hope it is helpful for your learning.