This article describes how to solve the problem of batch reading and writing .doc files in Python. It is shared here as a reference for anyone who needs it.
Objective:
The garbled-text problem when reading and writing .doc files in Java:
As we all know, when you read a .doc file in Java, whether you stream its content to the console or write it to another file, and whatever encoding you choose (UTF-8, GBK, etc.), 99% of what you see is garbled.
Cause of the garbled text when reading and writing .doc files in Java:
.doc is the file format of Word, the Office word processor developed by Microsoft. It is a binary format, not plain text: besides the characters themselves, the file stores font sizes, font colors, styles, and other formatting data. So even if the text on its own could be decoded as UTF-8, once the document carries a changed font size, a color attribute, or a style, the bytes in the file are no longer plain UTF-8 text, and 99% of what you get by decoding them as UTF-8 is garbled.
How to read and write .doc documents in Java without garbled text: (Apache PK Microsoft)
You can use the POI package developed by the Apache Software Foundation, which provides an interface for working with Microsoft Office files. Reading and writing .doc files through POI usually does not produce garbled characters. If at this point you are thinking "I can finally use Java to handle .doc files", then I have to say you are celebrating too early. As far as I know, as of December 22, 2017, the latest version of the POI package is 3.17. The version number may not mean much to you, but as far as I can tell POI 3.17 only handles the Microsoft 2007 versions of Word, Excel, PPT and so on. In other words, the poi 3.17 jar does not support the Word 2016 installed on our computers, so you can more or less give up on reading and writing Word 2016 files in Java. You can try other interfaces for handling Word, but they are no more efficient than POI. Fortunately, the official website says a new version of POI is due in December 2017, but as of December 22, 2017 I have not yet seen that jar on the official website.
Body:
Python is better suited to document processing than Java. After all, combined with regular expressions, Python is quite strong at natural language processing. A recent deep learning project of mine required parsing and processing hundreds of .doc files. As everyone knows, Python reads and writes .txt files without a hitch, whatever Chinese encoding they use. Reading and writing .docx files is also fairly smooth: at most you need to install python-docx (0.8.6) from the command line, and then you can read and write .docx documents. The specific scenario is described below.
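As a rough illustration of how smooth the .docx side is, here is a minimal sketch using python-docx. The helper names write_docx and read_docx are made up for this example and are not part of the library; only Document, add_paragraph, save, and paragraphs come from python-docx. The import sits inside the functions simply so the file still loads on a machine where python-docx is not installed yet.

```python
def write_docx(path, paragraphs):
    """Create a .docx file containing one paragraph per string (hypothetical helper)."""
    from docx import Document  # provided by the python-docx package

    document = Document()
    for text in paragraphs:
        document.add_paragraph(text)
    document.save(path)


def read_docx(path):
    """Return the text of every paragraph in a .docx file (hypothetical helper)."""
    from docx import Document

    document = Document(path)
    return [paragraph.text for paragraph in document.paragraphs]


if __name__ == "__main__":
    write_docx("sample.docx", ["你好", "A second paragraph."])
    for line in read_docx("sample.docx"):
        print(line)
```

Chinese text goes in and comes out as ordinary Python strings, with no encoding argument needed. Note that this only covers paragraph text; tables, headers, and images need other parts of the python-docx API.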
Issue: Python cannot read .doc files (as opposed to .docx files)
Solution: Use Python to batch-convert the .doc files to .docx files, and then read and write the .docx files
Problem analysis: With the python-docx (0.8.6) library, Python reads .docx files smoothly, and it reads .txt files out of the box, but it is powerless against the .doc file itself. Some readers will not be convinced: can't I just manually change the suffix of the .doc file to .docx or .txt? The answer is no. Changing only the suffix does not convert the content. You simply end up with a broken file that may not open at all, and if it does open it looks like gibberish. Being unable to handle .doc files is a built-in limitation here. Naturally we would all like to find some source code online that reads .doc files directly and just call it, but unfortunately you probably will not find such a solution. While I was stuck, I manually used "Save As" in Word to save a .doc document as a .docx document, and the converted .docx file opened successfully. So I tried to do this manual "Save As" in code, and the problem was solved.
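The point about renaming can be checked by looking at the first bytes of the file. This is a small sketch, not something from the original project: the function name sniff_container is invented for illustration. It relies on two well-known signatures: legacy Office files such as .doc are OLE2 compound files that start with D0 CF 11 E0 A1 B1 1A E1, while a .docx is a ZIP archive that starts with "PK". The check only identifies the container, so an OLE2 result could also be an old .xls or .ppt, and a ZIP result could be any ZIP file.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .doc/.xls/.ppt container
ZIP_MAGIC = b"PK\x03\x04"                       # ZIP archive, the container used by .docx


def sniff_container(path):
    """Guess the container format from the file's first bytes, ignoring its suffix."""
    with open(path, "rb") as handle:
        header = handle.read(8)
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```

A .doc file that was merely renamed to .docx still reports "ole2", which is why a .docx reader refuses it. Only a real conversion changes the bytes.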
Here is the Python code (you need to install the pypiwin32 library first, which provides the win32com module):
    # -*- coding: utf-8 -*-
    import sys
    import pickle
    import re
    import codecs
    import string
    import shutil
    from win32com import client as wc


    def doSaveAas():
        # To process files in batch, wrap this in a for loop. I processed more than
        # 100 files in one go and the code ran in under 2 minutes, which solved the
        # problem. The file paths can be changed freely. Note that SaveAs takes a lot
        # of parameters, so be careful not to get them wrong.
        word = wc.Dispatch('Word.Application')
        # The .doc file under the source path
        doc = word.Documents.Open(u'c:\\users\\x\\pycharmprojects\\1\\hello.doc')
        # The converted file under the target path; 12 is Word's wdFormatXMLDocument (.docx)
        doc.SaveAs(u'c:\\users\\x\\pycharmprojects\\1\\I am a little programmer x007.docx', 12, False, "", True, "", False, False, False, False)
        doc.Close()
        word.Quit()
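The for loop mentioned in the comment could look roughly like the sketch below. It is one possible arrangement under a few assumptions, not the exact code from the project: the names plan_conversions and convert_folder are invented here, every .doc in a single folder is converted next to its source, and only the first two SaveAs arguments (file name and format) are passed. The win32com import is placed inside convert_folder because it only works on Windows with Word installed.

```python
from pathlib import Path

WD_FORMAT_XML_DOCUMENT = 12  # Word's wdFormatXMLDocument constant, i.e. .docx


def plan_conversions(folder):
    """Pair each .doc file in `folder` with the .docx path it will be saved as."""
    pairs = []
    for source in sorted(Path(folder).resolve().iterdir()):
        if source.is_file() and source.suffix.lower() == ".doc":
            pairs.append((source, source.with_suffix(".docx")))
    return pairs


def convert_folder(folder):
    """Convert every .doc in `folder` using a single Word instance (Windows only)."""
    from win32com import client as wc  # requires pywin32 and an installed Word

    word = wc.Dispatch("Word.Application")
    converted = []
    try:
        for source, target in plan_conversions(folder):
            # Word expects absolute paths, which plan_conversions already returns.
            doc = word.Documents.Open(str(source))
            try:
                doc.SaveAs(str(target), WD_FORMAT_XML_DOCUMENT)
            finally:
                doc.Close()
            converted.append(target)
    finally:
        word.Quit()  # close Word even if one of the files fails
    return converted
```

Starting Word once and reusing it for all files is what keeps a run over a hundred or so documents short. The try/finally blocks are there so that a single bad file does not leave a hidden Word process running in the background.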
Once the files are converted to .docx, processing them is smooth sailing, and there are plenty of solutions online, so I will not go into detail here. If you run into a problem, feel free to leave me a message.