OCR of files in various formats into Word Files

Last Update: 2018-12-04 Source: Internet

Author: User


Are you struggling to convert files in various formats into Word files? Most recognition software has its own defects and low recognition efficiency. Some tools recognize only text and cannot handle tables or graphics, and the layout after recognition is often too messy to use. This article summarizes text recognition in various situations to help you master the correct methods and save time. It offers a complete solution for recognizing tables, graphics, and text in all kinds of files:

1. PDF file recognition:

1) Files that can be recognized directly (PDF files saved in text format): Install Acrobat 7 Professional Edition, not Acrobat Reader (download: http://www.xdowns.com/soft/4/136/2006/Soft_29430.html). Either save the file directly as an RTF file (this converts the entire file), or click the Text Selection button on the toolbar, select the text area, and copy it into Word.

2) Files that cannot be recognized directly (PDF files saved as images): Install Office 2003 (download: http://www.xdowns.com/soft/188/215/2006/Soft_28356.html), including the Microsoft Office Document Imaging tool, which adds the Microsoft Office Document Image Writer printer to the printer list. Print the PDF file to that printer and select where to store the printed file. An MDI file is generated and opened automatically in Microsoft Office Document Imaging. Select "Recognize Text Using OCR" under the "Tools" menu. After recognition is complete, select "Send Text to Word" under "Tools". The entire PDF file is then recognized and output to a Word file.

Note: Microsoft Office Document Imaging accurately recognizes and converts Chinese text, English text, and tables, but it cannot output images to Word. Instead, all images in the file are split into separate image files and placed in a folder with the same name at the same location. You can open these images with SnagIt and copy them into Word. (Other recognition software cannot handle images at all; Microsoft Office Document Imaging handles them well.)

3) Encrypted PDF files: Download the decryption software first (download: http://www.xdowns.com/soft/4/85/2006/Soft_29750.html). After decryption, follow 1) or 2).
4) Traditional Chinese PDF files: After recognizing the file into Word as in 2), use "Tools" > "Language" > "Simplified Chinese conversion" in Word.
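
Choosing between routes 1), 2), and 3) above depends on whether the PDF is encrypted and whether it carries real text or only page images. The following is a rough sketch of that decision, not part of the original workflow. The function name `classify_pdf` is hypothetical, and the check is a plain byte search for the standard PDF keys `/Encrypt` and `/Font`, which is only a heuristic: in PDFs that use compressed object streams these keys may not be visible, so a text PDF can be misreported as image-only.

```python
def classify_pdf(data: bytes) -> str:
    """Guess which route from section 1 applies, given a PDF's raw bytes.

    Heuristic only (an assumption of this sketch): objects stored in
    compressed object streams are invisible to a plain byte search.
    """
    if not data.startswith(b"%PDF-"):
        return "not a PDF"
    # An encrypted PDF references an encryption dictionary via /Encrypt.
    if b"/Encrypt" in data:
        return "encrypted: decrypt first, then re-check"
    # Pages with selectable text reference at least one font resource.
    if b"/Font" in data:
        return "text: save as RTF or copy the text in Acrobat"
    return "image-only: print to the Document Image Writer and run OCR"
```

To try it on a real file, read the file in binary mode and pass the bytes, for example `classify_pdf(open(path, "rb").read())`, where `path` is whatever PDF you want to check.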

2. CAJ file recognition:

1) Partial text recognition: Use the OCR function of the CAJ browser directly (http://www.xdowns.com/soft/4/136/2006/Soft_29737.html).
2) Full-file recognition: Print to the Microsoft Office Document Image Writer printer, then proceed as in step 2) of the PDF section.
3) Downloading the full text of a doctoral or master's thesis: Read the thesis online. Once you can see the last page, do not close the CAJ browser. Find the large file in the cache folder under the CAJ installation directory and copy it to another location. Then convert it to Word as in 2).

3. Superstar file recognition:

1) Partial text recognition: Use the OCR function of the Superstar browser directly (http://www.xdowns.com/soft/31/91/2006/Soft_27810.html).
2) Full-file recognition: Print to the Microsoft Office Document Image Writer printer, then proceed as in step 2) of the PDF section. Note that Superstar printing works somewhat differently. Because Superstar keeps the table of contents separate from the body text, you need to print and recognize the table of contents and the body into Word separately and then merge them. When printing, enter the page range from 1 to the last page; do not select "print all". In the print options, set the page scale to actual size rather than fit to width. Note: Recognition is much slower than for other formats, so please be patient. In the end you get a Word version of the entire book with little effort. In my test, a 280-page book took several minutes to recognize.

3) Superstar files are relatively troublesome. If problems remain, print the Superstar file to a complete PDF file and then convert it to Word using the methods in section 1.

4. Recognition in other situations:

Use SnagIt (download: http://www.xdowns.com/soft/31/46/2006/Soft_29690.html) to convert text in any form into an image. For example, use SnagIt to capture the screen as an image, right-click the image file, and open it with Microsoft Office Document Imaging. The remaining steps are the same as in step 2) of the PDF section.

Note: Do not use other recognition software. Such tools recognize only Chinese characters, only English letters, only entire files, or only screen captures; or their error rate is very high; or they cannot recognize tables; or they require registration; or they are very slow; or they are inconvenient to use (not tightly integrated with Word). These include Ziguang OCR, Wanfang pdfocr, Shang Shu, Han Wang, ScanSoft PDF Converter, and ipv2word. I installed these and various other recommended programs and deleted them all as junk. With Acrobat Professional Edition, SnagIt, and Office 2003 installed, you can handle every case described here. Most importantly, these three programs work very well.

Supplement to some questions:
Experiments show that Microsoft Office Document Imaging is somewhat unstable. For example, when printing from CAJ to the Microsoft Office Document Image Writer printer, caj5.5 is faster (caj5.5 cannot be updated), while caj5.0 sometimes appears to hang.
When the page is displayed, the conversion recognition rate is high.
If a file has many pages (this includes Superstar files) and a problem occurs, convert the file in several parts.

Supplement:
1. Printing to the Microsoft Office Document Image Writer is slow and the resulting virtual file is large (about 60 MB for a 200-page book), so it seriously affects the machine's running speed, C drive space, and memory. We recommend converting no more than 200 pages at a time on a well-configured machine and no more than 100 pages at a time on a poorly configured one. During printing, a printer icon appears in the system tray in the lower right corner of the screen. Double-click it to view the progress of the print job, so that you do not mistake a slow job for a crash. After the conversion is complete, delete the virtual print files in the C:/Windows/TEMP directory. Otherwise, your C drive will soon fill up.
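
The page limits above can be turned into print ranges ahead of time. The sketch below is only an illustration: the function names `page_batches` and `estimated_temp_mb` are hypothetical, and the size estimate simply scales the article's single data point (about 60 MB for 200 pages) linearly, which is an assumption rather than a measured rule.

```python
from typing import List, Tuple

# The article's only data point: a 200-page book yields about 60 MB.
REF_PAGES = 200
REF_MB = 60


def page_batches(total_pages: int, max_pages: int = 200) -> List[Tuple[int, int]]:
    """Split pages 1..total_pages into inclusive (first, last) print ranges."""
    if total_pages < 0 or max_pages < 1:
        raise ValueError("total_pages must be >= 0 and max_pages >= 1")
    return [
        (first, min(first + max_pages - 1, total_pages))
        for first in range(1, total_pages + 1, max_pages)
    ]


def estimated_temp_mb(pages: int) -> float:
    """Linear estimate of the virtual print file size (assumption)."""
    return pages * REF_MB / REF_PAGES
```

For a 280-page book on a slower machine, `page_batches(280, 100)` gives `[(1, 100), (101, 200), (201, 280)]`, and each range can be entered in the print dialog in turn.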

2. If printing to the Microsoft Office Document Image Writer is slow or appears to hang, we recommend printing the file to the SnagIt virtual printer first. This generates a TIFF file automatically and is faster than printing to the Microsoft Office Document Image Writer. Then, in SnagIt, select the Microsoft Office Document Image Writer printer as the printer under SnagIt > Output (equivalent to printing to the Microsoft Office Document Image Writer printer), and choose SnagIt > File > Finish Output to generate the MDI file. After the conversion, delete the temporary files in C:/Windows/systems32/snagit.
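
Deleting the leftover print files by hand is easy to forget, so the cleanup can be scripted. The following is a cautious sketch rather than a tested tool. The function names `stale_files` and `clean` are hypothetical, and the file suffixes and the one-hour age threshold are assumptions: the article names the folders but not the file names, so check what the folder actually contains before deleting anything. By default nothing is deleted and only the reclaimable size is reported.

```python
import os
import time


def stale_files(folder, suffixes=(".mdi", ".tif", ".tiff", ".tmp"),
                min_age_s=3600, now=None):
    """List (path, size) of matching files not modified for min_age_s seconds.

    The suffixes are guesses at what the virtual printers leave behind.
    """
    now = time.time() if now is None else now
    found = []
    for entry in os.scandir(folder):
        if not entry.is_file():
            continue
        if not entry.name.lower().endswith(suffixes):
            continue
        info = entry.stat()
        if now - info.st_mtime >= min_age_s:
            found.append((entry.path, info.st_size))
    return sorted(found)


def clean(folder, delete=False, **kwargs):
    """Return the total bytes of stale files; remove them only if delete=True."""
    files = stale_files(folder, **kwargs)
    if delete:
        for path, _size in files:
            os.remove(path)
    return sum(size for _path, size in files)
```

Run `clean(folder)` first with the temporary folder from the steps above to see how much space would be freed, and pass `delete=True` only once the list from `stale_files` looks right.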

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
