Capturing text from JPG and PDF files

Source: Internet
Author: User

Simple extraction of text from PDF files
There are many tools for capturing text from PDF files. This article introduces a simple and inexpensive one.
Open the PDF file and choose Print. If Microsoft Office 2003 is installed, select Microsoft Office Document Image Writer as the printer; the document is printed to an image file with the .mdi extension. Open that file and choose "Send Text to Word" from the Tools menu. The image is processed automatically and the recognized text is sent to Word.


Methods for converting files in various formats into Word files

Are you still struggling to convert files in different formats into Word files? Most recognition software has its own defects and low recognition efficiency. Some tools recognize only text and cannot handle tables or graphics, and the layout after recognition is often too messy to use. This article summarizes text recognition for the common situations, covering tables, graphics, and text, to help you choose the right method and save time:

1. Recognizing PDF files:

1) Files that can be read directly (PDFs saved as text): install Acrobat 5 Professional (not Acrobat Reader) and save the file as RTF to convert the whole document. Alternatively, click the Text Selection button on the toolbar, select the text area, and copy it into Word.

2) Files that cannot be read directly (PDFs saved as images): install Office 2003 together with Microsoft Office Document Imaging. This adds the Microsoft Office Document Image Writer printer to the printer list. Print the PDF to that printer and choose where to save the output; an MDI file is created and opens automatically in Microsoft Office Document Imaging. Choose "Recognize Text Using OCR" from the Tools menu, and when recognition finishes, choose "Send Text to Word". The whole PDF is then recognized and output to a Word file.

Note: Microsoft Office Document Imaging recognizes Chinese text, English text, and tables accurately, but it does not place images into the Word file. Instead, the images in the document are saved as separate image files in a folder with the same name, in the same location. You can open them in SnagIt and copy them into Word. (Other recognition software does not handle images at all; Microsoft Office Document Imaging handles this well.)

3) Encrypted PDF files: decrypt the file with decryption software first, then follow 1) or 2).
4) PDF files in Traditional Chinese: after converting to Word with 2), use Tools > Language > "Simplified Chinese conversion" in Word.
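
In practice Word's built-in conversion does this step. Purely to illustrate what the Traditional-to-Simplified conversion amounts to, here is a minimal Python sketch. The mapping table is a tiny hand-written sample, not a complete dictionary, and the function name `to_simplified` is hypothetical.

```python
# Tiny illustrative sample only. A real converter needs thousands of entries
# plus phrase-level rules, which Word's built-in tool already provides.
TRAD_TO_SIMP = {
    "檔": "档",
    "識": "识",
    "別": "别",
    "轉": "转",
    "換": "换",
    "體": "体",
}

# str.translate works on a table keyed by code point.
_TABLE = str.maketrans(TRAD_TO_SIMP)


def to_simplified(text: str) -> str:
    """Replace every character that appears in the sample table."""
    return text.translate(_TABLE)


print(to_simplified("文檔識別"))
```

Characters that are not in the table pass through unchanged, so the sketch is safe to run on mixed Chinese and English text.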

2. Converting text in JPG images to Word
In daily work we sometimes need to edit or re-typeset a scanned table or article. Scans are saved as images, and generally only high-end scanners can scan directly into Word format. With an ordinary scanner, the following method still lets you edit the scanned document.
First, install Office 2003 and open the Microsoft Office Document Scanning tool in Office. Once it is set up, right-click the image you want to edit and choose Print. When asked to select a printer, choose Microsoft Office Document Image Writer; the image is printed to a file with the .mdi extension. Open that file and choose "Send Text to Word" from the menu. The process is simple and convenient.

 

3. Recognizing CAJ files:

1) Partial text recognition: use the OCR function of the CAJ browser directly.
2) Whole-file recognition: print to the Microsoft Office Document Image Writer printer, then proceed as in 1. 2) above.
3) Downloading the full text of a doctoral or master's thesis (boshuo paper): read the thesis online in the CAJ browser. Once you have viewed the last page, leave the CAJ browser open, find the large file in the cache folder under the CAJ installation directory, and copy it to another location. Then convert it to Word using 2).

4. Recognizing Superstar files:

1) Partial text recognition: use the OCR function of the Superstar browser directly.
2) Whole-file recognition: print to the Microsoft Office Document Image Writer printer, then proceed as in 1. 2) above. Printing from Superstar differs in a few ways. Because Superstar stores the table of contents separately from the body text, print and recognize the contents and the body separately, then merge them in Word. When printing, enter the page range from 1 to the last page; do not select Print All. In the print options, set the page scale to actual size rather than fit to width. Note: recognition is much slower than for other formats, so be patient; it is worth the wait when the Word version of an entire book comes out. In my test, a 280-page book took several minutes to recognize.

3) Superstar files are relatively troublesome. If problems remain, print the Superstar file to a complete PDF and then convert it to Word using method 1.

5. Recognition in other situations:

Use SnagIt to turn text in any form into an image. For example, capture the screen as an image with SnagIt, right-click the image file, and open it with Microsoft Office Document Imaging. The remaining steps are the same as above.

Note: I do not recommend other recognition software. Depending on the product, it recognizes only Chinese characters or only English letters, works only on whole files or only on screen captures, has a high error rate, cannot recognize tables, requires registration, is slow, or is inconvenient because it is not tightly integrated with Word. The tools I tried include Ziguang OCR, Wanfang pdfocr, shang Shu, Han Wang, scansoft PDF converter, ipv2word, and various other recommended programs. I installed them all and then uninstalled them. With Acrobat Professional, SnagIt, and Office 2003 installed, you can handle every case above, and these three programs work well.
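
The five cases above all reduce to a small decision: export directly when the PDF already contains text, otherwise get the content into Microsoft Office Document Imaging and run OCR. The following Python sketch summarizes that decision as a lookup. It does not drive any of the programs; the function name `conversion_steps`, the `kind` labels, and the step wording are all hypothetical and only paraphrase the article.

```python
# Steps shared by every route that ends in OCR (wording paraphrases the article).
PRINT_STEP = "print to Microsoft Office Document Image Writer (creates an MDI file)"
OCR_STEPS = [
    "run OCR from the Tools menu of Microsoft Office Document Imaging",
    "Send Text to Word",
]

# Hypothetical labels for the five cases in the article.
KINDS = {"pdf", "image", "caj", "superstar", "other"}


def conversion_steps(kind, has_text_layer=False, encrypted=False):
    """Return the ordered steps for one input file.

    has_text_layer and encrypted are only meaningful for PDFs.
    """
    kind = kind.lower()
    if kind not in KINDS:
        raise ValueError(f"unknown kind: {kind!r}")

    steps = []
    if kind == "pdf" and encrypted:
        steps.append("decrypt the PDF first")
    if kind == "pdf" and has_text_layer:
        # Text PDFs need no OCR: Acrobat Professional can export them.
        steps.append("save as RTF from Acrobat Professional")
        return steps
    if kind == "superstar":
        steps.append("set page range 1 to last page and scale to actual size")
    if kind == "other":
        # Anything else is captured as an image and opened directly.
        steps.append("capture the content as an image with SnagIt")
        steps.append("open the image in Microsoft Office Document Imaging")
    else:
        steps.append(PRINT_STEP)
    return steps + OCR_STEPS


for step in conversion_steps("pdf", encrypted=True):
    print("-", step)
```

Keeping the shared OCR steps in one list mirrors the article's structure, where sections 2 to 5 all refer back to the image-PDF procedure.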

Supplement on some questions:
Experiments show that Microsoft Office Document Imaging can be unstable. For example, when printing from CAJ to the Microsoft Office Document Image Writer printer, caj5.5 is faster (caj5.5 cannot be updated), while caj5.0 sometimes appears to hang.
The recognition rate is higher when the page is displayed during conversion.
If a file has many pages (this includes Superstar files) and a problem occurs, convert it in several passes.
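
Converting in several passes means choosing page ranges to type into the print dialog. As a small aid, the sketch below splits a document into inclusive page ranges. The function name `page_batches` is hypothetical, and the batch size is a parameter you pick for your machine; the 280-page example reuses the book size mentioned in the Superstar test above.

```python
def page_batches(total_pages, batch_size=200):
    """Split pages 1..total_pages into inclusive (first, last) ranges."""
    if total_pages < 1 or batch_size < 1:
        raise ValueError("total_pages and batch_size must be positive")
    return [
        (start, min(start + batch_size - 1, total_pages))
        for start in range(1, total_pages + 1, batch_size)
    ]


# A 280-page book printed 100 pages at a time.
print(page_batches(280, 100))  # [(1, 100), (101, 200), (201, 280)]
```

Each tuple is one print job; recognize each resulting file separately and merge the text in Word afterwards.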

Supplement:
1. Printing to Microsoft Office Document Image Writer is slow and the resulting virtual file is large: a 200-page book produces about 60 MB. This seriously affects the machine's speed, the free space on drive C, and memory. We recommend converting no more than 200 pages at a time on a well-configured machine, and no more than 100 pages on a weaker one. During printing, a printer icon appears in the system tray in the lower-right corner of the screen; double-click it to check the progress of the print job, so that you do not mistake a slow job for a crash. After the conversion is finished, delete the virtual print files in the C:/Windows/TEMP directory; otherwise drive C will soon fill up.

2. If printing to Microsoft Office Document Image Writer is slow or appears to hang, print the file to the SnagIt virtual printer first. It generates a TIFF file automatically and is faster than Microsoft Office Document Image Writer. Then, in SnagIt, select the Microsoft Office Document Image Writer printer as the printer under SnagIt > Output (equivalent to printing to that printer directly), and choose SnagIt > File > Finish Output to generate the MDI file.

3. caj5.5 cannot download doctoral or master's theses (boshuo papers) or open ones that were already downloaded. Use caj5.0 to download them.
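
The disk-space advice in item 1 can be checked before and after a conversion. The Python sketch below estimates the size of the virtual print file from the article's figure (about 60 MB per 200 pages), checks free space, and lists files left in a temp directory since a given time. All function names are hypothetical, the `safety_factor` is an assumption, and the article does not say what the leftover files are called, so the sketch selects by modification time only. Review the list before deleting anything.

```python
import shutil
from pathlib import Path


def estimated_spool_mb(pages):
    """Rough size of the virtual print file: about 60 MB per 200 pages."""
    return pages * 60 / 200


def has_room(drive, pages, safety_factor=2.0):
    """True if `drive` has at least safety_factor times the estimated size free."""
    free_mb = shutil.disk_usage(drive).free / (1024 * 1024)
    return free_mb >= estimated_spool_mb(pages) * safety_factor


def files_modified_since(directory, since_epoch):
    """List regular files directly in `directory` changed at or after since_epoch."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.stat().st_mtime >= since_epoch
    )


def remove_files(paths, dry_run=True):
    """Delete the given files; with dry_run=True only report what would go."""
    handled = []
    for path in paths:
        if not dry_run:
            path.unlink()
        handled.append(path)
    return handled


print(estimated_spool_mb(280))  # 84.0
```

A typical use is to record `time.time()` before printing, then call `files_modified_since("C:/Windows/TEMP", start)` afterwards and pass the reviewed list to `remove_files(..., dry_run=False)`.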

 
