Ma Jian
Published: 2012.07.02
Updated: 2012.07.09 (added content for non-Simplified Chinese versions)
Since the release of the MODI-based DjVuToy, FreePic2Pdf, and Pdg2Pic, many people have asked the same question: can MODI-based software perform OCR normally without installing Office 2003/2007 or SharePoint Designer 2007? After all, for Simplified Chinese, installing only MODI from SharePoint Designer 2007 takes nearly 650 MB, and MODI from Office 2007 is even more extreme at nearly 1 GB.
To answer this, we need to start with where the OCR module in MODI comes from. As the article "Using MODI to OCR 21 languages" already explained, MODI merely wraps the ScanSoft API. Search Google (note that I said Google, not Baidu) for the keyword "ScanSoft API" and you will find two pages:
http://www.corrupteddatarecovery.com/Products/ScanSoft-API.asp
http://www.corrupteddatarecovery.com/Products/Asian-OCR-for-ScanSoft-API.asp
From these two pages and some other related information, we can infer that the ScanSoft API wraps Tsinghua Wentong TH-OCR to support Asian languages, and that the MODI interface in turn wraps the ScanSoft API. The MODI application (MSPVIEW.EXE) is built on top of that, so the whole hierarchy should look like this:
━━━━━━━━━━━━━━                   MSPVIEW.EXE
Application layer                     ┃
━━━━━━━━━━━━━━                  MODI interface
Interface layer                       ┃
                                 ScanSoft API
                          ┏━━━━━━━━━━━┻━━━━━━━━━━━┓
                          ┃      ScanSoft API Asian language support
                          ┃           (Tsinghua Wentong TH-OCR)
━━━━━━━━━━━━━━            ┃                       ┃
Data layer      Language files for 11      Asian language files
                Western European and 3     (Simplified Chinese,
                Eastern European           Traditional Chinese,
                countries, plus Russian,   Japanese, Korean)
                Greek, and Turkish
Judging from the diagram, if you only want to call the MODI interface, you do not need the application layer. That leaves two options:
- Call the ScanSoft API directly. This is harder; at least, I have not found any documentation for it so far.
- Call the MODI interface, whose documentation is at least public.
So the question above becomes: can we extract the smallest set of files that supports the MODI interface and still provides the OCR functions?
As a senior technician, I should not answer this with hand-waving; the answer has to come from experiments:
1. Build a clean XP SP3 virtual machine, then make a copy of it. Call them "virtual machine A" and "virtual machine B".
2. Install InstallRite 2.5 in virtual machine A and use it to monitor the installation of MODI from SharePoint Designer 2007.
3. When the installation finishes, use InstallRite to export the registry entries added by the installation, and name the file Aaa.reg.
4. Copy these two folders from virtual machine A:
C:\Program Files\Common Files\Microsoft Shared\MODI
C:\Program Files\Common Files\Microsoft Shared\OFFICE12
to virtual machine B, and import Aaa.reg into virtual machine B. After all, the MODI interface is a COM interface, and there is no such thing as a COM interface that is independent of the registry.
5. Run DjVuToy, FreePic2Pdf, or Pdg2Pic in virtual machine B to verify that third-party software can OCR normally there. However, every page that is OCRed triggers the automatic installation of some file, which suggests that the imported registry entries are mixed with garbage.
The feasibility experiment above clearly shows that:
1. Without installing Office or SharePoint Designer, you can provide OCR support to third-party software by copying the relevant files and registry keys.
2. For Simplified Chinese, installing MODI from Office 2007 takes about 1 GB of disk space and MODI from SharePoint Designer 2007 about 650 MB, whereas the two folders above add up to approximately … MB. There is still water to squeeze out of them, too, so the space savings are considerable and the deal is worth doing.
With the dull but essential theory out of the way, on to practice: which files and registry keys are actually necessary?
First, the files. The two URLs above already list the files required by the ScanSoft API part of the interface layer; what remains is to work out the files for the MODI interface part.
The initialization code with which third-party software calls the MODI interface is:
// Create the MODI.Document COM object by its ProgID (MFC dispatch wrapper)
IDocument Doc;
Doc.CreateDispatch(_T("MODI.Document"));
Search the registry for the string "MODI.Document" and you will find that the DLL behind this COM object is MDIVWCTL.DLL in the MODI installation folder. The VC++ Debug window then shows that, once this DLL is loaded, MSPGIMME.DLL and MSPCORE.DLL in the same folder and MSO.DLL in the OFFICE12 folder are loaded as well. Judging by their file properties, all of these files are made by Microsoft, so they can be regarded as the MODI interface part of the interface layer.
The VC++ Debug window output also shows that, besides the DLLs above, the OCR process loads XOCR3.PSP, THOCR.PSP, and XFILE.PSP from the MODI installation folder. Although their extension is .PSP, these three files are actually DLLs. Their file properties place them within the ScanSoft API, so they can be seen as a supplement to the file lists in the two URLs above.
The VC++ Debug window also recorded OGL.DLL, MSORES.DLL, and 2052\MSOINTL.DLL from the OFFICE12 folder being loaded during OCR, but later regression tests proved that, once the registry keys had been trimmed, these files did not matter.
Combining the analysis above with the information in "Using MODI to OCR 21 languages", the minimum set of files needed to OCR Simplified Chinese and English is shown in the table below; together they come to about three MB. In the "Description" column, the English text is copied from the file properties of the DLLs and the other notes were added by me. The data files were captured with a file monitor.
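Level | Part | Filename | Description
Interface layer | MODI | MDIVWCTL.DLL | Microsoft Office Document Imaging Viewer Control
Interface layer | MODI | MSPCORE.DLL | Microsoft® Office Document Imaging Object Library
Interface layer | MODI | MSPGIMME.DLL | Microsoft® Gimme Library
Interface layer | MODI | OFFICE12\MSO.DLL | Microsoft Office Component
Interface layer | ScanSoft API | BINDER.DLL | XDoc Binder module for the ScanSoft SDK
Interface layer | ScanSoft API | PSOM.DLL | Component Management Module for PerfectScan API
Interface layer | ScanSoft API | XIMAGE3B.DLL | Image Processing Module for the ScanSoft SDK
Interface layer | ScanSoft API | XPAGE3C.DLL | Page Management Module for ScanSoft SDK
Interface layer | ScanSoft API | XOCR3.PSP | OCR Module for ScanSoft SDK
Interface layer | ScanSoft API | XFILE.PSP | Asian OCR Module for ScanSoft SDK
Interface layer | ScanSoft API | THOCR.PSP | Asian OCR Module for ScanSoft SDK
Interface layer | ScanSoft API Asian language | FORM.DLL | Table Recognition for Asian OCR
Interface layer | ScanSoft API Asian language | REVERSE.DLL | Reverse Video Detection for Asian OCR
Interface layer | ScanSoft API Asian language | THOCRAPI.DLL | Asian OCR API
Interface layer | ScanSoft API Asian language | TWCUTCHR.DLL | Character Segmentation for Asian OCR
Interface layer | ScanSoft API Asian language | TWCUTLIN.DLL | Line Segmentation for Asian OCR
Interface layer | ScanSoft API Asian language | TWLAY32.DLL | Layout Analysis for Asian OCR
Interface layer | ScanSoft API Asian language | TWORIENT.DLL | Orientataion Detection for Asian OCR
Interface layer | ScanSoft API Asian language | TWRECC.DLL | Chinese Recognition for Asian OCR
Interface layer | ScanSoft API Asian language | TWRECE.DLL | English Recognition for Asian OCR
Interface layer | ScanSoft API Asian language | TWRECS.DLL | Punctuation Recognition for Asian OCR
Interface layer | ScanSoft API Asian language | TWSTRUCT.DLL | Document Structure Processing for Asian OCR
Data layer | English | LATIN1.SHP | Common feature library for 11 Western European countries (including English)
Data layer | English | CHARSETTABLE.CHR | Character encoding conversion table, text file
Data layer | English | ENGLISH.LNG | English language file
Data layer | Chinese Simplified | ENGDIC.DAT | Tsinghua Wentong's English dictionary file; it seems it also supports mixed Chinese and English
Data layer | Chinese Simplified | ENGIDX.DAT | Tsinghua Wentong's English index
Data layer | Chinese Simplified | JFONT.DAT |
Data layer | Chinese Simplified | LOOKUP.DAT |
Data layer | Chinese Simplified | OCRHC.DAT |
Data layer | Chinese Simplified | OCRVC.DAT |
Data layer | Chinese Simplified | TWGB32.DLL | Simplified Chinese Code Conversion
Data layer | Chinese Simplified | SCCODE.UNI |
Data layer | Chinese Simplified | SCPRINT.DAT |
Data layer | Chinese Simplified | SCPRINT2.DAT |
Data layer | Chinese Simplified | SCSERHT.DAT |
Data layer | Chinese Simplified | SCTREE.DAT |
Data layer | Chinese Simplified | TW_GU.DAT |
Data layer | Chinese Simplified | TW_UG.DAT |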
If you also want to OCR other languages, refer to "Using MODI to OCR 21 languages" and add the files for those languages.
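A trimmed folder is easy to get wrong by one file, so it can help to compare it against the list mechanically. The following Python sketch does that for the files that live in the MODI folder; the function name `missing_files` and the single flat folder are assumptions made for illustration, and MSO.DLL, which sits in the OFFICE12 folder, would need a separate check.

```python
import os

# File names from the table above (MODI folder only; MSO.DLL sits in OFFICE12).
REQUIRED_MODI_FILES = [
    "MDIVWCTL.DLL", "MSPCORE.DLL", "MSPGIMME.DLL",
    "BINDER.DLL", "PSOM.DLL", "XIMAGE3B.DLL", "XPAGE3C.DLL",
    "XOCR3.PSP", "XFILE.PSP", "THOCR.PSP",
    "FORM.DLL", "REVERSE.DLL", "THOCRAPI.DLL", "TWCUTCHR.DLL",
    "TWCUTLIN.DLL", "TWLAY32.DLL", "TWORIENT.DLL", "TWRECC.DLL",
    "TWRECE.DLL", "TWRECS.DLL", "TWSTRUCT.DLL",
    "LATIN1.SHP", "CHARSETTABLE.CHR", "ENGLISH.LNG",
    "ENGDIC.DAT", "ENGIDX.DAT", "JFONT.DAT", "LOOKUP.DAT",
    "OCRHC.DAT", "OCRVC.DAT", "TWGB32.DLL", "SCCODE.UNI",
    "SCPRINT.DAT", "SCPRINT2.DAT", "SCSERHT.DAT", "SCTREE.DAT",
    "TW_GU.DAT", "TW_UG.DAT",
]


def missing_files(folder, required=REQUIRED_MODI_FILES):
    """Return the required names that are absent from folder.

    Names are compared case-insensitively, as Windows file names are.
    """
    present = {name.upper() for name in os.listdir(folder)}
    return [name for name in required if name.upper() not in present]
```

Calling `missing_files(path_to_modi_folder)` returns an empty list when every file in the table is present.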
Also, a new name appears in the table above, in the description of PSOM.DLL: PerfectScan API. A quick Google search turns up its official website, http://perfectscan.com/, whose introduction says it does image processing:
PerfectScan is an image processing program that automatically analyzes an image and then makes adjustments to that image, rendering a black and white image that contains all viewable data as if it were a gray scale scan, only at 15% of the size of a gray scale image.
So MODI really is a hodgepodge. But one sentence on the PerfectScan website conveys how forlorn a programmer's life can be:
We built the software after ten years of hard work, and now all we have to do is build the website ... oops!!
Anyway, with the files sorted out, the registry keys come next. The MODI-related registry keys fall into two parts: COM-related and Office-related.
The COM-related keys are those of the MODI COM components, and they can be created directly with regsvr32: open a command prompt, change to the MODI installation folder, and run the following commands:
regsvr32 MDIVWCTL.DLL
regsvr32 MSPCORE.DLL
This completes the registration of the MODI COM components.
The Office-related registry keys are not so easy. I tried monitoring the OCR process with a registry monitor and found the truth drowned in an ocean of detail. In the end I had to resort to a brute-force method: I wrote a dedicated set of test software and ran it in virtual machine B from the feasibility experiment, deleting the registry entries imported from Aaa.reg one at a time and checking after each deletion whether OCR was affected. The result was that about 20 registry entries are essential, and nearly half of them duplicate the entries that regsvr32 inserted automatically when the COM components were registered earlier.
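The delete-and-retest procedure described above can be sketched in a few lines of Python. This is not the author's test software, only an illustration of the idea: `ocr_works` is a hypothetical callback that, in the real experiment, would apply the given entries to the virtual machine's registry and run an OCR check. The sketch assumes the entries are unique.

```python
def minimize_entries(entries, ocr_works):
    """Try deleting each entry in turn; keep the deletion if OCR still works."""
    kept = list(entries)
    for entry in list(entries):
        trial = [e for e in kept if e != entry]
        if ocr_works(trial):
            kept = trial  # OCR survived without this entry, so drop it
    return kept


# Toy stand-in for the real check: OCR "works" while entries a and c exist.
needed = {"a", "c"}
print(minimize_entries(["a", "b", "c", "d"], lambda es: needed <= set(es)))
# prints ['a', 'c']
```

A single pass like this costs one OCR check per registry entry, which is slow but needs no understanding of what each entry means.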
After some final manual tidying, I confirmed that, on top of the files in the table above and the COM registration, adding the following registry keys lets third-party software OCR Simplified Chinese and English normally in a Simplified Chinese environment:
[HKEY_CLASSES_ROOT\Installer\Components\61BA386016BD0C340BBEAC273D84FD5F]
"2052"=hex(7):76,00,55,00,70,00,41,00,56,00,53,00,2e,00,7d,00,58,00,25,00,21,\
00,21,00,21,00,21,00,21,00,4d,00,4b,00,4b,00,53,00,6b,00,4f,00,43,00,52,00,\
5f,00,32,00,30,00,35,00,32,00,3c,00,00,00,00,00
[HKEY_CLASSES_ROOT\Installer\Features\00002109F10040800000000000F01FEC]
"OCR_2052"=""
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Installer\UserData\S-1-5-18\Products\00002109F10040800000000000F01FEC\Features]
"OCR_2052"="%mEMae,7q9*DXdU@EPi="
[HKEY_CLASSES_ROOT\Installer\Products\00002109710000000000000000F01FEC]
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Installer\UserData\S-1-5-18\Components\3F745FF6A76FF2F4797DB74FC7B3FD8B]
"00002109710000000000000000F01FEC"="C:\\Program Files\\Common Files\\Microsoft Shared\\modi\\12.0\\xpage3c.DLL"
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Installer\UserData\S-1-5-18\Components\4080B9FA1A0BBF34FB7813E87159FC64]
"00002109F10040800000000000F01FEC"="C:\\Program Files\\Common Files\\Microsoft Shared\\modi\\12.0\\sccode.UNI"
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Installer\UserData\S-1-5-18\Components\48AD0082D02B3D24C9A56FA50728CCAB]
"00002109710000000000000000F01FEC"="C:\\Program Files\\Common Files\\Microsoft Shared\\modi\\12.0\\mspcore.DLL"
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Installer\UserData\S-1-5-18\Components\D94C8360B8BB1DC41B1950E0F8237563]
"00002109710000000000000000F01FEC"="C:\\Program Files\\Common Files\\Microsoft Shared\\office12\\mso.DLL"
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Installer\UserData\S-1-5-18\Products\00002109710000000000000000F01FEC\InstallProperties]
"WindowsInstaller"=dword:00000001
In the first three language-related entries, the language code 2052 stands for Simplified Chinese, but as long as no files are missing, OCR of Chinese, English, or other languages all works. However, explaining to users in other countries why they must use the Simplified Chinese language code is a bit of a chore, so other language codes still need to be tried.
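The hex(7) value in the first key is less opaque than it looks: hex(7) is the .reg notation for a multi-string stored as UTF-16LE text. The short Python sketch below decodes the data copied from the "2052" value; the function name `decode_hex7` is made up for this example, and no attempt is made to interpret the leading characters.

```python
def decode_hex7(reg_data):
    """Decode the data part of a .reg hex(7) value (multi-string, UTF-16LE)."""
    # Drop line-continuation backslashes and all whitespace.
    cleaned = "".join(reg_data.replace("\\", "").split())
    raw = bytes(int(b, 16) for b in cleaned.split(",") if b)
    # Strings are separated by NUL characters; discard the empty terminators.
    return [s for s in raw.decode("utf-16-le").split("\x00") if s]


DATA_2052 = r"""76,00,55,00,70,00,41,00,56,00,53,00,2e,00,7d,00,58,00,25,00,21,\
00,21,00,21,00,21,00,21,00,4d,00,4b,00,4b,00,53,00,6b,00,4f,00,43,00,52,00,\
5f,00,32,00,30,00,35,00,32,00,3c,00,00,00,00,00"""

print(decode_hex7(DATA_2052))
# prints ['vUpAVS.}X%!!!!!MKKSkOCR_2052<']
```

The decoded text ends with OCR_2052, the same name used by the value in the two Features keys.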
Restore the two virtual machines and repeat the steps of the feasibility experiment: in virtual machine A, monitor the installation of the English version of SharePoint Designer 2007 (which supports English, French, and Spanish once installed), export the resulting registry entries and files to virtual machine B, and check them with the same test software. Surely what comes out are the registry keys for the English language code 1033? Wrong, badly wrong. The registry keys that cannot be deleted are the ones for French (language code 1036):
[HKEY_CLASSES_ROOT\Installer\Components\61BA386016BD0C340BBEAC273D84FD5F]
"1036"=hex(7):76,00,55,00,70,00,41,00,56,00,57,00,3f,00,57,00,41,00,24,00,21,\
00,21,00,21,00,21,00,21,00,4d,00,4b,00,4b,00,53,00,6b,00,4f,00,43,00,52,00,\
5f,00,31,00,30,00,33,00,36,00,3c,00,00,00,00,00
[HKEY_CLASSES_ROOT\Installer\Features\00002109F100C0400000000000F01FEC]
"OCR_1036"=""
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Installer\UserData\S-1-5-18\Products\00002109F100C0400000000000F01FEC\Features]
"OCR_1036"=")aEMae,7q9*DXdU@EPi="
[HKEY_CLASSES_ROOT\Installer\Products\00002109710000000000000000F01FEC]
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Installer\UserData\S-1-5-18\Components\3F745FF6A76FF2F4797DB74FC7B3FD8B]
"00002109710000000000000000F01FEC"="C:\\Program Files\\Common Files\\Microsoft Shared\\modi\\12.0\\xpage3c.DLL"
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Installer\UserData\S-1-5-18\Components\48AD0082D02B3D24C9A56FA50728CCAB]
"00002109710000000000000000F01FEC"="C:\\Program Files\\Common Files\\Microsoft Shared\\modi\\12.0\\mspcore.DLL"
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Installer\UserData\S-1-5-18\Components\C040B9FA1A0BBF34FB7813E87159FC64]
"00002109F100C0400000000000F01FEC"="C:\\Program Files\\Common Files\\Microsoft Shared\\modi\\12.0\\french.LNG"
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Installer\UserData\S-1-5-18\Components\D94C8360B8BB1DC41B1950E0F8237563]
"00002109710000000000000000F01FEC"="C:\\Program Files\\Common Files\\Microsoft Shared\\office12\\mso.DLL"
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Installer\UserData\S-1-5-18\Products\00002109710000000000000000F01FEC\InstallProperties]
"WindowsInstaller"=dword:00000001
Comparing the two results shows that:
- Four of the registry entries are language-related; all the other registry keys are identical.
- The first three language-related entries are identified directly by the language code. The fourth is identified by a language file, which is SCCODE.UNI for Simplified Chinese and FRENCH.LNG for French.
- The language-related key names are not random. They are based on the product's base key (for SharePoint Designer, 00002109710000000000000000F01FEC) with the language code overlaid: Simplified Chinese is language code 2052, hexadecimal 0804, which appears in the key with its digits reversed as 4080; French is language code 1036, hexadecimal 040C, which appears as C040. I have only ever seen this one company dare to play with GUIDs like this; nobody else seems to have the nerve.
- Whichever language you use, only one language may be present, and it cannot be English (language code 1033).
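The overlay pattern in the third point can be written down as a small Python sketch. The fixed prefixes and suffixes are copied from the Simplified Chinese, French, and English keys shown in this article; whether other languages follow the same pattern is an assumption that has not been tested here, and the function names are made up for the example. Note that the language-specific keys carry F100 where the base key has 7100.

```python
def reversed_lcid(lcid):
    """Four hex digits of the language code, written back to front."""
    return format(lcid, "04X")[::-1]


def language_product_key(lcid):
    # Prefix and suffix as seen in the Features and Products keys above.
    return "00002109F100" + reversed_lcid(lcid) + "0000000000F01FEC"


def language_component_key(lcid):
    # Pattern of the Components key whose value points at the language file.
    return reversed_lcid(lcid) + "B9FA1A0BBF34FB7813E87159FC64"


for lcid in (2052, 1036, 1033):
    print(lcid, language_product_key(lcid), language_component_key(lcid))
```

For 2052 this yields 00002109F10040800000000000F01FEC and 4080B9FA1A0BBF34FB7813E87159FC64, matching the keys listed for Simplified Chinese.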
To verify the fourth point, I manually exported the English-related registry entries from virtual machine A and, in virtual machine B, replaced all the French-related entries with the English ones. The result was that OCR failed. The four English-related registry entries are:
[HKEY_CLASSES_ROOT\Installer\Components\61BA386016BD0C340BBEAC273D84FD5F]
"1033"=hex(7):76,00,55,00,70,00,41,00,56,00,54,00,28,00,38,00,41,00,24,00,21,\
00,21,00,21,00,21,00,21,00,4d,00,4b,00,4b,00,53,00,6b,00,4f,00,43,00,52,00,\
5f,00,31,00,30,00,33,00,33,00,3e,00,26,00,61,00,45,00,4d,00,61,00,65,00,2c,\
00,37,00,71,00,39,00,2a,00,44,00,58,00,64,00,55,00,40,00,45,00,50,00,69,00,\
3d,00,00,00,00,00
[HKEY_CLASSES_ROOT\Installer\Features\00002109F10090400000000000F01FEC]
"OCR_1033"=""
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Installer\UserData\S-1-5-18\Products\00002109F10090400000000000F01FEC\Features]
"OCR_1033"="&aEMae,7q9*DXdU@EPi=ofu['t.WO9zoh+X^{bhe"
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Installer\UserData\S-1-5-18\Components\9040B9FA1A0BBF34FB7813E87159FC64]
"00002109F10090400000000000F01FEC"="C:\\Program Files\\Common Files\\Microsoft Shared\\modi\\12.0\\english.LNG"
So users in China can simply use the Simplified Chinese registry keys. For users abroad, if explaining Simplified Chinese is too much trouble, use the registry keys for French or another language. But note: only one language at a time, and it cannot be English. If you add the English entries on top of another language's registry keys, there will be a problem: because the files and registry keys above have been trimmed down, the MODI component, on finding the English registry keys during OCR, tries to restore the trimmed files, which makes OCR very slow.
Of course, whichever language you choose, when importing on x64 Windows remember that the Program Files folder for 32-bit software has an extra suffix there and is called Program Files (x86). Also, in theory the Program Files folder need not be on drive C, so the safest approach is to obtain the actual folder name from the environment variables CommonProgramFiles and CommonProgramFiles(x86), or from SDK functions such as SHGetFolderPath.
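As a rough illustration of the environment-variable approach, the Python sketch below builds the quoted path string for a .reg value from the environment instead of hard-coding drive C. The helper names are invented for this example, and the "Microsoft Shared\MODI\12.0" layout is taken from the paths in the registry keys above.

```python
import os


def common_files_dir(env=os.environ):
    """Common Files folder used by 32-bit software, read from the environment."""
    # On x64 Windows the 32-bit folder is exposed as CommonProgramFiles(x86);
    # on 32-bit Windows only CommonProgramFiles exists.
    return env.get("CommonProgramFiles(x86)") or env["CommonProgramFiles"]


def reg_escape(path):
    """Double the backslashes, as string values in a .reg file require."""
    return path.replace("\\", "\\\\")


def modi_file_value(env, filename):
    """Return the quoted .reg string for a file in the MODI 12.0 folder."""
    path = "\\".join(
        [common_files_dir(env), "Microsoft Shared", "MODI", "12.0", filename]
    )
    return '"' + reg_escape(path) + '"'
```

On a real machine, `modi_file_value(os.environ, "MSPCORE.DLL")` gives the text to place after the equals sign in the corresponding Components entry.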
Another natural question: since the imported registry keys contain file paths, can MODI be placed somewhere other than the CommonProgramFiles folder by editing those keys? The answer is no. In fact, in the Aaa.reg file exported by InstallRite 2.5, the path names in the original registry keys contain an illegal character (a question mark), yet this does not affect use, so I suspect the folder is hard-coded in MODI.
All in all, this is a last resort for machines without Office 2003/2007. If Office is already installed, it will not save much space, and there may be conflicting files or registry entries.