Using the OneNote COM component to implement OCR
Background
When developing business systems, we often need to recognize information in an image and enter it into the system. We wanted to automate that input. After comparing the accuracy of several OCR programs on Chinese text, we chose Microsoft OneNote to build the feature.
Preparations
Code Implementation Logic
using System;
using System.Configuration;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;
using Microsoft.Office.Interop.OneNote;

public class OrcImage
{
    // Working directory that holds the temporary OneNote section file.
    private static readonly string tmpPath = AppDomain.CurrentDomain.BaseDirectory + "tmpPath/";
    // Base wait time in milliseconds, read from the "WaitTime" application setting.
    private static readonly int waitTime = Convert.ToInt32(ConfigurationManager.AppSettings["WaitTime"]);

    private Tuple<string, int, int> GetBase64(string strImgPath)
    {
        return GetBase64(new FileInfo(strImgPath));
    }

    /// <summary>
    /// Obtains the Base64-encoded image data.
    /// </summary>
    /// <param name="file">The image file.</param>
    /// <returns>The Base64 string, the image width, and the image height.</returns>
    private Tuple<string, int, int> GetBase64(FileInfo file)
    {
        using (MemoryStream ms = new MemoryStream())
        using (Bitmap bp = new Bitmap(file.FullName))
        {
            switch (file.Extension.ToLower())
            {
                case ".jpg":
                case ".jpeg":
                    bp.Save(ms, ImageFormat.Jpeg);
                    break;

                case ".gif":
                    bp.Save(ms, ImageFormat.Gif);
                    break;

                case ".bmp":
                    bp.Save(ms, ImageFormat.Bmp);
                    break;

                case ".tiff":
                    bp.Save(ms, ImageFormat.Tiff);
                    break;

                case ".png":
                    bp.Save(ms, ImageFormat.Png);
                    break;

                case ".emf":
                    bp.Save(ms, ImageFormat.Emf);
                    break;

                default:
                    return new Tuple<string, int, int>("unsupported image format.", 0, 0);
            }
            // ToArray returns only the bytes written; GetBuffer would also include unused capacity.
            byte[] buffer = ms.ToArray();
            return new Tuple<string, int, int>(Convert.ToBase64String(buffer), bp.Width, bp.Height);
        }
    }

    public string Orc_Img(FileInfo fi)
    {
        // Insert the image through the OneNote API.
        var onenoteApp = new Microsoft.Office.Interop.OneNote.Application();

        string sectionID;
        onenoteApp.OpenHierarchy(tmpPath + "newfile.one", null, out sectionID, CreateFileType.cftSection);
        string pageID = "{A975EE72-19C3-4C80-9C0E-EDA576DAB5C6}{1}{B0}"; // format: {guid}{tab}{??}
        onenoteApp.CreateNewPage(sectionID, out pageID, NewPageStyle.npsBlankPageNoTitle);

        string notebookXml;
        onenoteApp.GetHierarchy(null, HierarchyScope.hsPages, out notebookXml);
        var doc = XDocument.Parse(notebookXml);
        var ns = doc.Root.Name.Namespace;
        // Note: this takes the first page found in the hierarchy; pageID above identifies the new page directly.
        var pageNode = doc.Descendants(ns + "Page").FirstOrDefault();
        if (pageNode != null)
        {
            // Read the ID only after the null check.
            var existingPageId = pageNode.Attribute("ID").Value;
            Tuple<string, int, int> imgInfo = this.GetBase64(fi);
            var page = new XDocument(new XElement(ns + "Page",
                new XElement(ns + "Outline",
                    new XElement(ns + "OEChildren",
                        new XElement(ns + "OE",
                            new XElement(ns + "Image",
                                new XAttribute("format", fi.Extension.Remove(0, 1)),
                                new XAttribute("originalPageNumber", "0"),
                                new XElement(ns + "Position",
                                    new XAttribute("x", "0"), new XAttribute("y", "0"), new XAttribute("z", "0")),
                                new XElement(ns + "Size",
                                    new XAttribute("width", imgInfo.Item2), new XAttribute("height", imgInfo.Item3)),
                                new XElement(ns + "Data", imgInfo.Item1)))))));
            page.Root.SetAttributeValue("ID", existingPageId);

            onenoteApp.UpdatePageContent(page.ToString(), DateTime.MinValue);

            // Sleep time in milliseconds. Larger images sleep longer so that OneNote can finish OCR.
            int fileSize = Convert.ToInt32(fi.Length / 1024 / 1024); // file size in MB
            System.Threading.Thread.Sleep(waitTime * (fileSize > 1 ? fileSize : 1)); // files under 1 MB count as 1 MB

            string pageXml;
            onenoteApp.GetPageContent(existingPageId, out pageXml, PageInfo.piBinaryData);

            XmlDocument xmlDoc = new XmlDocument();
            xmlDoc.LoadXml(pageXml);
            XmlNamespaceManager nsmgr = new XmlNamespaceManager(xmlDoc.NameTable);
            nsmgr.AddNamespace("one", ns.ToString());

            XmlNode xmlNode = xmlDoc.SelectSingleNode("//one:Image//one:OCRText", nsmgr);
            // OCRText is missing when OneNote has not recognized any text, so guard against null.
            string strRet = xmlNode != null ? xmlNode.InnerText : "not recognized";

            onenoteApp.DeleteHierarchy(sectionID, DateTime.MinValue, true); // destroy the temporary section and its page

            return strRet;
        }

        return "not recognized";
    }
}
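The fixed Thread.Sleep in the code above waits for a guessed amount of time. One alternative, sketched here, is to poll the page XML until an OCRText element appears or a timeout expires. OcrPoller, WaitForOcrText, and the getPageXml delegate are hypothetical names that are not part of the original program; in that program the delegate would wrap onenoteApp.GetPageContent. The sketch assumes OCRText only appears once recognition has finished, which I have not verified against OneNote.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Xml.Linq;

// Hypothetical helper, not part of the original program.
public static class OcrPoller
{
    // Calls getPageXml repeatedly until the returned page XML contains an
    // OCRText element, then returns its text. Returns null on timeout.
    public static string WaitForOcrText(Func<string> getPageXml, TimeSpan timeout, TimeSpan interval)
    {
        DateTime deadline = DateTime.UtcNow + timeout;
        while (true)
        {
            XDocument doc = XDocument.Parse(getPageXml());
            // Use the namespace of the root element instead of hard-coding it.
            XNamespace ns = doc.Root.Name.Namespace;
            XElement ocr = doc.Descendants(ns + "OCRText").FirstOrDefault();
            if (ocr != null)
            {
                return ocr.Value;
            }
            if (DateTime.UtcNow >= deadline)
            {
                return null; // assumption: the caller treats null as "not recognized"
            }
            Thread.Sleep(interval);
        }
    }
}
```

Taking a delegate instead of the OneNote application object keeps the waiting logic testable without OneNote installed.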
XML format
/* XML format of images in OneNote 2010
<one:Image format="" originalPageNumber="0" lastModifiedTime="" objectID="">
    <one:Position x="" y="" z=""/>
    <one:Size width="" height=""/>
    <one:Data>Base64</one:Data>

    // The following tags are generated automatically by OneNote 2010 and need not be processed by the program. The goal is to read the content of OCRText.
    <one:OCRData lang="en-US">
        <one:OCRText>
            <![CDATA[OCR text]]>
        </one:OCRText>
        <one:OCRToken startPos="0" region="0" line="0" x="4.251968383789062" y="3.685039281845092" width="31.18110275268555" height="7.370078563690185"/>
        <one:OCRToken startPos="7" region="0" line="0" x="39.40157318115234" y="3.685039281845092" width="13.32283401489258" height="8.78740119934082"/>
        <one:OCRToken startPos="12" region="0" line="1" x="4.251968383789062" y="17.85826683044434" width="23.52755928039551" height="6.803150177001953"/>
        <one:OCRToken startPos="18" region="0" line="1" x="32.031494140625" y="17.85826683044434" width="41.10236358642578" height="6.803150177001953"/>
        <one:OCRToken startPos="28" region="0" line="1" x="77.66928863525391" y="17.85826683044434" width="31.46456718444824" height="6.803150177001953"/>
        ...............
    </one:OCRData>
</one:Image>
*/

/* ObjectID format
The representation of an object to be used for identification of objects on a page. Not unique through OneNote, but unique on the page and the hierarchy.
<xsd:simpleType name="ObjectID">
    <xsd:restriction base="xsd:string">
        <xsd:pattern value="\{[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}\}\{[0-9]+\}\{[A-Z][0-9]+\}"/>
    </xsd:restriction>
</xsd:simpleType>
*/
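To make the XML layout above concrete, here is a minimal sketch that reads the OCRText content and splits it into tokens with LINQ to XML. OcrXmlReader and its two methods are hypothetical names. The token splitting assumes that startPos is a character offset into the OCRText string; the sample values above are consistent with that reading, but I have not confirmed it against the schema documentation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for reading OCR results out of OneNote page XML.
public static class OcrXmlReader
{
    // Returns the text of every OCRText element that sits under an Image element.
    public static List<string> ReadOcrText(string pageXml)
    {
        XDocument doc = XDocument.Parse(pageXml);
        XNamespace ns = doc.Root.Name.Namespace;
        return doc.Descendants(ns + "Image")
                  .SelectMany(img => img.Descendants(ns + "OCRText"))
                  .Select(t => t.Value)
                  .ToList();
    }

    // Splits each OCRText into tokens, cutting at the startPos of every OCRToken.
    // Assumption: startPos is a zero-based character offset into OCRText.
    public static List<string> ReadTokens(string pageXml)
    {
        XDocument doc = XDocument.Parse(pageXml);
        XNamespace ns = doc.Root.Name.Namespace;
        var tokens = new List<string>();
        foreach (XElement data in doc.Descendants(ns + "OCRData"))
        {
            string text = (string)data.Element(ns + "OCRText") ?? "";
            List<int> starts = data.Elements(ns + "OCRToken")
                                   .Select(t => (int)t.Attribute("startPos"))
                                   .Where(p => p >= 0 && p < text.Length)
                                   .OrderBy(p => p)
                                   .ToList();
            for (int i = 0; i < starts.Count; i++)
            {
                // A token runs up to the next token's start, or to the end of the text.
                int end = i + 1 < starts.Count ? starts[i + 1] : text.Length;
                tokens.Add(text.Substring(starts[i], end - starts[i]).Trim());
            }
        }
        return tokens;
    }
}
```

Both methods take the namespace from the root element of the page XML, so they do not depend on a hard-coded OneNote schema version.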
At present the feature is implemented as a desktop application. The plan was to let any system use the OCR function through a web service interface, but I ran into a problem when I converted it to a web program. I found only a little information online, and none of it solved the problem. As far as I can tell, every program that uses the OneNote OCR function is also a WinForms program. OneNote is started in the background while the program runs, so my guess is that this is why it only works as a desktop program.
Retrieving the COM class factory for component with CLSID {D7FAC39E-7FF1-49AA-98CF-A1DDD316337E} failed due to the following error: 80070005 Access is denied. (Exception from HRESULT: 0x80070005 (E_ACCESSDENIED)).
This is the error reported by the web program, and it is a permission issue. I tried to follow the usual DCOM configuration steps for the Excel and Word COM components, but no component with this ID is listed in DCOM. Thank you for your attention.
The recognition quality is good. What remains is to match the required information in the recognized text with regular expressions.
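As an illustration of that last step, here is a minimal sketch that pulls one field out of recognized text with a regular expression. The field (an invoice number), the pattern, and the names OcrFieldExtractor and ExtractInvoiceNo are all hypothetical; a real pattern depends on the documents being scanned.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical example of post-processing OCR text with a regular expression.
public static class OcrFieldExtractor
{
    // Matches "Invoice No", an optional dot and colon (ASCII or full-width),
    // then captures a run of letters, digits, and hyphens.
    private static readonly Regex InvoiceNo = new Regex(
        @"Invoice\s*No\b\.?\s*[:：]?\s*([A-Z0-9-]+)",
        RegexOptions.IgnoreCase);

    // Returns the captured invoice number, or null when the text has no match.
    public static string ExtractInvoiceNo(string ocrText)
    {
        if (string.IsNullOrEmpty(ocrText))
        {
            return null;
        }
        Match m = InvoiceNo.Match(ocrText);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

The pattern tolerates missing or extra whitespace between the label and the value, since OCR output often varies in spacing.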