[Original] Using the OpenCC library for Simplified/Traditional Chinese conversion (C++ code)

Source: Internet
Author: User

Recently a game product at my company ran into a font problem, and I wanted the Simplified/Traditional Chinese conversion to happen automatically in order to reduce the workload.

I first tried some of the conversion functions that ship with Windows, but a large number of characters came out wrong or could not be converted at all (a test with iconv showed the same problem).

So I am recording a short tutorial on using the OpenCC library, calling it from C++ to perform the character conversion.

Note: OpenCC is not an iconv-like library. It only converts between Simplified and Traditional Chinese; it does not convert between character encodings. Do not use it in iconv-style scenarios, and keep the distinction in mind.

Introduction to OpenCC:

Open Chinese Convert (OpenCC) is an open-source project for conversion between Traditional Chinese and Simplified Chinese, supporting character-level conversion, phrase-level conversion, variant conversion and regional idioms among Mainland China, Taiwan and Hong Kong.

Features

  • Strictly distinguishes "one Simplified character to many Traditional characters" from "one Simplified character to many variants".
  • Fully compatible with variant characters, which can be substituted dynamically.
  • Strictly proofreads the one-Simplified-to-many-Traditional entries, on the principle "if they can be distinguished, do not merge them".
  • Supports variant characters and regional vocabulary conversion for Mainland China, Taiwan and Hong Kong, such as 「裏」 and 「裡」, or 「鼠標」 and 「滑鼠」 (mouse).
  • The dictionaries and the function library are completely separate, so the dictionaries can be freely modified, imported and extended.
  • Supports C, C++, Python, PHP, Java, Ruby, Node.js and Android.
  • Compatible with Windows, Linux, Mac platforms.

As the introduction shows, OpenCC is a fairly complete Simplified/Traditional conversion library. Note that OpenCC is only for Simplified/Traditional Chinese conversion and does not apply to the scripts of other languages. For other languages and encodings, please use iconv.

An evaluation of OpenCC: http://linux-wiki.cn/wiki/zh-hans/%E7%AE%80%E7%B9%81%E8%BD%AC%E6%8D%A2

Online Simplified/Traditional conversion test: http://opencc.byvoid.com/

This article covers development on the Windows platform. For Linux and other platforms, please refer to the official documentation.

1. Installing and compiling OpenCC

OpenCC download: https://github.com/BYVoid/OpenCC

At the time of writing, the latest version is 1.0.4.

After downloading and extracting the source, and installing CMake, run the following commands in the source directory:

cmake -H. -Bbuild -G "Visual Studio " -DCMAKE_INSTALL_PREFIX="path/to/install"
cmake --build build --config Release --target install

Note that builds produced by VS2013 and above are not compatible with Windows XP by default; if you need XP, change the settings in the generated .sln.

Pay particular attention to the install path: with a relative path it is easy to end up with a duplicated path, and the build then fails.

If you only need the development library, it is enough to run:

cmake -H. -Bbuild -G "Visual Studio " -DCMAKE_INSTALL_PREFIX="path/to/install"

The build directory is then generated; open the .sln file inside it.

Note that OpenCC uses many C++11 features and is difficult to compile with anything older than VS2013. If you call the DLL from a compiler older than VS2013, be careful not to use the sample code published online as-is; I describe how to do the conversion below.

The next step is to build OpenCC, then publish and integrate the resulting files (the DLL, the include headers and so on). This is very simple, so I will not cover it separately.

In version 1.0.4 the project opencc_phrase_extract fails to compile. There is a corresponding issue on GitHub; simply remove that project. This does not affect normal use, so it can be ignored.

2. Using OpenCC in code

After completing the steps above, we can use OpenCC in our own code to convert Simplified Chinese to Traditional Chinese.

One special note: OpenCC only performs Simplified/Traditional conversion on UTF-8 text. It does not convert between encodings the way iconv does, so the code below makes heavy use of Boost.Locale. If you are not used to it, you can replace it with iconv.
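Because OpenCC expects UTF-8 input, it can help to check a string before handing it over. The following is only a minimal sketch and not part of OpenCC: the function name LooksLikeUtf8 is hypothetical, and the check is purely structural (lead and continuation bytes), so it does not reject overlong encodings or surrogate code points. A failed check shows that the text is not UTF-8, for example raw GBK bytes; a passed check is not a proof.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: structural UTF-8 check (lead/continuation bytes only).
// It does not reject overlong encodings or surrogate code points.
bool LooksLikeUtf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;
        if (c < 0x80)                extra = 0;   // ASCII
        else if ((c & 0xE0) == 0xC0) extra = 1;   // 2-byte sequence
        else if ((c & 0xF0) == 0xE0) extra = 2;   // 3-byte sequence
        else if ((c & 0xF8) == 0xF0) extra = 3;   // 4-byte sequence
        else return false;                        // stray continuation or invalid lead byte
        if (s.size() - i - 1 < extra)
            return false;                         // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char t = static_cast<unsigned char>(s[i + k]);
            if ((t & 0xC0) != 0x80)
                return false;                     // not a continuation byte
        }
        i += extra + 1;
    }
    return true;
}
```

If the check fails on text that is known to be GBK, convert it to UTF-8 first (with Boost.Locale or iconv) and only then pass it to OpenCC.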

  

What the configuration files mean (excerpt from the official GitHub repository):

Configurations

Preset configuration files
  • s2t.json: Simplified Chinese to Traditional Chinese
  • t2s.json: Traditional Chinese to Simplified Chinese
  • s2tw.json: Simplified Chinese to Traditional Chinese (Taiwan Standard)
  • tw2s.json: Traditional Chinese (Taiwan Standard) to Simplified Chinese
  • s2hk.json: Simplified Chinese to Traditional Chinese (Hong Kong Standard)
  • hk2s.json: Traditional Chinese (Hong Kong Standard) to Simplified Chinese
  • s2twp.json: Simplified Chinese to Traditional Chinese (Taiwan Standard) with Taiwanese idiom
  • tw2sp.json: Traditional Chinese (Taiwan Standard) to Simplified Chinese with Mainland Chinese idiom
  • t2tw.json: Traditional Chinese (OpenCC Standard) to Taiwan Standard
  • t2hk.json: Traditional Chinese (OpenCC Standard) to Hong Kong Standard

For ordinary Simplified/Traditional conversion I recommend the s2t or t2s configuration file. For chat and similar content, s2tw or tw2s is generally the better choice. For the rest, test them yourself before choosing.

The content of a configuration file is very simple: it points to the corresponding .ocd dictionary files. Take s2t.json as an example:

{
  "name": "Simplified Chinese to Traditional Chinese",
  "segmentation": {
    "type": "mmseg",
    "dict": {
      "type": "ocd",
      "file": "STPhrases.ocd"
    }
  },
  "conversion_chain": [{
    "dict": {
      "type": "group",
      "dicts": [{
        "type": "ocd",
        "file": "STPhrases.ocd"
      }, {
        "type": "ocd",
        "file": "STCharacters.ocd"
      }]
    }
  }]
}

The configuration refers to several .ocd files, but you may find that you do not have any. That is because the .ocd files have to be generated with OpenCC's own tool, and they end up under build\data rather than under data\dictionary, so look there.

If you would rather not generate them, you can use the .txt dictionaries instead. The .ocd format only exists to speed up loading. If you do not care about that speed difference, I recommend the .txt files, which can be found in the data\dictionary directory. In that case the .ocd entries in the configuration file have to be changed to .txt, for example:

{
  "name": "Simplified Chinese to Traditional Chinese",
  "segmentation": {
    "type": "mmseg",
    "dict": {
      "type": "text",
      "file": "STPhrases.txt"
    }
  },
  "conversion_chain": [{
    "dict": {
      "type": "group",
      "dicts": [{
        "type": "text",
        "file": "STPhrases.txt"
      }, {
        "type": "text",
        "file": "STCharacters.txt"
      }]
    }
  }]
}
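The change from .ocd to .txt dictionaries described above is mechanical, so it could also be applied to the text of a preset configuration programmatically. The sketch below assumes the configuration has already been read into a string; ReplaceAll and UseTextDictionaries are hypothetical helper names, not OpenCC functions, and the rewrite is a plain string substitution rather than real JSON parsing.

```cpp
#include <string>

// Replaces every occurrence of `from` in `text` with `to`.
static std::string ReplaceAll(std::string text, const std::string& from,
                              const std::string& to)
{
    if (from.empty())
        return text;
    std::string::size_type pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();   // continue after the inserted text
    }
    return text;
}

// Hypothetical helper: switches a preset config from .ocd to .txt dictionaries
// by rewriting the "ocd" type values and the .ocd file extensions.
std::string UseTextDictionaries(const std::string& json)
{
    const std::string typed = ReplaceAll(json, "\"ocd\"", "\"text\"");
    return ReplaceAll(typed, ".ocd\"", ".txt\"");
}
```

The result would still need to be written back to disk, with the matching .txt files copied from data\dictionary next to it.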

With the configuration files covered, we can start writing our own code. Note that if you call opencc.dll from an older compiler such as VS2005, many of the tutorials on the web turn out to be wrong, because they produce a bad_alloc exception. The cause is an incompatibility: calls made through the standard C++ interface fail without exception (if you know a better way, please contact me). My workaround is to call the C functions provided by OpenCC directly.

For example, a function that converts GBK text to Big5:

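// lc is assumed to be: namespace lc = boost::locale::conv; (from <boost/locale.hpp>)
opencc_t gs2twhwd = NULL;
if (gs2twhwd == NULL)
    gs2twhwd = opencc_open("s2t.json");
// Step 1: convert the GBK input to UTF-8
std::string szConvsert = lc::to_utf<char>(szGbk, "GBK");
// Step 2: convert Simplified to Traditional; the returned buffer must be freed
char* pszResult = opencc_convert_utf8(gs2twhwd, szConvsert.c_str(), szConvsert.size());
if (pszResult != NULL) {
    szConvsert = pszResult;
    opencc_convert_utf8_free(pszResult);
}
// Step 3: convert to the local character set (Big5)
szConvsert = lc::from_utf(szConvsert, "BIG5");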

As you can see, I use the plain C functions, so there is no compatibility problem between an older version of VC++ and a newer one.
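One detail of the C interface is worth spelling out: a convert function that returns a char* hands ownership of that buffer to the caller. The sketch below shows the copy-then-free pattern in isolation. ConvertAndFree is a hypothetical helper, written as a template so that it does not depend on the OpenCC headers, and it assumes that failure is signalled by a null pointer.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: calls a C-style convert function that returns a
// heap-allocated char*, copies the result into a std::string, then releases
// the buffer with the matching free function.
template <typename Handle, typename ConvertFn, typename FreeFn>
std::string ConvertAndFree(Handle handle, const std::string& utf8,
                           ConvertFn convertFn, FreeFn freeFn)
{
    char* raw = convertFn(handle, utf8.c_str(), utf8.size());
    if (raw == NULL)
        throw std::runtime_error("conversion failed");
    std::string result(raw);   // copy before releasing the C buffer
    freeFn(raw);
    return result;
}
```

With OpenCC this would presumably be called as ConvertAndFree(gs2twhwd, utf8Text, opencc_convert_utf8, opencc_convert_utf8_free), where utf8Text is the string produced by the GBK to UTF-8 step.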

3. Publishing OpenCC with your program

Once the code above is written, the next step is to publish OpenCC together with the program so that it runs correctly.

Publishing is very simple. For example, when using s2t.json and t2s.json, place the configuration files at the same level as the program's executable, and put the .ocd or .txt files in the same directory.

Note that we only need to publish the files we actually use. Shipping files that are never used only increases the size of the release.
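To decide which dictionary files actually have to ship, one option is to list the "file" entries of the configuration files in use. The sketch below is a naive text scan under stated assumptions, not a JSON parser and not an OpenCC feature: ReferencedDictionaries is a hypothetical name, the configuration text is assumed to be already loaded into a string, and escape sequences such as \\ are returned unchanged.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: collects the distinct values of every "file" key in a
// configuration's JSON text, in order of first appearance.
std::vector<std::string> ReferencedDictionaries(const std::string& json)
{
    std::vector<std::string> files;
    const std::string key = "\"file\"";
    std::string::size_type pos = 0;
    while ((pos = json.find(key, pos)) != std::string::npos) {
        pos += key.size();
        const std::string::size_type colon = json.find(':', pos);
        if (colon == std::string::npos) break;
        const std::string::size_type open = json.find('"', colon);
        if (open == std::string::npos) break;
        const std::string::size_type close = json.find('"', open + 1);
        if (close == std::string::npos) break;
        const std::string name = json.substr(open + 1, close - open - 1);
        if (std::find(files.begin(), files.end(), name) == files.end())
            files.push_back(name);   // skip names already collected
        pos = close + 1;
    }
    return files;
}
```

For the s2t.json example with text dictionaries, this yields STPhrases.txt and STCharacters.txt, which are then the only dictionary files that need to be published for that configuration.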

Using a custom directory:

For example, if I want to put all the configuration files in a lang directory, the configuration file is loaded with:

gs2twhwd = opencc_open("lang\\s2t.json");

The configuration file is also modified to:

{
  "name": "Simplified Chinese to Traditional Chinese",
  "segmentation": {
    "type": "mmseg",
    "dict": {
      "type": "text",
      "file": "lang\\STPhrases.txt"
    }
  },
  "conversion_chain": [{
    "dict": {
      "type": "group",
      "dicts": [{
        "type": "text",
        "file": "lang\\STPhrases.txt"
      }, {
        "type": "text",
        "file": "lang\\STCharacters.txt"
      }]
    }
  }]
}

The data can then be loaded from other locations. Of course, you have to test this yourself. The paths above are all relative, so your program has to manage its working directory accordingly, otherwise errors are likely.
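One way to depend less on the working directory is to build the configuration path from a known base directory, such as the directory of the executable. The sketch below only shows the path joining. JoinPath is a hypothetical helper, and how the base directory is obtained is left open (on Windows it could come from the Win32 GetModuleFileName call, for instance). It makes only the path passed to opencc_open absolute; the dictionary paths inside the JSON stay as written there.

```cpp
#include <string>

// Hypothetical helper: joins a base directory and a relative path with a
// single backslash, so the result does not depend on the working directory.
std::string JoinPath(const std::string& baseDir, const std::string& relative)
{
    if (baseDir.empty())
        return relative;
    const char last = baseDir[baseDir.size() - 1];
    if (last == '\\' || last == '/')
        return baseDir + relative;      // base already ends with a separator
    return baseDir + "\\" + relative;
}
```

The call from the example would then look like opencc_open(JoinPath(exeDir, "lang\\s2t.json").c_str()), where exeDir is an assumed variable holding the executable's directory.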
