Java open-source framework: JPinyin, an open-source Java library for converting Chinese characters to pinyin

Source: Internet
Author: User

 
1. Introduction

 

JPinyin is an open-source Java class library for converting Chinese characters to pinyin. It builds on the functionality of pinyin4j and adds some improvements.
[Main features of JPinyin]
1. Accurate and complete character dictionary;
Of the 20,903 Chinese characters in the Unicode range 4E00-9FA5 plus 3007 (〇), JPinyin can convert all but 46 variant characters, which have no standard pinyin;
2. Fast pinyin conversion;
In testing, converting the 20,902 Chinese characters in the Unicode range 4E00-9FA5 took about 100 milliseconds;
3. Support for multiple pinyin output formats;
JPinyin supports several output formats: with tone marks, without tone marks, with digits indicating the tone, and first letters only;
4. Recognition of common polyphonic characters;
JPinyin recognizes common polyphonic characters in context, including in phrases, idioms, and place names;
5. Conversion between Simplified and Traditional Chinese.
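
To make the output formats in feature 3 concrete, here is a minimal, self-contained sketch of how one tone-marked syllable can be turned into the tone-number, toneless, and first-letter forms. The class and method names (ToneFormatSketch, toToneNumber, stripTone, firstLetter) are hypothetical and are not part of JPinyin's API; the sketch only mirrors the table-driven idea used in the code later in this article, and assumes 'v' stands in for 'ü' and the digit 5 marks the neutral tone.

```java
public class ToneFormatSketch {
    // Unmarked vowels; 'v' stands in for 'ü' (assumption shared with the article's code).
    private static final String UNMARKED = "aeiouv";
    // Four tone-marked forms per vowel, in the same vowel order as UNMARKED.
    private static final String MARKED = "āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ";

    /** Tone mark to tone number, e.g. "zhōng" becomes "zhong1". */
    public static String toToneNumber(String syllable) {
        String s = syllable.replace('ü', 'v');
        for (int i = 0; i < s.length(); i++) {
            int idx = MARKED.indexOf(s.charAt(i));
            if (idx >= 0) {
                char plain = UNMARKED.charAt(idx / 4); // which vowel
                int tone = idx % 4 + 1;                // which of the four tones
                return s.replace(s.charAt(i), plain) + tone;
            }
        }
        return s + 5; // no tone mark found: treat as neutral tone
    }

    /** Drops the tone entirely, e.g. "zhōng" becomes "zhong". */
    public static String stripTone(String syllable) {
        String numbered = toToneNumber(syllable);
        return numbered.substring(0, numbered.length() - 1);
    }

    /** First letter of the toneless syllable. */
    public static char firstLetter(String syllable) {
        return stripTone(syllable).charAt(0);
    }

    public static void main(String[] args) {
        System.out.println(toToneNumber("zhōng")); // tone-number format
        System.out.println(stripTone("zhōng"));    // toneless format
        System.out.println(firstLetter("zhōng"));  // first-letter format
    }
}
```

The position of a marked vowel in the 24-character table encodes both the base vowel (index divided by 4) and the tone (index modulo 4, plus 1), so no per-vowel branching is needed.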

 


2. Simple Optimization

 

Under the hood, every conversion is a key-value lookup in one of several mapping tables. The original author keeps these data dictionaries in static fields, so they stay in memory for the lifetime of the process, even though conversion functions like these are rarely used in a typical application. The optimization here is to hold the dictionaries in instance fields instead, so that they can be released along with the helper object when memory runs low.
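
The idea can be illustrated with a small sketch, independent of JPinyin and Android. The class name InstanceTableSketch and the inline sample data are made up for illustration; a real table would be loaded from a file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class InstanceTableSketch {
    // Instance field, not static: the table is reachable only through this object,
    // so it becomes eligible for garbage collection once the object is dropped.
    private final Properties table = new Properties();

    public InstanceTableSketch(String data) {
        try {
            table.load(new StringReader(data)); // lines of the form key=value
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns all readings for a character, or null if it is not in the table. */
    public String[] lookup(char c) {
        String value = table.getProperty(String.valueOf(c));
        return value == null ? null : value.split(",");
    }

    public static void main(String[] args) {
        InstanceTableSketch helper = new InstanceTableSketch("中=zhōng,zhòng\n国=guó\n");
        System.out.println(String.join("/", helper.lookup('中')));
        helper = null; // no references left: the table can now be reclaimed
    }
}
```

With a static field the table would stay reachable from the class itself until the class is unloaded, which in practice usually means until the process exits.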

 

In this version the .db data dictionaries are stored under the assets/py directory of the Android project and loaded through the AssetManager.

 

import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Properties;
import java.util.Set;

import android.content.Context;
import android.content.res.AssetManager;

public class PinyinHelper {
    public Properties PINYIN_TABLE;
    public Properties MUTIL_PINYIN_TABLE;
    public ChineseHelper chineseHelper;
    public String PINYIN_SEPARATOR = ","; // pinyin separator
    public String ALL_UNMARKED_VOWEL = "aeiouv";
    public String ALL_MARKED_VOWEL = "āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ"; // all tone-marked pinyin vowels

    public PinyinHelper(Context context) {
        AssetManager assetManager = context.getAssets();
        PINYIN_TABLE = new Properties();
        MUTIL_PINYIN_TABLE = new Properties();
        Properties chineseProperties = new Properties();
        try {
            PINYIN_TABLE.load(assetManager.open("py/pinyin.db"));
            MUTIL_PINYIN_TABLE.load(assetManager.open("py/mutil_pinyin.db"));
            chineseProperties.load(assetManager.open("py/chinese.db"));
        } catch (IOException e) {
            e.printStackTrace();
        }
        chineseHelper = new ChineseHelper(chineseProperties);
    }

    /**
     * Converts pinyin in tone-mark format to pinyin in tone-number format.
     *
     * @param pinyinArrayString pinyin in tone-mark format
     * @return pinyin in tone-number format
     */
    private String[] convertWithToneNumber(String pinyinArrayString) {
        String[] pinyinArray = pinyinArrayString.split(PINYIN_SEPARATOR);
        for (int i = pinyinArray.length - 1; i >= 0; i--) {
            boolean hasMarkedChar = false;
            String originalPinyin = pinyinArray[i].replaceAll("ü", "v"); // replace ü with v
            for (int j = originalPinyin.length() - 1; j >= 0; j--) {
                char originalChar = originalPinyin.charAt(j);
                // Look for a tone-marked letter; if found, replace it with the
                // corresponding unmarked letter and append the tone number.
                if (originalChar < 'a' || originalChar > 'z') {
                    int indexInAllMarked = ALL_MARKED_VOWEL.indexOf(originalChar);
                    int toneNumber = indexInAllMarked % 4 + 1; // tone number
                    char replaceChar = ALL_UNMARKED_VOWEL.charAt((indexInAllMarked - indexInAllMarked % 4) / 4);
                    pinyinArray[i] = originalPinyin.replaceAll(String.valueOf(originalChar),
                            String.valueOf(replaceChar)) + toneNumber;
                    hasMarkedChar = true;
                    break;
                }
            }
            if (!hasMarkedChar) {
                // No tone-marked letter means the neutral tone, represented by 5.
                pinyinArray[i] = originalPinyin + "5";
            }
        }
        return pinyinArray;
    }

    /**
     * Converts pinyin in tone-mark format to pinyin without tones.
     *
     * @param pinyinArrayString pinyin in tone-mark format
     * @return pinyin without tones
     */
    public String[] convertWithoutTone(String pinyinArrayString) {
        String[] pinyinArray;
        for (int i = ALL_MARKED_VOWEL.length() - 1; i >= 0; i--) {
            char originalChar = ALL_MARKED_VOWEL.charAt(i);
            char replaceChar = ALL_UNMARKED_VOWEL.charAt((i - i % 4) / 4);
            pinyinArrayString = pinyinArrayString.replaceAll(String.valueOf(originalChar),
                    String.valueOf(replaceChar));
        }
        // replace ü with v
        pinyinArray = pinyinArrayString.replaceAll("ü", "v").split(PINYIN_SEPARATOR);

        // Removing tones can produce duplicates, so deduplicate while keeping order.
        Set<String> pinyinSet = new LinkedHashSet<String>();
        for (String pinyin : pinyinArray) {
            pinyinSet.add(pinyin);
        }
        return pinyinSet.toArray(new String[pinyinSet.size()]);
    }

    /**
     * Formats tone-marked pinyin into the requested format.
     *
     * @param pinyinString tone-marked pinyin
     * @param pinyinFormat WITH_TONE_NUMBER: digits indicate the tone; WITHOUT_TONE: no tone;
     *                     WITH_TONE_MARK: tone marks
     * @return the pinyin in the requested format
     */
    private String[] formatPinyin(String pinyinString, PinyinFormat pinyinFormat) {
        if (pinyinFormat == PinyinFormat.WITH_TONE_MARK) {
            return pinyinString.split(PINYIN_SEPARATOR);
        } else if (pinyinFormat == PinyinFormat.WITH_TONE_NUMBER) {
            return convertWithToneNumber(pinyinString);
        } else if (pinyinFormat == PinyinFormat.WITHOUT_TONE) {
            return convertWithoutTone(pinyinString);
        }
        return null;
    }

    /**
     * Converts a single Chinese character to pinyin in the requested format.
     *
     * @param c            the Chinese character to convert
     * @param pinyinFormat WITH_TONE_NUMBER, WITHOUT_TONE, or WITH_TONE_MARK
     * @return the pinyin readings of the character, or null if none is known
     */
    public String[] convertToPinyinArray(char c, PinyinFormat pinyinFormat) {
        String pinyin = PINYIN_TABLE.getProperty(String.valueOf(c));
        if ((pinyin != null) && (!pinyin.equals("null"))) {
            return formatPinyin(pinyin, pinyinFormat);
        }
        return null;
    }

    /**
     * Converts a single Chinese character to tone-marked pinyin.
     *
     * @param c the Chinese character to convert
     * @return the pinyin readings of the character
     */
    public String[] convertToPinyinArray(char c) {
        return convertToPinyinArray(c, PinyinFormat.WITH_TONE_MARK);
    }

    /**
     * Converts a string to pinyin in the requested format.
     *
     * @param str          the string to convert
     * @param separator    the pinyin separator
     * @param pinyinFormat WITH_TONE_NUMBER, WITHOUT_TONE, or WITH_TONE_MARK
     * @return the converted pinyin string
     */
    public String convertToPinyinString(String str, String separator, PinyinFormat pinyinFormat) {
        str = chineseHelper.convertToSimplifiedChinese(str);
        StringBuilder sb = new StringBuilder();
        for (int i = 0, len = str.length(); i < len; i++) {
            char c = str.charAt(i);
            // Is it a Chinese character or 〇?
            if (ChineseHelper.isChinese(c) || c == '〇') {
                // Polyphonic-character handling
                boolean isFoundFlag = false;
                int rightMove = 3;
                // Combine the current character with the next 3, 2, and 1 characters in turn
                // and check whether the result is a known polyphonic phrase.
                for (int rightIndex = (i + rightMove) < len ? (i + rightMove) : (len - 1);
                        rightIndex > i; rightIndex--) {
                    String cizu = str.substring(i, rightIndex + 1);
                    if (MUTIL_PINYIN_TABLE.containsKey(cizu)) {
                        String[] pinyinArray = formatPinyin(MUTIL_PINYIN_TABLE.getProperty(cizu), pinyinFormat);
                        for (int j = 0, l = pinyinArray.length; j < l; j++) {
                            sb.append(pinyinArray[j]);
                            if (j < l - 1) {
                                sb.append(separator);
                            }
                        }
                        i = rightIndex;
                        isFoundFlag = true;
                        break;
                    }
                }
                if (!isFoundFlag) {
                    String[] pinyinArray = convertToPinyinArray(str.charAt(i), pinyinFormat);
                    if (pinyinArray != null) {
                        sb.append(pinyinArray[0]);
                    } else {
                        sb.append(str.charAt(i));
                    }
                }
                if (i < len - 1) {
                    sb.append(separator);
                }
            } else {
                sb.append(c);
                if ((i + 1) < len && ChineseHelper.isChinese(str.charAt(i + 1))) {
                    sb.append(separator);
                }
            }
        }
        return sb.toString();
    }

    /**
     * Converts a string to tone-marked pinyin.
     *
     * @param str       the string to convert
     * @param separator the pinyin separator
     * @return the converted tone-marked pinyin string
     */
    public String convertToPinyinString(String str, String separator) {
        return convertToPinyinString(str, separator, PinyinFormat.WITH_TONE_MARK);
    }

    /**
     * Determines whether a Chinese character is polyphonic.
     *
     * @param c the Chinese character
     * @return true if the character has more than one reading, otherwise false
     */
    public boolean hasMultiPinyin(char c) {
        String[] pinyinArray = convertToPinyinArray(c);
        if (pinyinArray != null && pinyinArray.length > 1) {
            return true;
        }
        return false;
    }

    /**
     * Gets the first letters of the pinyin for a string.
     *
     * @param str the string to convert
     * @return the first letter of each character's pinyin
     */
    public String getShortPinyin(String str) {
        String separator = "#"; // use # as the pinyin separator
        StringBuilder sb = new StringBuilder();
        char[] charArray = new char[str.length()];
        for (int i = 0, len = str.length(); i < len; i++) {
            char c = str.charAt(i);
            // If it is neither a Chinese character nor 〇, copy the character as-is.
            if (!ChineseHelper.isChinese(c) && c != '〇') {
                charArray[i] = c;
            } else {
                int j = i + 1;
                sb.append(c);
                // Collect the run of consecutive Chinese characters.
                while (j < len && (ChineseHelper.isChinese(str.charAt(j)) || str.charAt(j) == '〇')) {
                    sb.append(str.charAt(j));
                    j++;
                }
                String hanziPinyin = convertToPinyinString(sb.toString(), separator, PinyinFormat.WITHOUT_TONE);
                String[] pinyinArray = hanziPinyin.split(separator);
                for (String string : pinyinArray) {
                    charArray[i] = string.charAt(0);
                    i++;
                }
                i--;
                sb.delete(0, sb.length());
                sb.trimToSize();
            }
        }
        return String.valueOf(charArray);
    }
}
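
The polyphonic-phrase step in convertToPinyinString can be hard to follow inside the full method, so here is a stripped-down sketch of just that idea: at each position, try the longest candidate phrase first (the current character plus up to three following ones) and fall back to a single-character lookup. The class name PhraseLookupSketch and the tiny in-memory tables are hypothetical stand-ins for the real dictionaries, and the separator handling is simplified compared with the method above.

```java
import java.util.HashMap;
import java.util.Map;

public class PhraseLookupSketch {
    private final Map<String, String> chars = new HashMap<>();   // character -> readings
    private final Map<String, String> phrases = new HashMap<>(); // phrase -> reading per character

    public PhraseLookupSketch() {
        // Sample data for illustration only.
        chars.put("重", "zhòng,chóng");
        chars.put("庆", "qìng");
        chars.put("要", "yào");
        phrases.put("重庆", "chóng,qìng");
    }

    public String convert(String str, String separator) {
        StringBuilder sb = new StringBuilder();
        int len = str.length();
        int i = 0;
        while (i < len) {
            int next = -1;
            // Longest candidate first: 4 characters, then 3, then 2.
            for (int end = Math.min(i + 4, len); end > i + 1; end--) {
                String pinyin = phrases.get(str.substring(i, end));
                if (pinyin != null) {
                    sb.append(pinyin.replace(",", separator));
                    next = end; // skip the whole matched phrase
                    break;
                }
            }
            if (next < 0) {
                // No phrase matched: use the first listed reading, or keep the character.
                String pinyin = chars.get(String.valueOf(str.charAt(i)));
                sb.append(pinyin != null ? pinyin.split(",")[0] : String.valueOf(str.charAt(i)));
                next = i + 1;
            }
            i = next;
            if (i < len) {
                sb.append(separator);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        PhraseLookupSketch sketch = new PhraseLookupSketch();
        System.out.println(sketch.convert("重庆", " ")); // phrase table decides the reading of 重
        System.out.println(sketch.convert("重要", " ")); // falls back to the first single-character reading
    }
}
```

Trying longer phrases before shorter ones matters because a short phrase could otherwise shadow a longer, more specific entry that starts at the same position.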
   
  
import java.util.Iterator;
import java.util.Map.Entry;
import java.util.Properties;

public class ChineseHelper {
    private Properties CHINESE_TABLE;

    public ChineseHelper(Properties properties) {
        CHINESE_TABLE = properties;
    }

    /**
     * Converts a single traditional Chinese character to simplified.
     *
     * @param c the traditional character to convert
     * @return the simplified character, or c itself if it is not traditional
     */
    public char convertToSimplifiedChinese(char c) {
        if (isTraditionalChinese(c)) {
            return CHINESE_TABLE.getProperty(String.valueOf(c)).charAt(0);
        }
        return c;
    }

    /**
     * Converts a single simplified Chinese character to traditional.
     *
     * @param c the simplified character to convert
     * @return the converted traditional character
     */
    public char convertToTraditionalChinese(char c) {
        String hanzi = String.valueOf(c);
        if (CHINESE_TABLE.containsValue(hanzi)) {
            Iterator<Entry<Object, Object>> itr = CHINESE_TABLE.entrySet().iterator();
            while (itr.hasNext()) {
                Entry<Object, Object> e = itr.next();
                if (e.getValue().toString().equals(hanzi)) {
                    return e.getKey().toString().charAt(0);
                }
            }
        }
        return c;
    }

    /**
     * Converts traditional Chinese text to simplified.
     *
     * @param str the traditional text to convert
     * @return the converted simplified text
     */
    public String convertToSimplifiedChinese(String str) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0, len = str.length(); i < len; i++) {
            char c = str.charAt(i);
            sb.append(convertToSimplifiedChinese(c));
        }
        return sb.toString();
    }

    /**
     * Converts simplified Chinese text to traditional.
     *
     * @param str the simplified text to convert
     * @return the converted traditional text
     */
    public String convertToTraditionalChinese(String str) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0, len = str.length(); i < len; i++) {
            char c = str.charAt(i);
            sb.append(convertToTraditionalChinese(c));
        }
        return sb.toString();
    }

    /**
     * Determines whether a character is a traditional Chinese character.
     *
     * @param c the character to check
     * @return true if it is a traditional character, otherwise false
     */
    public boolean isTraditionalChinese(char c) {
        return CHINESE_TABLE.containsKey(String.valueOf(c));
    }

    /**
     * Determines whether a character is a Chinese character.
     *
     * @param c the character to check
     * @return true if it is a Chinese character, otherwise false
     */
    public static boolean isChinese(char c) {
        String regex = "[\\u4e00-\\u9fa5]";
        return String.valueOf(c).matches(regex);
    }
}
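
The simplified-to-traditional direction above scans the whole table for every character, because the table is keyed by the traditional form. As one possible variation in the same spirit as the optimization in this article, the reverse mapping could be built once and then looked up directly. The sketch below is an assumption for illustration, not code from JPinyin: the class name ScriptTableSketch and the two sample pairs are made up, and when several traditional forms share one simplified form it keeps the first one added, which may differ from what the table scan returns.

```java
import java.util.HashMap;
import java.util.Map;

public class ScriptTableSketch {
    private final Map<Character, Character> tradToSimp = new HashMap<>();
    private final Map<Character, Character> simpToTrad = new HashMap<>();

    public ScriptTableSketch() {
        // Sample pairs for illustration only.
        add('國', '国');
        add('學', '学');
    }

    private void add(char traditional, char simplified) {
        tradToSimp.put(traditional, simplified);
        // Keep the first traditional form if several map to the same simplified form.
        simpToTrad.putIfAbsent(simplified, traditional);
    }

    public String toSimplified(String s) {
        return map(s, tradToSimp);
    }

    public String toTraditional(String s) {
        return map(s, simpToTrad);
    }

    // Maps each character through the table, leaving unknown characters unchanged.
    private static String map(String s, Map<Character, Character> table) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            char mapped = table.getOrDefault(c, c);
            sb.append(mapped);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ScriptTableSketch table = new ScriptTableSketch();
        System.out.println(table.toSimplified("中國"));
        System.out.println(table.toTraditional("学"));
    }
}
```

The trade-off is memory: the reverse map roughly doubles the size of the conversion table while it is loaded.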
   
  
public class PinyinFormat {
    private String name;

    public static final PinyinFormat WITH_TONE_MARK = new PinyinFormat("WITH_TONE_MARK");
    public static final PinyinFormat WITHOUT_TONE = new PinyinFormat("WITHOUT_TONE");
    public static final PinyinFormat WITH_TONE_NUMBER = new PinyinFormat("WITH_TONE_NUMBER");

    protected PinyinFormat(String name) {
        this.name = name;
    }

    protected String getName() {
        return this.name;
    }
}

 

 

