Introducing Chinese dialect convert, an open-source dialect word segmentation and conversion tool

Source: Internet
Author: User

With the popularity of Xuanyuan Sword, its male lead Chen Jingqiu has become famous across the country as the "dialect prince" of idol dramas. His standard Hanzhong (Shaanxi) dialect makes audiences laugh, and dialects are gradually being accepted and presented on stage in various art forms, bringing joy to everyone. At the same time, more and more audio databases are being set up around the country to protect local dialects.

According to Li Weihong, Deputy Minister of Education and director of the National Language Commission, the Chinese Language Resource audio database comprehensively and scientifically documents the current state of China's minority languages and Chinese dialects, protecting the cultural heritage of ethnic languages. Its most distinctive feature is sound: it collects real speech and builds a corpus of recordings together with their transcriptions.

"The Language Resource audio database will record how Chinese people speak in the 21st century, so that in 50 or 100 years our children and grandchildren can hear us," said Li Yuming, deputy director of the National Language Commission and a director at the Ministry of Education.

Since we can already build software that translates Chinese into English, why not build software that segments Mandarin text and translates it into dialect, to help protect this cultural heritage? Chinese dialect convert is such a tool, based on dialect word segmentation and translation. Its core is a maximum-granularity word segmenter driven by a dialect segmentation dictionary. The rest of this article describes the structure and implementation of the software.

Figure 1 Overall Software Process

Dialect Segmenter

The dialect segmenter is built on IK Analyzer, with some modifications. It performs dictionary-based maximum-match Chinese word segmentation. The dictionary uses the same storage structure as IK, namely arrays combined with hash tables. Sentences are segmented with forward maximum matching: at each position, the longest word found in the dialect dictionary becomes the token (word element). Ideally the segmentation granularity would vary with context, but for simplicity all words are segmented by maximum match.
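Forward maximum matching can be sketched in a few lines. The following is only an illustration of the general technique, not the project's code: the class name `ForwardMaxMatchSketch`, the flat `Set<String>` dictionary, and the `maxWordLength` parameter are all assumptions made for the example, and characters with no dictionary match are simply emitted one at a time instead of being handed to IK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of forward maximum matching; not the project's actual segmenter.
final class ForwardMaxMatchSketch {

    /** Splits text by always taking the longest dictionary word that starts at the current position. */
    static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            // Start with the longest candidate allowed and shrink it from the right
            int end = Math.min(text.length(), pos + maxWordLength);
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            // Either a dictionary word, or a single unmatched character as a fallback
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }
}
```

With a dictionary containing "ab", "abc" and "de", segmenting "abcdef" with a maximum word length of 3 yields the tokens abc, de, f: "abc" wins over "ab" because the longer match is preferred, and "f" falls through as an unmatched single character.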

Configurationdialect class: mainly responsible for instantiating the various dialect segmenters and for initializing and loading the dictionary, which is a singleton.

Landictionary class: loads the dialect dictionaries and the extension dictionaries.

Shanxisegmenter class: implements Shaanxi dialect word segmentation, using forward maximum matching against the Shaanxi dialect dictionary.

Sichuan segmenter class: implements Sichuan dialect word segmentation, using forward maximum matching against the Sichuan dialect dictionary.


Figure 2 dictionary storage structure

The dictionary stores the Mandarin tokens that can be used for dialect segmentation and conversion. In the storage structure, child elements are kept in an array, and a hash table is used once there are more than 3 of them.
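One way to picture this storage structure is a character tree whose nodes hold their children in a small array and migrate to a hash table when a fourth child arrives. The sketch below is written under that reading of the description; the class `DictNodeSketch` and all of its members are hypothetical names, not classes from the project or from IK Analyzer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical dictionary node: array for up to 3 children, hash table beyond that.
final class DictNodeSketch {
    private static final int ARRAY_LIMIT = 3; // threshold taken from the description above

    private final char ch;
    private boolean wordEnd;
    private DictNodeSketch[] childArray = new DictNodeSketch[ARRAY_LIMIT];
    private int childCount;
    private Map<Character, DictNodeSketch> childMap; // created only when the array overflows

    DictNodeSketch(char ch) {
        this.ch = ch;
    }

    /** Inserts a word character by character below this node. */
    void addWord(String word) {
        DictNodeSketch node = this;
        for (int i = 0; i < word.length(); i++) {
            node = node.childOrCreate(word.charAt(i));
        }
        node.wordEnd = true;
    }

    /** Returns true only if the exact word was added earlier. */
    boolean contains(String word) {
        DictNodeSketch node = this;
        for (int i = 0; i < word.length() && node != null; i++) {
            node = node.child(word.charAt(i));
        }
        return node != null && node.wordEnd;
    }

    boolean usesHashTable() {
        return childMap != null;
    }

    private DictNodeSketch child(char c) {
        if (childMap != null) {
            return childMap.get(c);
        }
        for (int i = 0; i < childCount; i++) {
            if (childArray[i].ch == c) {
                return childArray[i];
            }
        }
        return null;
    }

    private DictNodeSketch childOrCreate(char c) {
        DictNodeSketch existing = child(c);
        if (existing != null) {
            return existing;
        }
        DictNodeSketch created = new DictNodeSketch(c);
        if (childMap == null && childCount < ARRAY_LIMIT) {
            childArray[childCount++] = created;
        } else {
            if (childMap == null) {
                // Fourth child: move the array entries into a hash table
                childMap = new HashMap<>();
                for (int i = 0; i < childCount; i++) {
                    childMap.put(childArray[i].ch, childArray[i]);
                }
                childArray = null;
            }
            childMap.put(c, created);
        }
        return created;
    }
}
```

The design trades memory for speed: most nodes in a word tree have very few children, so a short linear scan over an array is cheap and avoids the overhead of a hash table per node, while the few densely branching nodes still get constant-time lookup.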

IK Segmenter

The existing structure and classes of the IK tokenizer are retained.

Ikanalyzerseg class: provides the IK word segmentation interface, which applies IK segmentation to words that the dialect dictionary did not match. How IK itself performs Chinese word segmentation is not described here.

Database Design

Because there are so many dialects, it is clearly impractical to keep all of them in memory. Only the Mandarin lexicon that can be used for dialect segmentation and conversion is held in memory, where it is used to segment sentences. The dialect words corresponding to each Mandarin word, the MD5 values used for similarity matching, and the keywords are stored in the database. During translation, the table for each dialect is queried to obtain the result. (The character set is utf8.)

Figure 3 structure of database dialect table

Result collector

The result collector wraps the database query interface and exposes a single external interface. Given a Mandarin sentence or word element, it returns the dialect translation, together with a priority queue of similar words for each unmatched word. The user can choose among these similar words from the database, which compensates for the limits of purely dictionary-based translation.

For example, suppose the dictionary contains only the word "mom". If a user's sentence uses "mother" or another word for mother, it expresses the same concept, but because the dictionary cannot include every such word, similar words have to be searched for. First, the concepts of the word are obtained from HowNet. The MD5 values of the first three concepts are computed and merged, and the merged MD5 value is used for a fuzzy query against the database to retrieve word elements that share the same concepts. The similarity between the query word and each of these candidates is then calculated, and a priority queue of similar words is built for each Mandarin word element that has not been translated.
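The final ranking step can be illustrated with a standard `java.util.PriorityQueue` ordered by descending similarity. This is a minimal sketch: the `Candidate` type, its fields, and the `rank` method are assumptions for the example, the similarity scores are taken as already computed, and the dialect strings in the usage note are placeholders rather than real dictionary entries.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of a similar-word priority queue; names are illustrative only.
final class SimilarWordQueueSketch {

    /** A candidate translation found through the concept-based fuzzy query. */
    static final class Candidate {
        final String mandarinWord;
        final String dialectWord;
        final double similarity;

        Candidate(String mandarinWord, String dialectWord, double similarity) {
            this.mandarinWord = mandarinWord;
            this.dialectWord = dialectWord;
            this.similarity = similarity;
        }
    }

    /** Returns the dialect words of all candidates, most similar first. */
    static List<String> rank(List<Candidate> candidates) {
        PriorityQueue<Candidate> queue = new PriorityQueue<>(
                Comparator.comparingDouble((Candidate c) -> c.similarity).reversed());
        queue.addAll(candidates);

        List<String> ordered = new ArrayList<>();
        while (!queue.isEmpty()) {
            // poll() always removes the candidate with the highest remaining similarity
            ordered.add(queue.poll().dialectWord);
        }
        return ordered;
    }
}
```

Given three candidates with similarities 0.4, 0.9 and 0.7, `rank` returns their dialect words in the order 0.9, 0.7, 0.4, so the most plausible substitute is offered to the user first.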


Figure 4 result collector sorting process

Important auxiliary classes

Original file extraction class: Extractfile class, which builds the database from the supplied dialect and Mandarin files.

Main functions:
1. Extract the dialect words and their corresponding Mandarin words.
2. Extract the Mandarin keywords.
3. Query the HowNet concept sets. The first three concept sets for each Mandarin word are extracted (or all of them if there are fewer than three), the MD5 value of each concept set is computed, and these MD5 values are combined for fuzzy queries of similar words.

Original file format: preferably dialect-Mandarin
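The MD5 step can be sketched with the standard `java.security.MessageDigest` API. The article does not say how the per-concept MD5 values are combined or how the fuzzy query is phrased, so plain concatenation and a SQL `LIKE` pattern are assumptions here, as are the class and method names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

// Hypothetical sketch: hashing concept sets and merging the hashes into one lookup key.
final class ConceptMd5Sketch {

    /** Lower-case hexadecimal MD5 of the UTF-8 bytes of the text. */
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    /** Hashes the first three concepts (or all, if fewer) and concatenates the results (assumed merge rule). */
    static String mergedKey(List<String> concepts) {
        int limit = Math.min(3, concepts.size());
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < limit; i++) {
            key.append(md5Hex(concepts.get(i)));
        }
        return key.toString();
    }

    /** LIKE pattern matching any stored key that contains this concept's MD5 (assumed query style). */
    static String likePattern(String concept) {
        return "%" + md5Hex(concept) + "%";
    }
}
```

Because each MD5 value has a fixed length of 32 hexadecimal characters, a concatenated key keeps the individual hashes recoverable, which is what makes a substring-style fuzzy match on shared concepts possible.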

Database connection helper: Connectionhelper class, which loads the database driver and establishes the connection.

The database is mysql-5.5.20-win32, accessed through JDBC.
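A connection helper of this kind typically does little more than load the driver and call `DriverManager.getConnection`. The sketch below assumes MySQL Connector/J 5.x on the classpath (driver class `com.mysql.jdbc.Driver`); the class name `ConnectionHelperSketch`, the method names, and the host, port, database name and credentials a caller would pass in are all hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical sketch of a JDBC connection helper; not the project's Connectionhelper class.
final class ConnectionHelperSketch {

    /** Builds a MySQL JDBC URL that asks the driver to use UTF-8 for character data. */
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:mysql://" + host + ":" + port + "/" + database
                + "?useUnicode=true&characterEncoding=UTF-8";
    }

    /** Loads the driver and opens a connection; requires Connector/J and a running server. */
    static Connection open(String host, int port, String database, String user, String password)
            throws SQLException, ClassNotFoundException {
        Class.forName("com.mysql.jdbc.Driver"); // driver class name used by Connector/J 5.x
        return DriverManager.getConnection(jdbcUrl(host, port, database), user, password);
    }
}
```

Setting the character encoding in the URL matters here because the dialect tables store Chinese text in utf8; without it, the driver may fall back to a different default encoding.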

Because of my limited abilities, this open-source Chinese dialect word segmentation software leaves much room for improvement in both design and implementation. Still, I believe that as the word library grows and Mandarin-to-dialect semantics are studied in more depth, it will be possible to convert Mandarin into reasonable dialect sentences. Since this is open-source software, I also hope that everyone will exchange ideas and improve this dialect converter together, making its structure more sound, its translation more accurate, and its dictionary more complete, so that we as IT people can contribute to the preservation of national cultural heritage.

I would like to thank the author of IK Analyzer, the author of the HowNet-based semantic similarity calculation, and the Beihang member who implemented an open-source word similarity library. Combining their programs and ideas is what allowed me to produce this first design and implementation of the dialect converter.

Appendix:

The software SourceForge address: https://sourceforge.net/projects/chinese-dialect/

Oschina address: http://www.oschina.net/p/chinese-dialect/edit

My email: handsomestone@gmail.com
