How to read Burmese on a computer using ABBYY OCR recognition technology

Source: Internet
Author: User

The Republic of the Union of Myanmar, formerly known as Burma, is a country in Southeast Asia. From 1962 to 2010 it was ruled by a military junta that took power in a coup. Over the last five years it has opened up to outsiders and established trade and cultural ties with other countries.

Burmese is made up of many dialects, but all of them share a core alphabet of 33 consonants and 12 auxiliary characters, which is used mainly in formal and print media. Regional dialects may use additional characters, and the complete list is about three times the size of the core alphabet. Luckily, our job was to recognize standard Burmese text set in the popular Myanmar 3 font at 10 points. The text images could be grayscale, black-and-white or color, with a resolution of at least 300 dpi. A typical Burmese text sample:

[Image: a typical Burmese text sample]

At the initial stage of the project, we had to achieve 75% OCR accuracy, with a final target accuracy of at least 94%.

The Burmese script is an alphasyllabary: each consonant letter also conveys a "default" vowel sound, and other vowels are written with special marks placed above, below, before, after, or even around the consonant.

The letters are made up mostly of circles and semicircles because, in the past, text was written on palm leaves, which were easily torn by straight incised lines.

Burmese is a tonal language with three main tones (high, low and creaky) and two minor tones (stopped and reduced).

Since the tones are also transcribed in writing, the Burmese script is rich in diacritics, which may be placed above the main letter, below it, or both above and below it. This poses a major challenge to OCR software, but not the only one.

To make things more complex, some combinations of letters can be fused together to form new characters.

In the most general terms, optical character recognition works as follows. When OCR software receives an image file, it first performs some preprocessing: it converts the image to black-and-white and corrects visible distortions. It then detects areas that contain different types of text (title, body, footer), pictures and tables. The text blocks are split into lines, the lines into words, and the words into letters. Once the individual letters have been recognized, the text is reassembled from the bottom up. Image preprocessing and layout detection for Burmese are the same as for most other languages, but detecting text lines is tricky.

Because of the abundance of diacritics, it is very difficult to teach a computer to detect lines of Burmese text. Our algorithms use many features to describe a line of text. One of them is an imaginary baseline on which all the main characters sit, and the computer needs to know where to draw this baseline in order to form reasonable hypotheses about individual characters.

The computer uses statistics to detect the baseline. To collect the necessary data, it looks at the peaks of a histogram built from the black pixels that make up the letters. On the histogram for a European script, three distinct peaks stand out, corresponding to the baseline and the height of lowercase letters:

In Burmese, however, the many diacritics that fall outside the normal bounds of the text line cause extra, statistically significant peaks in the histogram. For this reason, our original algorithm, designed for European scripts, did not correctly identify the key parameters of Burmese text lines.

In the following graphic, the program correctly detects the first two lines but fails to detect the third:

We had to adjust the text line detection algorithm so that it also works for Burmese text.
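The histogram idea can be sketched in a few lines of Python. This is only an illustration, not ABBYY's algorithm: the toy bitmaps, the helper names (`parse`, `row_profile`, `find_peaks`) and the 50% threshold are all assumptions made for the example. It counts ink pixels per row and shows how a row of marks above the line adds an extra peak.

```python
def parse(rows):
    """Turn rows of '#' (ink) and '.' (paper) into a 0/1 matrix."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


def row_profile(image):
    """Count ink pixels in every pixel row of a binary image."""
    return [sum(row) for row in image]


def find_peaks(profile, min_fraction=0.5):
    """Return indices of rows that are local maxima and reach at least
    min_fraction of the busiest row (an assumed, illustrative threshold)."""
    if not profile or max(profile) == 0:
        return []
    threshold = min_fraction * max(profile)
    peaks = []
    for y, value in enumerate(profile):
        above = profile[y - 1] if y > 0 else 0
        below = profile[y + 1] if y + 1 < len(profile) else 0
        if value >= threshold and value > above and value >= below:
            peaks.append(y)
    return peaks


# Toy "European" line: dense rows at the top of lowercase letters and at the baseline.
LATIN_LIKE = parse([
    "........",
    ".#......",
    "#######.",
    "#.#.#.#.",
    "#.#.#.#.",
    "#######.",
    "...#....",
])

# The same line with a dense row of marks drawn above it.
MARKS_ABOVE = parse([
    ".##.##.#",
    "........",
])
BURMESE_LIKE = MARKS_ABOVE + LATIN_LIKE

if __name__ == "__main__":
    print(find_peaks(row_profile(LATIN_LIKE)))    # [2, 5]
    print(find_peaks(row_profile(BURMESE_LIKE)))  # [0, 4, 7]
```

In the second case the row of marks produces a third peak that a detector tuned for two dominant rows could mistake for part of the line geometry.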

Once the text lines were detected, we started looking for the gaps between words and between letters. This time we used a horizontal histogram: large gaps are assumed to be gaps between words, and small gaps are treated as gaps between letters. Detecting gaps in Burmese text caused almost no problems, unlike Thai, which has almost no gaps. (Our OCR technology can recognize Thai and up to 200 other languages.)
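A minimal sketch of this gap classification, assuming a binarized line image and a fixed width threshold. The helper names and the threshold value are hypothetical; a real system would derive the threshold from the text itself.

```python
def parse(rows):
    """Turn rows of '#' (ink) and '.' (paper) into a 0/1 matrix."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


def column_profile(image):
    """Count ink pixels in every pixel column of a binary line image."""
    return [sum(column) for column in zip(*image)]


def find_gaps(profile):
    """Return (start, width) for each run of empty columns that lies
    between two inked columns. Leading and trailing margins are ignored."""
    gaps = []
    start = None
    seen_ink = False
    for x, value in enumerate(profile):
        if value == 0:
            if seen_ink and start is None:
                start = x
        else:
            if start is not None:
                gaps.append((start, x - start))
                start = None
            seen_ink = True
    return gaps


def classify_gaps(gaps, word_gap_min):
    """Label each gap as a word gap or a letter gap by its width."""
    return [
        (start, width, "word" if width >= word_gap_min else "letter")
        for start, width in gaps
    ]


LINE = parse([
    "##.##...#.##",
    "#..#....#.#.",
])

if __name__ == "__main__":
    gaps = find_gaps(column_profile(LINE))
    print(classify_gaps(gaps, word_gap_min=3))
    # [(2, 1, 'letter'), (5, 3, 'word'), (9, 1, 'letter')]
```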

After dividing the lines of text into smaller fragments, we try to divide the fragments into single characters. Once again we look at the peaks and troughs of the histogram. The troughs correspond to possible gaps between letters: some of them can be detected with certainty, while others have to be validated by various heuristics.

The following graphic shows the histogram for an English word:

The large number of semicircular characters in the Burmese script produces many false peaks and troughs, which makes gaps harder to detect, but the histogram method still works for Burmese.
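The distinction between cuts that are certain and cuts that need further validation might look like this in code. The sketch works directly on a column histogram; the function name `split_points` and the `low_ink` threshold are assumptions for illustration, and the heuristics that would validate the uncertain cuts are not shown because the article does not describe them.

```python
def split_points(profile, low_ink=1):
    """Split candidate columns of a fragment into two groups.

    certain:   interior columns with no ink at all
    uncertain: interior columns with very little ink that are strictly
               lower than both neighbours (a trough that may or may not
               be a boundary and needs further validation)
    """
    certain = []
    uncertain = []
    for x in range(1, len(profile) - 1):
        value = profile[x]
        if value == 0:
            certain.append(x)
        elif value <= low_ink and value < profile[x - 1] and value < profile[x + 1]:
            uncertain.append(x)
    return certain, uncertain


if __name__ == "__main__":
    # Column 2 is empty; column 4 is a shallow trough where strokes may touch.
    histogram = [3, 4, 0, 5, 1, 4, 3]
    print(split_points(histogram))  # ([2], [4])
```

Rounded shapes tend to create many shallow troughs of the second kind, which is consistent with the difficulty described above.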

Now we can try to recognize individual characters or, more precisely, letters in the sense of graphemes: the graphical shapes that represent characters. The correspondence between the two is not one-to-one. In European scripts, one letter shape may correspond to several characters (for example, uppercase "C" and lowercase "c" share the same shape), and one character may be rendered by more than one shape (for example, the character "a" looks different in different fonts).

There are no standard lists of such letter shapes, so we compile them manually, specifying all the possible characters for each shape. The shapes are then translated into characters when candidate words are generated.
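As a rough illustration of such a manually compiled list, the sketch below maps a few made-up shape labels to the characters they could stand for and expands a sequence of recognized shapes into candidate words. The table contents and names are hypothetical and use Latin characters only.

```python
from itertools import product

# Hypothetical, hand-compiled table: one recognized shape, several possible characters.
SHAPE_TO_CHARS = {
    "c-shape": ["c", "C"],
    "o-shape": ["o", "O", "0"],
    "a-shape": ["a"],
    "t-shape": ["t"],
}


def candidate_words(shapes):
    """Expand a sequence of recognized shapes into every character string
    that the table allows."""
    options = [SHAPE_TO_CHARS[shape] for shape in shapes]
    return ["".join(chars) for chars in product(*options)]


if __name__ == "__main__":
    print(candidate_words(["c-shape", "a-shape", "t-shape"]))  # ['cat', 'Cat']
    print(len(candidate_words(["c-shape", "o-shape"])))        # 6
```

A later stage would still have to choose among the candidates; how that choice is made is outside the scope of this sketch.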

As we noted earlier, the Burmese script has a large number of diacritics, many of which can merge with their main letters to form new letters:

If a diacritic is separate from its letter, we recognize the letter first, then the diacritic, and finally combine the two results into a new letter. If a diacritic and its letter form an indivisible unit, we try to recognize the unit as a whole.

Fused characters are so common in the Burmese writing system that we had to extend our technology to recognize 3,500 new letters, far more than we usually add for a new language.
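The two recognition paths can be pictured with a small sketch. It uses Latin stand-ins because the article does not list the actual Burmese shapes, and everything in it (the dictionary layout, the `FUSED_UNITS` table, the `recognize` function) is an assumption made for illustration. Separate parts are recognized individually and then composed; an indivisible unit is looked up as a whole.

```python
import unicodedata

# Hypothetical table for shapes that cannot be split into letter and mark.
FUSED_UNITS = {"ae-ligature": "\u00E6"}


def recognize(unit):
    """Return the character for one segmented unit.

    A unit is either {"base": ..., "mark": ...} when the mark is separate
    from its letter, or {"shape": ...} when the two form one solid shape.
    """
    if "shape" in unit:
        # Indivisible unit: identify it as a whole.
        return FUSED_UNITS[unit["shape"]]
    # Separate parts: letter first, then the mark, then combine the results.
    base, mark = unit["base"], unit["mark"]
    return unicodedata.normalize("NFC", base + mark)


if __name__ == "__main__":
    print(recognize({"base": "e", "mark": "\u0301"}))  # é
    print(recognize({"shape": "ae-ligature"}))         # æ
```

The lookup table is the part that grows with the number of fused shapes, which fits the scale of the extension described above.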

After letter recognition is complete, the letters must be translated into Unicode characters and then assembled into words. This process is quite simple for European languages, where each recognized character is translated to Unicode one at a time, but Burmese fused characters require special treatment.

Burmese characters must be entered in a specific sequence for Windows to join them correctly. Some characters must be entered after all the others so that Windows can place them in the correct position at the beginning of the syllable.

For example, to type the following word in a text editor:

the user must type the characters in the following order:

We added a special correction module to our technology to make sure that the resulting words comply with these typing rules. After all the text has been recognized, the module reads the recognized text again and checks that the characters are in the correct order. Burmese is a very structured language with enough formal rules to support these checks.
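The kind of reordering such a module has to perform can be sketched for two signs that are drawn to the left of their consonant but stored after it in Unicode text: MYANMAR VOWEL SIGN E (U+1031) and MEDIAL RA (U+103C). This is a simplified illustration, not the module described above. It assumes the input is in left-to-right visual order, covers only these two signs plus the medials that follow a consonant, and ignores stacked consonants.

```python
VOWEL_E = "\u1031"    # drawn to the left of its consonant, stored after it
MEDIAL_RA = "\u103C"  # wraps the consonant from the left, stored after it
MEDIALS = "\u103B\u103C\u103D\u103E"  # medial YA, RA, WA, HA
PREFIX_SIGNS = {VOWEL_E, MEDIAL_RA}


def is_consonant(ch):
    """True for the basic Myanmar consonant letters U+1000..U+1021."""
    return "\u1000" <= ch <= "\u1021"


def visual_to_logical(visual):
    """Reorder a string from visual order to Unicode storage order:
    consonant, then medials, then vowel sign E."""
    out = []
    prefix = []  # signs seen before the consonant they belong to
    i = 0
    while i < len(visual):
        ch = visual[i]
        if ch in PREFIX_SIGNS:
            prefix.append(ch)
            i += 1
        elif is_consonant(ch) and prefix:
            medials = [p for p in prefix if p == MEDIAL_RA]
            i += 1
            # Medials written after the consonant still precede vowel sign E.
            while i < len(visual) and visual[i] in MEDIALS:
                medials.append(visual[i])
                i += 1
            out.append(ch)
            out.extend(sorted(medials))
            out.extend(p for p in prefix if p == VOWEL_E)
            prefix = []
        else:
            out.append(ch)
            i += 1
    out.extend(prefix)  # leftover signs with no consonant are kept as-is
    return "".join(out)


if __name__ == "__main__":
    # Visual order: vowel sign E, medial RA, letter MA.
    word = visual_to_logical("\u1031\u103C\u1019")
    print([f"U+{ord(c):04X}" for c in word])  # ['U+1019', 'U+103C', 'U+1031']
```

Because the function assumes visual order, running it on text that is already in storage order can misplace a medial RA, so a real module would first check which order it is looking at.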

It took us 4 months to complete this project. The final recognition accuracy reached 97% (the customer required at least 94%). In the future, support for more Burmese fonts should be added.
