Research on text breaking and typographical algorithms in text editors

Last Update:2018-12-07 Source: Internet

Author: User



Research on text line-breaking and layout algorithms in text editors

Yuan Yongfu

A text editor is a very complex piece of graphical software. Many of the development techniques and software structures it involves are never used in traditional database application development, so very few people master them. The text line-breaking and layout algorithms are among the core algorithms of editor development. Without mastering them, you can only make minor changes on top of open-source software.

This article discusses the line-breaking and layout algorithms used in the editor.

Text layout is roughly divided into the following steps:

Measure the width and height of each character.
Calculate the client-area width of the document container, for example the width of the specified paper minus the left and right margins. Here the document container is not only the main body area; table cells and text boxes are document containers too.
Break lines: place the characters in the document container from left to right and top to bottom, generating text lines to implement a flow layout.
Lay out each line: position the characters within the document line, mainly to implement content alignment.
Paginate.

■ Measure character size

The first step is to calculate the width and height of each character in the document. I develop in C#, so I can call the System.Drawing.Graphics.MeasureString method to measure the width and height of a character. A document can contain tens of thousands of characters, and measuring them one by one takes a lot of time, so measurement needs a number of optimizations.

Measuring characters involves the concepts of monospaced (equal-width) fonts and proportional fonts. A monospaced font draws every character with the same width, so the letters "W" and "I" measure the same. A proportional font, such as "Times New Roman", draws characters with different widths, so "W" and "I" measure differently.

For a monospaced font, you can measure one character in advance, for example "W", and reuse that width whenever another character is encountered. For a proportional font, each character has to be measured as it is encountered.

In general, however, Chinese characters have the same width in both monospaced and proportional fonts. You can therefore measure one Chinese character in advance and reuse that width for every Chinese character encountered later.

One problem remains: deciding whether a character is a Chinese character. For that you need to refer to character set standards such as GB2312 and GBK. In general, Unicode code points 19968 to 40869 (U+4E00 to U+9FA5) are Chinese characters. As a further optimization, you can also record the full-width characters whose width equals that of a Chinese character.
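The measurement shortcuts described above can be sketched as a small width cache. This is only an illustration in Python, not the author's C# code: `WidthCache`, `is_chinese_char` and the injected `measure` callback are hypothetical names, and the callback stands in for whatever slow measurement call the platform provides.

```python
from typing import Callable, Dict, Iterable, Tuple

CJK_FIRST = 19968  # U+4E00, lower bound quoted in the text
CJK_LAST = 40869   # U+9FA5, upper bound quoted in the text


def is_chinese_char(ch: str) -> bool:
    """Code point range test only; it says nothing about what the font draws."""
    return CJK_FIRST <= ord(ch) <= CJK_LAST


class WidthCache:
    """Calls the slow measurement function as rarely as possible."""

    def __init__(self, measure: Callable[[str, str], float],
                 monospaced_fonts: Iterable[str] = ()):
        self._measure = measure  # hypothetical slow call: (font, char) -> width
        self._monospaced = set(monospaced_fonts)
        self._cache: Dict[Tuple[str, str], float] = {}

    def width(self, font: str, ch: str) -> float:
        if is_chinese_char(ch):
            key = (font, "<cjk>")    # all Chinese characters share one width
        elif font in self._monospaced:
            key = (font, "<mono>")   # every other character shares one width
        else:
            key = (font, ch)         # proportional font: one entry per character
        if key not in self._cache:
            self._cache[key] = self._measure(font, ch)
        return self._cache[key]
```

The shared keys are a simplification: symbol fonts and full-width punctuation in a monospaced font would need their own handling.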

However, deciding whether a character is Chinese from its Unicode code point alone is not reliable, because the same code point can mean different things in different fonts.

For example, the "Wingdings" font changes every character into a symbol of a specific shape, so asking whether one of its characters is Chinese makes no sense. The same is true of bar code fonts.

The safest approach is to parse the font's binary file (extension TTF or TTC) directly, obtain the glyph outline information, and look up the character width by Unicode code point. This is the most accurate and reliable method. I guess Graphics.MeasureString uses something similar internally. When the editor parses the font file and measures characters itself, it bypasses many layers of calls, so it can be very fast: tens of thousands of characters can be measured within a few dozen milliseconds.

However, parsing the font binary file takes a long time. For example, the font file simsun.ttc is 15 MB and contains outline information for 28762 characters, yet the result of the analysis is small, only 1424 bytes. You should therefore save the analysis result in a temporary file so that the font binary does not have to be parsed again next time.
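A minimal sketch of that caching step, in Python for illustration. The function name `load_width_table`, the JSON cache format and the modification-time check are assumptions added here; `parse_font` is a hypothetical callback standing in for the real (slow) font parser.

```python
import json
import os
from typing import Callable, Dict


def load_width_table(font_path: str, cache_path: str,
                     parse_font: Callable[[str], Dict[int, int]]) -> Dict[int, int]:
    """Return {code point: width}, parsing the font only when the cache misses."""
    stamp = os.path.getmtime(font_path)  # assumption: reparse if the font changed
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            cached = json.load(f)
        if cached.get("stamp") == stamp:
            # JSON object keys are strings, so convert them back to code points
            return {int(k): v for k, v in cached["widths"].items()}
    widths = parse_font(font_path)       # slow path, taken once per font file
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump({"stamp": stamp, "widths": widths}, f)
    return widths
```

On the second call with the same font file the parser callback is not invoked at all; the small cached table is read instead.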

■ Line breaking

After the character sizes are measured, the editor starts to build the layout object model in memory. It keeps appending characters to the last line of the document. If the width of the characters already in the line plus the width of the character to be added exceeds the client-area width of the document container, the line is broken and the character starts a new line.

There are also cases where the line has to be broken early. Consecutive English letters and Arabic numerals are logically closely related, and a break should not split such a word across two lines. To avoid that, the line is broken earlier.

For this, the program makes a check when it breaks a line. If the next character and the last few characters of the document line are all English letters or Arabic numerals, it traverses the last line from right to left, takes out those characters, and places them on the next line.

This operation is not absolute. A very long continuous "word", for example 100 consecutive "A" characters, has basically no practical meaning, but it is a boundary condition that must be considered and that easily causes program errors. The early break therefore has to check for this case and, if it occurs, cancel the early break.
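The greedy fill, the early break for words, and the long-word guard can be sketched together as follows. This is an illustrative Python sketch under simplifying assumptions: `break_lines`, `is_word_char` and the `width_of` callback are hypothetical names, and a "word" is taken to be a run of ASCII letters and digits.

```python
from typing import Callable, List


def is_word_char(ch: str) -> bool:
    """English letters and Arabic numerals, the runs that should stay together."""
    return ch.isascii() and ch.isalnum()


def break_lines(text: str, width_of: Callable[[str], float],
                max_width: float) -> List[str]:
    lines: List[str] = []
    line: List[str] = []
    used = 0.0
    for ch in text:
        w = width_of(ch)
        if line and used + w > max_width:
            carry: List[str] = []
            if is_word_char(ch):
                # Walk right to left over the trailing letters and digits
                k = len(line)
                while k > 0 and is_word_char(line[k - 1]):
                    k -= 1
                # k == 0: the word fills the whole line, so cancel the early break.
                # Also cancel if the carried part plus ch would not fit either.
                if k > 0 and sum(width_of(c) for c in line[k:]) + w <= max_width:
                    carry = line[k:]
                    line = line[:k]
            lines.append("".join(line))
            line = carry
            used = sum(width_of(c) for c in line)
        line.append(ch)
        used += w
    if line:
        lines.append("".join(line))
    return lines
```

With every character one unit wide and a container six units wide, `break_lines("one two three", lambda c: 1.0, 6)` keeps each word whole, while a run of ten "A" characters in a four-unit container is simply cut at the container width.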

※ Prefix and postfix punctuation

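Prefix punctuation is punctuation that cannot appear at the end of a line, for example "([{" and the currency signs "￡￥", together with opening quotation marks and the full-width forms of the brackets. Postfix punctuation is punctuation that cannot appear at the beginning of a line, for example "!),.:;?]}", "ˇ ˉ ― ‖ …", "|" and "~", together with closing quotation marks and the full-width forms.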

For example, a text line whose content is "? Zhang San, Li Si, Wang Wu [" is non-standard, because it begins with postfix punctuation and ends with prefix punctuation. This situation has to be avoided.

When a line is broken, if the last character of the document line is prefix punctuation, the line must be broken early. Likewise, if the first character after the break would be postfix punctuation, the line must also be broken early.

Line breaking also needs special handling for paragraph marks. A paragraph mark has a width of its own, but when deciding where to break a document line its width can be treated as zero.
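The punctuation rule can be sketched as a search for the nearest valid break position. This is a hypothetical Python illustration: `split_for_punctuation` is an invented name, and the two sets contain only the ASCII characters from the lists above, so a real editor would extend them with the full-width and CJK forms.

```python
from typing import Tuple

PREFIX = set("([{")          # must not end a line (ASCII subset only)
POSTFIX = set("!),.:;?]}")   # must not start a line (ASCII subset only)


def split_for_punctuation(line: str, next_char: str) -> Tuple[str, str]:
    """Return (kept, carried); carried characters move to the start of the next line."""
    # Try break positions from the natural one (end of line) toward the left
    for cut in range(len(line), 0, -1):
        first_of_next = line[cut] if cut < len(line) else next_char
        if line[cut - 1] not in PREFIX and first_of_next not in POSTFIX:
            return line[:cut], line[cut:]
    # No valid position exists: keep the original break rather than emptying the line
    return line, ""
```

For instance, `split_for_punctuation("see (", "a")` carries the opening bracket to the next line, and `split_for_punctuation("ab", ")")` carries "b" so that the closing bracket does not start a line. Falling back to the original break when nothing fits is a design choice made here, in the same spirit as cancelling the early break for an overlong word.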

In practice, I use a stack to implement line breaking. First, all the characters to be laid out are pushed onto a stack. Then a loop peeks at the character element on top of the stack and tries to add it to the current document line. If the remaining space in the line is enough for the new character, the character is added to the line and popped from the stack. If the remaining space is not enough, the character is not popped; a new document line is created and the loop starts again. If the line has to be broken early, several character elements are removed from the current line and pushed back onto the stack for the next iteration.

When the stack is empty, the loop exits and line breaking for the document is complete.
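The stack loop described above might look like the following Python sketch. The names are hypothetical: `break_with_stack` is invented for illustration, and `carry_count` is an assumed callback that returns how many trailing characters an early break should push back (0 means no early break).

```python
from typing import Callable, List


def break_with_stack(text: str, width_of: Callable[[str], float], max_width: float,
                     carry_count: Callable[[List[str], str], int] = lambda line, nxt: 0
                     ) -> List[str]:
    stack = list(reversed(text))   # top of the stack is the next character to place
    lines: List[str] = []
    line: List[str] = []
    used = 0.0
    while stack:
        ch = stack[-1]             # peek, do not pop yet
        w = width_of(ch)
        if not line or used + w <= max_width:
            line.append(stack.pop())   # it fits (or the line is empty): pop and place
            used += w
            continue
        # It does not fit: leave ch on the stack. For an early break, push the
        # trailing characters back, always keeping at least one on the line.
        n = min(carry_count(line, ch), len(line) - 1)
        for _ in range(n):
            stack.append(line.pop())
        lines.append("".join(line))
        line, used = [], 0.0       # start a new document line
    if line:
        lines.append("".join(line))
    return lines
```

Because every finished line keeps at least one character, the loop always makes progress and ends when the stack is empty. A character wider than the container is still placed alone on its own line.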

※ Stop lines

Users enter characters frequently while editing, which makes the program re-run the layout frequently. When the document is large, for example tens of thousands of characters, laying out the whole document and redrawing the user interface can take several hundred milliseconds, so the editor responds sluggishly as the user types.

For this reason, when text is edited or entered, only part of the document content should be laid out again. The layout process uses a technique to reduce this workload, which I call the stop line technique.

Before laying out, back up the document line information of the document container. Each time a line break completes, a new document line is formed. Traverse the backed-up lines and compare each old line with the new line, mainly checking whether the document elements in the two lines are exactly the same, along with a few other conditions. When an old line and the new line have the same content, the old line is called the stop line, and line breaking terminates early. The new document lines then go through in-line layout and are combined with the remaining old document lines to form the new document layout. This greatly reduces the layout workload at run time.
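A rough sketch of the stop line idea, in Python and over plain strings. Everything here is an assumption made for illustration: `take_line` is a toy line breaker (a fixed number of characters, ending early at a newline), `first_dirty` is the index of the first old line touched by the edit, and `edit_end` is the offset in the new text after which nothing changed. Lines are matched by content and by the number of characters remaining after them, which plays the role of comparing document elements.

```python
from typing import Dict, List, Tuple


def take_line(text: str, pos: int, max_chars: int) -> str:
    """Toy line breaker: up to max_chars characters, ending early after a newline."""
    end = min(pos + max_chars, len(text))
    nl = text.find("\n", pos, end)
    return text[pos:nl + 1 if nl != -1 else end]


def break_all(text: str, max_chars: int) -> List[str]:
    lines, pos = [], 0
    while pos < len(text):
        lines.append(take_line(text, pos, max_chars))
        pos += len(lines[-1])
    return lines


def relayout(old_lines: List[str], new_text: str, first_dirty: int,
             edit_end: int, max_chars: int) -> Tuple[List[str], int]:
    """Re-break from the first dirty line; return (lines, number of old lines reused)."""
    lines = list(old_lines[:first_dirty])   # lines before the edit are untouched
    pos = sum(len(l) for l in lines)
    # Characters remaining after each old line -> index of that old line
    remaining_after: Dict[int, int] = {}
    rest = sum(len(l) for l in old_lines)
    for i, l in enumerate(old_lines):
        rest -= len(l)
        remaining_after[rest] = i
    while pos < len(new_text):
        start = pos
        line = take_line(new_text, pos, max_chars)
        pos += len(line)
        lines.append(line)
        i = remaining_after.get(len(new_text) - pos)
        # Stop line: starts after the edit, same content and same position from the end
        if start >= edit_end and i is not None and old_lines[i] == line:
            tail = old_lines[i + 1:]
            return lines + tail, len(tail)
    return lines, 0
```

Inserting one character into the first line of "aaaabb\nccccdddd" (four characters per line) re-breaks only until the line "cccc" matches its old counterpart; the last old line is reused as is, and the result equals a full re-break.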

■ In-line layout

After the text has been broken into lines, the text within each line needs to be laid out.

The total width of the characters in a document line is unlikely to equal the client-area width of the document container exactly; there will be a gap between the two.

Chinese characters and English characters have different widths, and the widths of English letters, digits, and other characters vary with the font. The total character width therefore differs from line to line, which leaves the right edge of the document ragged and seriously affects the appearance.

To fix this, the document line has to be stretched to the client-area width of the document container, which creates extra blank space that must be shared out among the characters. The distribution is roughly even, but not completely even; it follows a specific algorithm.

Within a line, characters are not independent of one another; they are logically divided into groups. Each Chinese character or punctuation mark forms a group of its own. Consecutive English letters and Arabic numerals logically belong to the same group and together form a complete word, so the characters within a group must stay tightly together and cannot be pulled apart.

To share out the extra space produced by justified alignment, first group the characters in the document line, then distribute the extra space evenly across the character groups.

For example, for the text "dcwriter electronic medical records Text Editor .", It is grouped into "[dcwriter] [electricity] [sub] [Disease] [calendar] [text] [Book] [editing] [series] [machine] [.]", One of the braces is a group of characters, which is divided into 11 groups. If the extra white space width is 20 units, You Need To evenly spread the white space to these character groups, the last group is not apportioned, therefore, the first 10 groups are allocated 20 Gb/s (11-1) = the blank width of 2 units. Insert the 10 2-unit blank width into the character group during the formatting, so that the width of the document line can be stretched to the same width as the customer area width of the document container.

■ Pagination

In essence, pagination calculates the positions of the page-break lines. The process is as follows:

First, calculate the standard page height, that is, the paper height minus the top and bottom margins. The correction for the header and footer also has to be taken into account.
Set the position of the current page-break line to the position of the previous page-break line plus the standard page height.
Traverse the document lines. If the page-break line falls in the middle of a document line, that line of text would be split across two pages. In that case move the page-break line up, to a position between the upper edge of the current document line and the lower edge of the previous one.
Repeat until the total height of all document pages is greater than or equal to the height of the document content.

Pagination also has to handle many boundary conditions. For example, when a document line is very tall, say because it contains a very tall image, and the line is taller than the standard page height, the page-break line cannot be moved freely.
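The pagination steps and the tall-line boundary condition can be sketched as follows. This Python illustration makes several assumptions: `page_breaks` is an invented name, lines are stacked with no gap between them, `page_height` is the standard page height with margins and header/footer corrections already applied, and a line taller than a page is simply cut at full page heights.

```python
from typing import List


def page_breaks(line_heights: List[float], page_height: float) -> List[float]:
    """Return the y position of each page-break line, measured from the document top."""
    breaks: List[float] = []
    page_top = 0.0
    y = 0.0                       # upper edge of the current document line
    for h in line_heights:
        bottom = y + h
        if bottom > page_top + page_height and y > page_top:
            # The break would fall inside this line: move it up to the line's upper edge
            breaks.append(y)
            page_top = y
        while bottom > page_top + page_height:
            # The line starts the page and is still too tall, so the break cannot
            # be moved up any further; cut at the full page height instead
            page_top += page_height
            breaks.append(page_top)
        y = bottom
    return breaks
```

Seven lines of height 30 on a page of height 100 give breaks at 90 and 180, so no line is split. A line of height 250 following a line of height 30 forces breaks at 30, 130 and 230, because that line cannot fit on any single page.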

In addition, when the document contains a table, you need to descend into the table cells to correct the page-break position. This is a recursive operation.

Electronic medical record applications also need a continued-printing function. In my implementation, the continued-printing position is actually a special page-break line, which ensures that a line of text is never cut in two during continued printing.

The text line-breaking and layout algorithms are very complex. Even after repeated refactoring and optimization, I still needed more than 10 thousand lines of C# code to implement this functionality, and many places remain to be optimized.

Some people think that C# cannot produce high-performance programs and that an editor should be developed in C++. After practice, I believe the claim that C# performs poorly misses the point: the key is the algorithm. A C# program is slow to start, but it can still achieve high performance once it is running.

For the editor software, visit www.dcwriter.cn.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
