Fast 16-Color Conversion Algorithm

File: fast16c.txt
Name: Fast 16-Color Conversion Algorithm
Author: zyl910
Blog: http://blog.csdn.net/zyl910/
Version: V1.0
Updated: 2006-11-29

I. Problem Description

For 16-color (4-bit) images, VGA stores pixels in bit planes, while a DIB stores them linearly (packed). Either way, accessing a single pixel requires awkward bit splitting, which makes efficient code hard to write in this color mode. Converting between the two layouts is worse still: it needs extensive bit-level splitting and shuffling that is difficult to implement efficiently. This article looks at efficient algorithms for that 16-color conversion.

For ease of explanation, we call eight consecutive pixels (from left to right) a, b, c, d, e, f, g, and h. Each bit of a pixel is identified by a digit; for example, a0 is bit D0 of pixel a (the number on the left is the pixel index):
Pixel 0: a3a2a1a0
Pixel 1: b3b2b1b0
Pixel 2: c3c2c1c0
Pixel 3: d3d2d1d0
Pixel 4: e3e2e1e0
Pixel 5: f3f2f1f0
Pixel 6: g3g2g1g0
Pixel 7: h3h2h1h0

VGA 16-color mode uses four bit planes. The four bits of one pixel are stored in four different planes, and each byte in a plane holds one bit from each of 8 pixels:
[VGA 16 colors]
Pixel:   0  1  2  3  4  5  6  7
Bit:     7  6  5  4  3  2  1  0
--------------------------------
Plane 0: a0 b0 c0 d0 e0 f0 g0 h0
Plane 1: a1 b1 c1 d1 e1 f1 g1 h1
Plane 2: a2 b2 c2 d2 e2 f2 g2 h2
Plane 3: a3 b3 c3 d3 e3 f3 g3 h3

The DIB 16-color format is linear. Because one pixel is 4 bits, one byte stores 2 pixels, with the left pixel in the high nibble:
[DIB 16 colors]
<--- byte 0 ---> <--- byte 1 ---> <--- byte 2 ---> <--- byte 3 --->
a3a2a1a0b3b2b1b0 c3c2c1c0d3d2d1d0 e3e2e1e0f3f2f1f0 g3g2g1g0h3h2h1h0
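
To make the two layouts concrete, here is a minimal C sketch that packs the same eight pixel values both ways. The function names pack_dib8 and pack_vga8 are made up for this illustration and do not come from the original code.

```c
#include <stdint.h>

/* Pack 8 pixels (values 0..15, px[0] = leftmost) into 4 DIB bytes:
   two pixels per byte, left pixel in the high nibble. */
void pack_dib8(const uint8_t px[8], uint8_t dib[4])
{
    for (int i = 0; i < 4; i++)
        dib[i] = (uint8_t)(((px[2 * i] & 0x0F) << 4) | (px[2 * i + 1] & 0x0F));
}

/* Pack the same 8 pixels into one byte per bit plane:
   plane p collects bit p of every pixel, leftmost pixel in bit 7. */
void pack_vga8(const uint8_t px[8], uint8_t plane[4])
{
    for (int p = 0; p < 4; p++) {
        uint8_t b = 0;
        for (int i = 0; i < 8; i++)
            b |= (uint8_t)(((px[i] >> p) & 1) << (7 - i));
        plane[p] = b;
    }
}
```

For the pixel values 1 through 8, the DIB bytes are 0x12 0x34 0x56 0x78, while planes 0 to 3 hold 0xAA, 0x66, 0x1E and 0x01.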

On VGA, switching bit planes requires slow I/O port operations. Therefore the bitmap data for an entire scan line is first converted into four separate plane buffers, and each buffer is then copied to its plane with string instructions. In other words, once the four plane bits of a pixel have been separated they cannot be output directly; they must be written to different buffers. We also need to assemble the separated bits into bytes.

For simplicity, we ignore widths that are not a multiple of 8 pixels, and all data is 32-bit aligned. The image size is fixed at 640*480, so every scan line is 640 pixels long.
Because we access VGA video memory directly, the code cannot run under Windows or other 32-bit protected-mode environments, so a 16-bit algorithm is preferable.

Conventions:

#define SCR_W 640
#define SCR_H 480

#define SCR_PLANES 4

#define SCANSIZE_DIB ((SCR_W)/2)
#define SCANSIZE_VGA ((SCR_W)/8)

BYTE byVga[SCR_PLANES][SCANSIZE_VGA];
BYTE byDib[SCANSIZE_DIB];

Since we rarely need to read bitmap data back from the screen and mostly need to draw bitmaps onto it, we focus on the DIB-to-VGA conversion.

 

II. Pixel-by-Pixel Algorithm

The idea of this algorithm is very simple: each step writes the four bits of one pixel to the four bit planes:
x = BYTE();
byVga[0][iCurByte] |= (x & 1) << iCurBit;
x = x >> 1;
byVga[1][iCurByte] |= (x & 1) << iCurBit;
x = x >> 1;
byVga[2][iCurByte] |= (x & 1) << iCurBit;
x = x >> 1;
byVga[3][iCurByte] |= (x & 1) << iCurBit;

Because the leftmost pixel is in the upper 4 bits, the actual conversion code looks like this:
x = BYTE();
byVga[3][iCurByte] |= (x & 0x80) >> iCurBit;
x = x << 1;
byVga[2][iCurByte] |= (x & 0x80) >> iCurBit;
x = x << 1;
byVga[1][iCurByte] |= (x & 0x80) >> iCurBit;
x = x << 1;
byVga[0][iCurByte] |= (x & 0x80) >> iCurBit;
(Note that the iCurBit variable has a different meaning here: it is the right-shift distance from bit 7, not the target bit number.)
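
As a rough, self-contained illustration of the pixel-by-pixel idea, the following C sketch converts one whole scan line. The function name d2v_pixel_line and the loop structure are assumptions made for this example; the author's actual d2v_pixel routine is not shown in the article.

```c
#include <stdint.h>
#include <string.h>

#define SCR_W        640
#define SCR_PLANES   4
#define SCANSIZE_DIB (SCR_W / 2)
#define SCANSIZE_VGA (SCR_W / 8)

/* Convert one DIB scan line into four plane buffers, one pixel at a time. */
void d2v_pixel_line(const uint8_t dib[SCANSIZE_DIB],
                    uint8_t vga[SCR_PLANES][SCANSIZE_VGA])
{
    memset(vga, 0, SCR_PLANES * SCANSIZE_VGA);
    for (int px = 0; px < SCR_W; px++) {
        /* even pixels sit in the high nibble, odd pixels in the low nibble */
        uint8_t v = (px & 1) ? (uint8_t)(dib[px / 2] & 0x0F)
                             : (uint8_t)(dib[px / 2] >> 4);
        int byte_idx = px / 8;
        int bit_idx  = 7 - (px % 8);          /* leftmost pixel -> bit 7 */
        for (int p = 0; p < SCR_PLANES; p++)  /* touches all four planes */
            vga[p][byte_idx] |= (uint8_t)(((v >> p) & 1) << bit_idx);
    }
}
```

Every pixel triggers four read-modify-write accesses to four different buffers, which is the access pattern the faster algorithms avoid.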

 

This algorithm writes to all four bit planes for every pixel, which complicates address calculation. It also makes poor use of the cache, so the code runs slowly.

 

III. Bit-by-Bit Plane Algorithm

Since accessing all four planes at once is too inefficient, can we process just one plane at a time?
One output byte holds 8 bits, one per pixel, and 8 pixels at 4 bits each occupy 32 bits of DIB data. So this algorithm gathers the 8 scattered bits of a 32-bit word into a single byte:

x = DWORD() & 0x11111111;         // 000g 000h 000e 000f 000c 000d 000a 000b
x = bswap(x);                     // 000a 000b 000c 000d 000e 000f 000g 000h
x = (x | (x >> 3)) & 0x03030303;  // 0000 00ab 0000 00cd 0000 00ef 0000 00gh
x = (x | (x >> 6)) & 0x000f000f;  // 0000 0000 0000 abcd 0000 0000 0000 efgh
x = (BYTE)(x | (x >> 12));        // 0000 0000 0000 abcd 0000 0000 abcd efgh

Careful analysis shows that the "& 0x000f000f" operation is not required, because the stray bits it would clear never reach the low byte:
x = DWORD() & 0x11111111;         // 000g 000h 000e 000f 000c 000d 000a 000b
x = bswap(x);                     // 000a 000b 000c 000d 000e 000f 000g 000h
x = (x | (x >> 3)) & 0x03030303;  // 0000 00ab 0000 00cd 0000 00ef 0000 00gh
x = (x | (x >> 6));               // 0000 00ab 0000 abcd 0000 cdef 0000 efgh
x = (BYTE)(x | (x >> 12));        // 0000 00ab 0000 abcd 00ab cdef abcd efgh

The corresponding assembly code is:
; x = DWORD() & 0x11111111;         // 000g 000h 000e 000f 000c 000d 000a 000b
; mov eax, [si]
; and eax, 11111111h
; x = bswap(x);                     // 000a 000b 000c 000d 000e 000f 000g 000h
bswap eax
; x = (x | (x >> 3)) & 0x03030303;  // 0000 00ab 0000 00cd 0000 00ef 0000 00gh
mov edx, eax
shr edx, 3
or eax, edx
and eax, 03030303h
; x = (x | (x >> 6));               // 0000 00ab 0000 abcd 0000 cdef 0000 efgh
mov edx, eax
shr edx, 6
or eax, edx
; x = (BYTE)(x | (x >> 12));        // 0000 00ab 0000 abcd 00ab cdef abcd efgh
mov edx, eax
shr edx, 12
or eax, edx
; mov [di], al
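
The same steps can be sketched in portable C. This is only an illustration under a few assumptions: bswap32 is a hand-written stand-in for the x86 BSWAP instruction, load_le32 emulates the little-endian DWORD load, and the plane parameter (a right shift applied before masking, as the double-plane listing does with shr and cl) is added here so that one function can serve all four planes. The names are hypothetical.

```c
#include <stdint.h>

/* Portable byte swap standing in for the x86 BSWAP instruction. */
static uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Assemble a 32-bit value the way a little-endian load of 4 DIB bytes would. */
static uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Gather bit `plane` (0..3) of 8 pixels into one byte, leftmost pixel in bit 7. */
uint8_t d2v_plane_byte(const uint8_t dib[4], unsigned plane)
{
    uint32_t x = (load_le32(dib) >> plane) & 0x11111111u; /* one bit per nibble */
    x = bswap32(x);                     /* put pixels in left-to-right order    */
    x = (x | (x >> 3)) & 0x03030303u;   /* pair up bits: ab, cd, ef, gh         */
    x = x | (x >> 6);                   /* form abcd and efgh (plus stray bits) */
    return (uint8_t)(x | (x >> 12));    /* low byte is abcd efgh                */
}
```

Calling d2v_plane_byte four times per group of 8 pixels, once per plane, fills one byte in each plane buffer.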

 

IV. Double Bit-by-Bit Plane Algorithm

Looking at the bit-by-bit plane algorithm, you will notice that it uses only two registers. x86 has eight general-purpose registers; ESP and EBP are used for stack operations, while ESI and EDI usually hold memory addresses. That leaves EAX, EBX, ECX, and EDX, which is enough to process two values at the same time.
The bit-by-bit plane algorithm has strong data dependencies: each instruction needs the result of the previous one. Computing two independent values at once interleaves two unrelated dependency chains, which lets the program run faster on a superscalar processor.

; x = DWORD();                      // 000g 000h 000e 000f 000c 000d 000a 000b
; push ecx
; mov eax, [esi]
; mov cl, IP
; mov ebx, [esi + 4]
; shr eax, cl
; shr ebx, cl
; and eax, 11111111h
; and ebx, 11111111h
; x = bswap(x);                     // 000a 000b 000c 000d 000e 000f 000g 000h
bswap eax
bswap ebx
; x = (x | (x >> 3)) & 0x03030303;  // 0000 00ab 0000 00cd 0000 00ef 0000 00gh
mov edx, eax
mov ecx, ebx
shr edx, 3
shr ecx, 3
or eax, edx
or ebx, ecx
and eax, 03030303h
and ebx, 03030303h
; x = (x | (x >> 6));               // 0000 00ab 0000 abcd 0000 cdef 0000 efgh
mov edx, eax
mov ecx, ebx
shr edx, 6
shr ecx, 6
or eax, edx
or ebx, ecx
; x = (BYTE)(x | (x >> 12));        // 0000 00ab 0000 abcd 00ab cdef abcd efgh
mov edx, eax
mov ecx, ebx
shr edx, 12
shr ecx, 12
or al, dl
or bl, cl
mov ah, bl
; mov [edi], ax
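
In C, the interleaving might be sketched as follows. The function name d2v_dplane_pair and the helpers are hypothetical, and whether a given compiler and CPU actually overlap the two chains is not guaranteed; the sketch only shows that chains a and b never depend on each other.

```c
#include <stdint.h>

/* Portable stand-in for the x86 BSWAP instruction. */
static uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Emulates a little-endian 32-bit load from 4 DIB bytes. */
static uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Convert 16 pixels (8 DIB bytes) of one plane into two plane bytes.
   Each statement pair works on two independent values. */
void d2v_dplane_pair(const uint8_t dib[8], unsigned plane, uint8_t out[2])
{
    uint32_t a = (load_le32(dib)     >> plane) & 0x11111111u;
    uint32_t b = (load_le32(dib + 4) >> plane) & 0x11111111u;
    a = bswap32(a);                    b = bswap32(b);
    a = (a | (a >> 3)) & 0x03030303u;  b = (b | (b >> 3)) & 0x03030303u;
    a = a | (a >> 6);                  b = b | (b >> 6);
    out[0] = (uint8_t)(a | (a >> 12)); /* pixels 0..7  */
    out[1] = (uint8_t)(b | (b >> 12)); /* pixels 8..15 */
}
```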

V. Other Algorithms

5.1 32-bit Algorithm without bswap

x = DWORD() & 0x11111111;         // 000g 000h 000e 000f 000c 000d 000a 000b
x = (x | (x >> 3)) & 0x03030303;  // 0000 00gh 0000 00ef 0000 00cd 0000 00ab
x = (x | (x >> 14)) & 0x00000f0f; // 0000 0000 0000 0000 0000 ghcd 0000 efab
x = x | (x >> 4);                 // ---- ghcd efab
// Exchange gh and ab
t = (x ^ (x >> 6)) & 0x03;        // ---- 0000 00xx, where xx = gh xor ab
x = x ^ t ^ (t << 6);             // ---- abcd efgh, since ab xor xx = ab xor (gh xor ab) = gh and gh xor xx = gh xor (gh xor ab) = ab
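
A minimal C rendering of this variant is shown below. The function name d2v_plane_noswap and the plane parameter are assumptions added for illustration; the argument is expected to be the 32-bit value obtained by a little-endian load of 4 DIB bytes.

```c
#include <stdint.h>

/* Gather bit `plane` of 8 pixels into one byte without a byte swap.
   dib_le: 4 DIB bytes as loaded by a little-endian 32-bit read. */
uint8_t d2v_plane_noswap(uint32_t dib_le, unsigned plane)
{
    uint32_t x = (dib_le >> plane) & 0x11111111u;
    x = (x | (x >> 3)) & 0x03030303u;    /* gh, ef, cd, ab in separate bytes */
    x = (x | (x >> 14)) & 0x00000F0Fu;   /* ghcd and efab                    */
    x = x | (x >> 4);                    /* low byte: ghcd efab              */
    uint32_t t = (x ^ (x >> 6)) & 0x03u; /* t = gh xor ab                    */
    x = x ^ t ^ (t << 6);                /* swap gh and ab: abcd efgh        */
    return (uint8_t)x;
}
```

The final three lines are the classic XOR swap of two bit fields: XORing both fields with their difference exchanges them without a temporary copy of either field.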

5.2 16-bit Algorithm

16-bit version:
t = HIWORD() & 0x1111;            // 000g 000h 000e 000f
x = LOWORD() & 0x1111;            // 000c 000d 000a 000b
x = (t << 2) | x;                 // 0g0c 0h0d 0e0a 0f0b
x = (x | (x >> 3)) & 0x0f0f;      // 0000 ghcd 0000 efab
x = x | (x >> 4);                 // ---- ghcd efab
// Exchange gh and ab
t = (x ^ (x >> 6)) & 0x03;        // ---- 0000 00xx, where xx = gh xor ab
x = x ^ t ^ (t << 6);             // ---- abcd efgh, since ab xor xx = ab xor (gh xor ab) = gh and gh xor xx = gh xor (gh xor ab) = ab

The corresponding assembly code is:
; t = HIWORD();                     // 000g 000h 000e 000f
; mov dx, [si + 2]
and dx, 1111h
; x = LOWORD();                     // 000c 000d 000a 000b
; mov ax, [si]
; and ax, 1111h
; x = (t << 2) | x;                 // 0g0c 0h0d 0e0a 0f0b
shl dx, 2
or ax, dx
; x = (x | (x >> 3)) & 0x0f0f;      // 0000 ghcd 0000 efab
mov dx, ax
shr dx, 3
or ax, dx
and ax, 0f0fh
; x = x | (x >> 4);                 // ---- ghcd efab
mov dx, ax
shr dx, 4
or al, dl
; // Exchange gh and ab
; t = (x ^ (x >> 6)) & 0x03;        // ---- 0000 00xx, where xx = gh xor ab
mov dl, al
shr dl, 6
xor dl, al
and dl, 03h
; x = x ^ t ^ (t << 6);             // ---- abcd efgh
xor al, dl
shl dl, 6
xor al, dl
; mov [di], al
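
For comparison, here is a C sketch of the 16-bit variant that uses only 16-bit intermediate values. The name d2v_plane16_byte, the split into lo and hi arguments, and the plane parameter are assumptions made for this example.

```c
#include <stdint.h>

/* lo: DIB bytes 0..1 as a little-endian 16-bit word (pixels a..d),
   hi: DIB bytes 2..3 as a little-endian 16-bit word (pixels e..h). */
uint8_t d2v_plane16_byte(uint16_t lo, uint16_t hi, unsigned plane)
{
    uint16_t t = (uint16_t)((hi >> plane) & 0x1111u);  /* 000g 000h 000e 000f */
    uint16_t x = (uint16_t)((lo >> plane) & 0x1111u);  /* 000c 000d 000a 000b */
    x = (uint16_t)((t << 2) | x);                      /* 0g0c 0h0d 0e0a 0f0b */
    x = (uint16_t)((x | (x >> 3)) & 0x0F0Fu);          /* 0000 ghcd 0000 efab */
    x = (uint16_t)(x | (x >> 4));                      /* low byte: ghcd efab */
    t = (uint16_t)((x ^ (x >> 6)) & 0x03u);            /* t = gh xor ab       */
    x = (uint16_t)(x ^ t ^ (t << 6));                  /* low byte: abcd efgh */
    return (uint8_t)x;
}
```

Merging the two halves with a shift by 2 interleaves their bits, so a single shift-and-mask step can then pack both halves at once.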

 

5.3 Bit Matrix Transpose Algorithm

Looking back at the DIB and VGA 16-color storage layouts, we find that the conversion resembles a matrix transpose, which would let us process all four bit planes at once. Suppose there were a computer with a bit-matrix transpose instruction; let us imagine how to code the conversion on it.
Because a 4*8 matrix is not regular enough, we use an 8*8 matrix, which fits exactly in a 64-bit register.

The source data is a DIB bitmap, loaded into a 64-bit register:
A3 A2 A1 A0 B3 B2 B1 B0
C3 C2 C1 C0 D3 D2 D1 D0
E3 E2 E1 E0 F3 F2 F1 F0
G3 G2 G1 G0 H3 H2 H1 H0
I3 I2 I1 I0 J3 J2 J1 J0
K3 K2 K1 K0 L3 L2 L1 L0
M3 M2 M1 M0 N3 N2 N1 N0
O3 O2 O1 O0 P3 P2 P1 P0

Shuffle at 4-bit granularity:
A3 A2 A1 A0 I3 I2 I1 I0
B3 B2 B1 B0 J3 J2 J1 J0
C3 C2 C1 C0 K3 K2 K1 K0
D3 D2 D1 D0 L3 L2 L1 L0
E3 E2 E1 E0 M3 M2 M1 M0
F3 F2 F1 F0 N3 N2 N1 N0
G3 G2 G1 G0 O3 O2 O1 O0
H3 H2 H1 H0 P3 P2 P1 P0

Bit matrix transpose:
A3 B3 C3 D3 E3 F3 G3 H3
A2 B2 C2 D2 E2 F2 G2 H2
A1 B1 C1 D1 E1 F1 G1 H1
A0 B0 C0 D0 E0 F0 G0 H0
I3 J3 K3 L3 M3 N3 O3 P3
I2 J2 K2 L2 M2 N2 O2 P2
I1 J1 K1 L1 M1 N1 O1 P1
I0 J0 K0 L0 M0 N0 O0 P0

8-bit outer shuffle:
A3 B3 C3 D3 E3 F3 G3 H3
I3 J3 K3 L3 M3 N3 O3 P3
A2 B2 C2 D2 E2 F2 G2 H2
I2 J2 K2 L2 M2 N2 O2 P2
A1 B1 C1 D1 E1 F1 G1 H1
I1 J1 K1 L1 M1 N1 O1 P1
A0 B0 C0 D0 E0 F0 G0 H0
I0 J0 K0 L0 M0 N0 O0 P0
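
Since no such instruction is assumed to exist on real hardware here, the three steps can only be emulated. The C sketch below does so with plain loops, purely to check the data flow shown in the matrices above; it is not meant to be fast. All function names are hypothetical, and the 64-bit value is assumed to hold matrix row 0 in its most significant byte, with the leftmost column in bit 7 of each byte.

```c
#include <stdint.h>

/* 4-bit shuffle: nibbles n0..n15 (n0 most significant)
   become n0 n8 n1 n9 ... n7 n15. */
static uint64_t shuffle_nibbles(uint64_t x)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t first  = (x >> (60 - 4 * i)) & 0xF;   /* n[i]     */
        uint64_t second = (x >> (28 - 4 * i)) & 0xF;   /* n[i + 8] */
        r |= first  << (60 - 8 * i);
        r |= second << (56 - 8 * i);
    }
    return r;
}

/* 8x8 bit-matrix transpose: element (row, col) moves to (col, row). */
static uint64_t transpose8x8(uint64_t x)
{
    uint64_t r = 0;
    for (int row = 0; row < 8; row++)
        for (int col = 0; col < 8; col++) {
            uint64_t bit = (x >> (63 - (8 * row + col))) & 1;
            r |= bit << (63 - (8 * col + row));
        }
    return r;
}

/* 8-bit outer shuffle: bytes b0..b7 become b0 b4 b1 b5 b2 b6 b3 b7. */
static uint64_t shuffle_bytes(uint64_t x)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t first  = (x >> (56 - 8 * i)) & 0xFF;  /* b[i]     */
        uint64_t second = (x >> (24 - 8 * i)) & 0xFF;  /* b[i + 4] */
        r |= first  << (56 - 16 * i);
        r |= second << (48 - 16 * i);
    }
    return r;
}

/* 16 pixels in (8 DIB bytes, first byte most significant).
   Result bytes, most significant first:
   plane 3 byte 0, plane 3 byte 1, plane 2 byte 0, ..., plane 0 byte 1. */
uint64_t d2v_transpose16(uint64_t dib_be)
{
    return shuffle_bytes(transpose8x8(shuffle_nibbles(dib_be)));
}
```

One call converts 16 pixels for all four planes, yielding two bytes per plane.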

 

Test Results
~~~~~~~~

DOS version: compiled with Borland C++ 3.1 for DOS
VC version: compiled with Microsoft Visual C++ 6.0

<1> AMD Athlon XP 1700+ (actual frequency: 1463 MHz (11x133))

DOS version:
[Real DOS mode]
d2v_pixel: 113.5238
d2v_plane16: 178.1790
d2v_plane: 156.0575
d2v_dplane: 624.9337
[Win98]
d2v_pixel: 112.7193
d2v_plane16: 176.9724
d2v_plane: 155.0519
d2v_dplane: 620.9116
[WinXP]
d2v_pixel: 113.2221
d2v_plane16: 177.2740
d2v_plane: 155.4541
d2v_dplane: 623.2243

VC version:
[Win98]
d2v_pixel: 283.6433
d2v_plane: 605.9394
d2v_planeasm: 684.7000
d2v_dplane: 734.9000
d2v_plane16: 493.4000
[WinXP]
d2v_pixel: 296.2000
d2v_plane: 606.5000
d2v_planeasm: 689.9000
d2v_dplane: 737.4000
d2v_plane16: 493.1000

 

<2> Intel Celeron-S, 1000 MHz (10x100)

DOS version:
[WinXP]
d2v_pixel: 41.4276
d2v_plane16: 133.4331
d2v_plane: 114.2276
d2v_dplane: 320.9635

VC version:
[WinXP]
d2v_pixel: 187.6250
d2v_plane: 355.2224
d2v_planeasm: 378.1487
d2v_dplane: 350.7597
d2v_plane16: 164.0180

The double bit-by-bit plane algorithm (d2v_dplane) performs very well, especially under DOS, where it is much faster than the other methods. Under Windows it is less outstanding and is sometimes slower than the basic bit-by-bit plane algorithm. The reason may be that a modern 32-bit compiler generates better code for modern CPUs, whereas BC 3.1 is an outdated 16-bit compiler. However, since the purpose of the DIB-to-VGA algorithm is fast VGA drawing, I have decided to use the double bit-by-bit plane algorithm.

 

References
~~~~~~~~
[1] Henry S. Warren, Jr., Hacker's Delight, Chinese edition translated by Feng de, China Machine Press, May 2004.
 
 
