Abstract: First, a basic image scaling algorithm is given; then its speed and scaling quality are optimized step by step.
The full text of "High-quality, fast image scaling" is divided into three parts:
Part 1: nearest-neighbor sampling interpolation and its speed optimization
Part 2: bilinear interpolation and bicubic convolution interpolation
Part 3: trilinear interpolation and MipMap chains
Body:
To keep the discussion simple, only 32-bit ARGB color is handled here;
The code is written in C++; where assembly optimization is involved, the x86 platform is assumed; the compiler used is VC2005;
For readability, the code contains no exception handling;
The CPUs used for testing are an AMD64x2 4200+ (2.37G) and an Intel Core2 4400 (2.00G);
Speed test notes:
Only scaling from memory data to memory data is tested
The test scales an 800*600 picture to 1024*768; fps is the number of frames per second, and the higher the value, the faster the function
////////////////////////////////////////////////////////////////////////////////
Reference speeds of the related Windows GDI functions:
//==============================================================================
BitBlt      544.7 fps  // copies 800*600 to 800*600
BitBlt      331.6 fps  // copies 1024*1024 to 1024*1024
StretchBlt  232.7 fps  // scales 800*600 to 1024*1024
////////////////////////////////////////////////////////////////////////////////
A: First define the image data structure:
#define asm __asm
typedef unsigned char TUInt8; // [0..255]
struct TARGB32      // 32-bit color
{
    TUInt8  b,g,r,a;          // a is alpha
};
struct TPicRegion   // describes a block of color data, for easy parameter passing
{
    TARGB32*    pdata;         // address of the first pixel of the color data
    long        byte_width;    // physical width of one row of data, in bytes;
                               // abs(byte_width) may be greater than or equal to width*sizeof(TARGB32);
    long        width;         // width in pixels
    long        height;        // height in pixels
};
Then the function that accesses a single pixel can be written as:
inline TARGB32& Pixels(const TPicRegion& pic,const long x,const long y)
{
    return ( (TARGB32*)((TUInt8*)pic.pdata+pic.byte_width*y) )[x];
}
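Because Pixels steps between rows by byte_width rather than by width, a region can describe rows with padding, and (as the abs(byte_width) remark suggests) rows stored bottom-up with a negative byte_width. The following is a minimal sketch of that idea, not code from the original article; FlipVertically is a hypothetical helper introduced only for illustration, and the types are repeated so the block compiles on its own.

```cpp
#include <cassert>
#include <vector>

typedef unsigned char TUInt8;
struct TARGB32 { TUInt8 b, g, r, a; };
struct TPicRegion { TARGB32* pdata; long byte_width; long width; long height; };

inline TARGB32& Pixels(const TPicRegion& pic, const long x, const long y)
{
    return ((TARGB32*)((TUInt8*)pic.pdata + pic.byte_width * y))[x];
}

// Hypothetical helper (not in the article): views the same pixels bottom-up
// by pointing pdata at the last row and negating the row stride.
inline TPicRegion FlipVertically(const TPicRegion& pic)
{
    assert(pic.height > 0);
    TPicRegion flipped = pic;
    flipped.pdata = &Pixels(pic, 0, pic.height - 1);
    flipped.byte_width = -pic.byte_width;
    return flipped;
}
```

No pixel data is copied: both regions address the same buffer, and only the starting pointer and the sign of the stride differ.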
B: Scaling principle and formulas:
    Scaled picture                   Source picture
    (width DW, height DH)            (width SW, height SH)
    (Sx-0)/(SW-0) = (Dx-0)/(DW-0)    (Sy-0)/(SH-0) = (Dy-0)/(DH-0)
 => Sx = Dx*SW/DW                    Sy = Dy*SH/DH
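As a worked instance of the formula, the sketch below maps a destination coordinate back to its source coordinate using integer arithmetic. SrcCoord is a hypothetical name used only here; the truncating integer division is what makes this nearest-neighbor sampling.

```cpp
#include <cassert>

// Sx = Dx*SW/DW, with integer division truncating toward zero.
inline long SrcCoord(long d, long srcSize, long dstSize)
{
    assert(dstSize > 0 && d >= 0 && d < dstSize);
    return d * srcSize / dstSize;
}
```

For the 800 to 1024 test case, destination column 512 reads source column 512*800/1024 = 400, and the last destination column 1023 reads source column 799, so the mapping never leaves the source picture.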
C: A reference implementation of the scaling algorithm
Here is the simplest scaling function (the interpolation is nearest-neighbor sampling, and I have "tried" to write it as slowly as I can :D)
Src.pdata points to the source data area, and Dst.pdata points to the destination data area
The function scales the Src.width*Src.height picture into the Dst.width*Dst.height area
void PicZoom0(const TPicRegion& Dst,const TPicRegion& Src)
{
    if (  (0==Dst.width)||(0==Dst.height)
        ||(0==Src.width)||(0==Src.height)) return;
    for (long x=0;x<Dst.width;++x)
    {
        for (long y=0;y<Dst.height;++y)
        {
            long srcx=(x*Src.width/Dst.width);
            long srcy=(y*Src.height/Dst.height);
            Pixels(Dst,x,y)=Pixels(Src,srcx,srcy);
        }
    }
}
////////////////////////////////////////////////////////////////////////////////
Speed test:
//==============================================================================
PicZoom0 19.4 fps
////////////////////////////////////////////////////////////////////////////////
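To make the behavior of PicZoom0 concrete, here is a small usage sketch that scales a 2*2 picture to 4*4. PicZoom0 and the types are repeated from the text so the block compiles on its own; MakeRegion is a hypothetical helper that assumes tightly packed rows (byte_width == width*sizeof(TARGB32)).

```cpp
#include <cassert>
#include <vector>

typedef unsigned char TUInt8;
struct TARGB32 { TUInt8 b, g, r, a; };
struct TPicRegion { TARGB32* pdata; long byte_width; long width; long height; };

inline TARGB32& Pixels(const TPicRegion& pic, const long x, const long y)
{
    return ((TARGB32*)((TUInt8*)pic.pdata + pic.byte_width * y))[x];
}

// PicZoom0 as given in the text: nearest-neighbor sampling, column-first loops.
void PicZoom0(const TPicRegion& Dst, const TPicRegion& Src)
{
    if ((0 == Dst.width) || (0 == Dst.height)
        || (0 == Src.width) || (0 == Src.height)) return;
    for (long x = 0; x < Dst.width; ++x)
        for (long y = 0; y < Dst.height; ++y)
            Pixels(Dst, x, y) = Pixels(Src, x * Src.width / Dst.width,
                                            y * Src.height / Dst.height);
}

// Hypothetical helper: wraps a tightly packed pixel vector in a TPicRegion.
inline TPicRegion MakeRegion(std::vector<TARGB32>& buf, long width, long height)
{
    assert(width > 0 && height > 0 && (long)buf.size() == width * height);
    TPicRegion region = { &buf[0], width * (long)sizeof(TARGB32), width, height };
    return region;
}
```

Calling PicZoom0(MakeRegion(dst, 4, 4), MakeRegion(src, 2, 2)) turns each source pixel into a 2*2 block of identical destination pixels, which is the blocky look typical of nearest-neighbor enlargement.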
D: Optimizing the PicZoom0 function
a. PicZoom0 does not read and write in the order in which the color data is laid out in memory (the inner loop increments y, the row index),
so CPU cache prefetching fails and memory access thrashes, causing a large performance loss. (Much hardware behaves this way,
including caches, memory, video memory, and hard disks: it is optimized for sequential access, and random access can be far slower.)
So first swap the order of the x and y loops:
void PicZoom1(const TPicRegion& Dst,const TPicRegion& Src)
{
    if (  (0==Dst.width)||(0==Dst.height)
        ||(0==Src.width)||(0==Src.height)) return;
    for (long y=0;y<Dst.height;++y)
    {
        for (long x=0;x<Dst.width;++x)
        {
            long srcx=(x*Src.width/Dst.width);
            long srcy=(y*Src.height/Dst.height);
            Pixels(Dst,x,y)=Pixels(Src,srcx,srcy);
        }
    }
}
////////////////////////////////////////////////////////////////////////////////
Speed test:
//==============================================================================
PicZoom1 30.1 fps
////////////////////////////////////////////////////////////////////////////////
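The effect of the loop order on memory access can be illustrated by counting how many distinct cache lines the first few destination writes touch. This is only a model, not a measurement: it assumes 64-byte cache lines, tightly packed 32-bit pixels, and a buffer that starts on a cache-line boundary, and CacheLinesTouched is a hypothetical name.

```cpp
#include <cassert>
#include <set>

// Counts the distinct cache lines touched by the first `writes` destination
// writes, for either loop order. lineBytes = 64 is an assumption.
inline long CacheLinesTouched(long width, long height, bool innerLoopOverY,
                              long writes, long lineBytes = 64)
{
    assert(width > 0 && height > 0 && lineBytes > 0);
    const long byte_width = width * 4;           // 4 bytes per 32-bit pixel
    std::set<long> lines;
    for (long i = 0; i < writes; ++i)
    {
        const long x = innerLoopOverY ? i / height : i % width;
        const long y = innerLoopOverY ? i % height : i / width;
        lines.insert((y * byte_width + x * 4) / lineBytes);
    }
    return (long)lines.size();
}
```

Under these assumptions, 16 consecutive writes into a 1024*768 destination touch 16 different cache lines in PicZoom0's order (inner loop over y) but only one in PicZoom1's order (inner loop over x).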
b. The expression "(x*Src.width/Dst.width)" contains a division, which is a very slow operation (dozens of times slower than
ordinary addition and subtraction!); fixed-point arithmetic is used to move it out of the loops:
void PicZoom2(const TPicRegion& Dst,const TPicRegion& Src)
{
    if (  (0==Dst.width)||(0==Dst.height)
        ||(0==Src.width)||(0==Src.height)) return;
    //The maximum picture size the function can handle is 65536*65536
    unsigned long xrIntFloat_16=(Src.width<<16)/Dst.width+1;   //16.16 format fixed-point number
    unsigned long yrIntFloat_16=(Src.height<<16)/Dst.height+1; //16.16 format fixed-point number
    //It can be shown that ((Dst.width-1)*xrIntFloat_16)>>16 < Src.width holds
    //whenever Dst.width*(Dst.width-1) < Src.width*65536
    for (unsigned long y=0;y<Dst.height;++y)
    {
        for (unsigned long x=0;x<Dst.width;++x)
        {
            unsigned long srcx=(x*xrIntFloat_16)>>16;
            unsigned long srcy=(y*yrIntFloat_16)>>16;
            Pixels(Dst,x,y)=Pixels(Src,srcx,srcy);
        }
    }
}
////////////////////////////////////////////////////////////////////////////////
Speed test:
//==============================================================================
PicZoom2 185.8 fps
////////////////////////////////////////////////////////////////////////////////
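The 16.16 fixed-point step can be checked against the division it replaces. The sketch below uses std::uint32_t to mimic the 32-bit unsigned long of VC2005 (an assumption made so the arithmetic is the same on 64-bit compilers); FixedStep16, MapFixed16 and MatchesDivision are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Step in 16.16 fixed point: (src << 16) / dst + 1, as in PicZoom2.
inline std::uint32_t FixedStep16(std::uint32_t srcSize, std::uint32_t dstSize)
{
    assert(srcSize > 0 && srcSize < 65536 && dstSize > 0);
    return (srcSize << 16) / dstSize + 1;
}

// Source index for destination index x: one multiply and one shift.
inline std::uint32_t MapFixed16(std::uint32_t x, std::uint32_t step)
{
    return (x * step) >> 16;
}

// True if the fixed-point mapping agrees with x*src/dst for every x.
inline bool MatchesDivision(std::uint32_t srcSize, std::uint32_t dstSize)
{
    const std::uint32_t step = FixedStep16(srcSize, dstSize);
    for (std::uint32_t x = 0; x < dstSize; ++x)
        if (MapFixed16(x, step) != x * srcSize / dstSize) return false;
    return true;
}
```

For 800 to 1024 the step is 51201 and every column maps to exactly the same source column as the division-based version. That equality is not guaranteed for every size pair: the +1 adds a small upward bias, which is the "scaling result" effect the fixed-point version trades for speed.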
c. Inside the x loop, y is constant, so the values that depend only on y can be computed in advance:
1. the value of srcy does not depend on x and can be hoisted out of the x loop;
2. expand the Pixels function inline to optimize the pointer computation that depends on y:
void PicZoom3(const TPicRegion& Dst,const TPicRegion& Src)
{
    if (  (0==Dst.width)||(0==Dst.height)
        ||(0==Src.width)||(0==Src.height)) return;
    unsigned long xrIntFloat_16=(Src.width<<16)/Dst.width+1;
    unsigned long yrIntFloat_16=(Src.height<<16)/Dst.height+1;
    unsigned long dst_width=Dst.width;
    TARGB32* pDstLine=Dst.pdata;
    unsigned long srcy_16=0;
    for (unsigned long y=0;y<Dst.height;++y)
    {
        TARGB32* pSrcLine=((TARGB32*)((TUInt8*)Src.pdata+Src.byte_width*(srcy_16>>16)));
        unsigned long srcx_16=0;
        for (unsigned long x=0;x<dst_width;++x)
        {
            pDstLine[x]=pSrcLine[srcx_16>>16];
            srcx_16+=xrIntFloat_16;
        }
        srcy_16+=yrIntFloat_16;
        ((TUInt8*&)pDstLine)+=Dst.byte_width;
    }
}
////////////////////////////////////////////////////////////////////////////////
Speed test:
//==============================================================================
PicZoom3 414.4 fps
////////////////////////////////////////////////////////////////////////////////
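Besides hoisting the row pointer, PicZoom3 replaces the per-pixel multiplication x*xrIntFloat_16 with a running sum srcx_16. The single-row sketch below isolates that inner loop; it works on plain int samples instead of TARGB32 to stay short, and ScaleRowIncremental is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Scales one row with nearest-neighbor sampling. The source position is kept
// as a 16.16 fixed-point accumulator that advances by `step` per output pixel.
inline std::vector<int> ScaleRowIncremental(const std::vector<int>& src,
                                            std::uint32_t dstWidth)
{
    assert(!src.empty() && src.size() < 65536 && dstWidth > 0);
    const std::uint32_t step = ((std::uint32_t)src.size() << 16) / dstWidth + 1;
    std::vector<int> dst(dstWidth);
    std::uint32_t srcx_16 = 0;                  // equals x*step at every iteration
    for (std::uint32_t x = 0; x < dstWidth; ++x)
    {
        dst[x] = src[srcx_16 >> 16];
        srcx_16 += step;
    }
    return dst;
}
```

Scaling the row {10, 20, 30, 40} to width 8 gives {10, 10, 20, 20, 30, 30, 40, 40}; the inner loop now contains only a shift, an indexed load, a store and an addition.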
d. Fixed-point optimization limits the maximum picture size the function can handle and has a small effect on the scaling result
(imperceptible to the naked eye); here is a version that uses floating-point arithmetic, which can be used where needed:
void PicZoom3_float(const TPicRegion& Dst,const TPicRegion& Src)
{
    //Note: this function requires FPU support
    if (  (0==Dst.width)||(0==Dst.height)
        ||(0==Src.width)||(0==Src.height)) return;
    double xrFloat=1.000000001/((double)Dst.width/Src.width);
    double yrFloat=1.000000001/((double)Dst.height/Src.height);
    unsigned short RC_Old;
    unsigned short RC_Edit;
    asm  //set the FPU rounding mode so that the fist instruction can be used directly
    {
        FNSTCW  RC_Old             // save the coprocessor control word, used to restore it later
        FNSTCW  RC_Edit            // save the coprocessor control word, to be modified
        FWAIT
        OR      RC_Edit, 0x0F00    // set RC=11 so that the FPU rounds toward zero
        FLDCW   RC_Edit            // load the coprocessor control word with the modified RC field
    }
    unsigned long dst_width=Dst.width;
    TARGB32* pDstLine=Dst.pdata;
    double srcy=0;
    for (unsigned long y=0;y<Dst.height;++y)
    {
        TARGB32* pSrcLine=((TARGB32*)((TUInt8*)Src.pdata+Src.byte_width*((long)srcy)));
        /*
        double srcx=0;
        for (unsigned long x=0;x<dst_width;++x)
        {
            pDstLine[x]=pSrcLine[(unsigned long)srcx]; //the default float-to-integer conversion is a very slow
                                                       //operation! That is why inline assembly that drives the FPU directly is used.
            srcx+=xrFloat;
        }*/
        asm fld       xrFloat            //st0==xrFloat
        asm fldz                         //st0==0   st1==xrFloat
        unsigned long srcx=0;
        for (long x=0;x<dst_width;++x)
        {
            asm fist dword ptr srcx      //srcx=(long)st0
            pDstLine[x]=pSrcLine[srcx];
            asm fadd  st,st(1)           //st0+=st1   st1==xrFloat
        }
        asm fstp      st
        asm fstp      st
        srcy+=yrFloat;
        ((TUInt8*&)pDstLine)+=Dst.byte_width;
    }
    asm  //restore the FPU rounding mode
    {
        FWAIT
        FLDCW   RC_Old
    }
}
////////////////////////////////////////////////////////////////////////////////
Speed test:
//==============================================================================
PicZoom3_float 286.2 fps
////////////////////////////////////////////////////////////////////////////////
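The __asm blocks in PicZoom3_float only build for 32-bit x86 with a compiler that accepts this inline assembly syntax (MSVC, for instance, does not support inline assembly on x64 targets). Where that is not available, the commented-out loop with a plain cast is the portable form of the same mapping. The sketch below shows just the column mapping for one row; MapRowFloat is a hypothetical name, and whether the cast is slow depends on the target (the author's remark about slow conversion concerns the x87 FPU).

```cpp
#include <cassert>
#include <vector>

// Portable form of the floating-point column mapping: the source position is
// accumulated as a double and truncated toward zero by the cast.
inline std::vector<long> MapRowFloat(long srcWidth, long dstWidth)
{
    assert(srcWidth > 0 && dstWidth > 0);
    // Same scale factor as PicZoom3_float, including the tiny upward bias.
    const double xrFloat = 1.000000001 / ((double)dstWidth / srcWidth);
    std::vector<long> indices(dstWidth);
    double srcx = 0;
    for (long x = 0; x < dstWidth; ++x)
    {
        indices[x] = (long)srcx;   // C++ casts truncate, like RC=11 with fist
        srcx += xrFloat;
    }
    return indices;
}
```

The 1.000000001 factor nudges each position slightly upward, so a position that should land exactly on an integer (such as column 512 mapping to 400 in the 800 to 1024 case) is not pulled down to the previous pixel by accumulated rounding error.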
e. Note that the scaling ratio is the same for every row, so a scaling mapping table can be built in advance
to handle the coordinate mapping (PicZoom3_table is equivalent to PicZoom3_float in implementation):
void PicZoom3_table(const TPicRegion& Dst,const TPicRegion& Src)
{
    if (  (0==Dst.width)||(0==Dst.height)
        ||(0==Src.width)||(0==Src.height)) return;
    unsigned long dst_width=Dst.width;
    unsigned long* SrcX_Table = new unsigned long[dst_width];
    for (unsigned long x=0;x<dst_width;++x) //generate the table SrcX_Table
    {
        SrcX_Table[x]=(x*Src.width/Dst.width);
    }
    TARGB32* pDstLine=Dst.pdata;
    for (unsigned long y=0;y<Dst.height;++y)
    {
        unsigned long srcy=(y*Src.height/Dst.height);
        TARGB32* pSrcLine=((TARGB32*)((TUInt8*)Src.pdata+Src.byte_width*srcy));
        for (unsigned long x=0;x<dst_width;++x)
            pDstLine[x]=pSrcLine[SrcX_Table[x]];
        ((TUInt8*&)pDstLine)+=Dst.byte_width;
    }
    delete [] SrcX_Table;
}
////////////////////////////////////////////////////////////////////////////////
Speed test:
//==============================================================================
PicZoom3_table 390.1 fps
////////////////////////////////////////////////////////////////////////////////
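The table itself is easy to inspect in isolation. The sketch below builds the same mapping with std::vector instead of new[]/delete[], which is a substitution of mine rather than the article's code: the vector releases its memory automatically, so no cleanup call is needed. BuildSrcXTable is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

// Builds the destination-column to source-column table once; every row of the
// picture can then reuse it, so the division runs dstWidth times in total
// instead of once per pixel.
inline std::vector<unsigned long> BuildSrcXTable(unsigned long srcWidth,
                                                 unsigned long dstWidth)
{
    assert(srcWidth > 0 && dstWidth > 0);
    std::vector<unsigned long> table(dstWidth);
    for (unsigned long x = 0; x < dstWidth; ++x)
        table[x] = x * srcWidth / dstWidth;
    return table;
}
```

For a 4-pixel source row and an 8-pixel destination row the table is {0, 0, 1, 1, 2, 2, 3, 3}. Because it uses exact integer division, the table has none of the rounding bias of the fixed-point step; the cost is an extra memory read per pixel in the inner loop.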
f. To speed up scaling further, a function specialized for the scaling ratio can be generated dynamically at run time, which yields an even faster scaling function.
This is somewhat like how a compiler works, and implementing it is laborious (and rather obscure).
(Dynamic generation is a good idea, but I personally feel it is not worth implementing just for scaling.)