Image scaling algorithm

Source: Internet
Author: User
Abstract: First, a basic image scaling algorithm is given; then its speed and scaling quality are optimized step by step.

The full text of "High-quality, fast image scaling" is divided into:
Part 1: nearest neighbor sampling interpolation and its speed optimization
Part 2: bilinear interpolation and bicubic convolution interpolation
Part 3: trilinear interpolation and mipmap chains

Body:

For ease of discussion, only 32-bit ARGB color is handled here;
The code is written in C++; where assembly optimization is involved, the x86 platform is assumed; the compiler used is VC2005;
For the sake of readability, no exception handling code is included;
The CPUs used for testing are an AMD64x2 4200+ (2.37G) and an Intel Core2 4400 (2.00G);


Speed test description:
Only scaling from memory data to memory data is tested.
The test image is scaled from 800*600 to 1024*768; fps is the number of frames per second, and a higher value means a faster function.

////////////////////////////////////////////////////////////////////////////////
Reference speeds of the related Windows GDI functions:
//==============================================================================
BitBlt      544.7 fps  // copy 800*600 to 800*600
BitBlt      331.6 fps  // copy 1024*1024 to 1024*1024
StretchBlt  232.7 fps  // zoom 800*600 to 1024*1024
////////////////////////////////////////////////////////////////////////////////

A: First define the image data structures:
#define ASM __asm

typedef unsigned char TUInt8; // [0..255]
struct TARGB32      // 32-bit color
{
    TUInt8  B,G,R,A;          // A is alpha
};

struct tpicregion   // describes a block of color data, for easy parameter passing
{
    TARGB32*  pdata;         // address of the first color datum
    long      byte_width;    // physical width of one row of data (in bytes);
                             // abs(byte_width) may be greater than or equal to width*sizeof(TARGB32);
    long      width;         // width in pixels
    long      height;        // height in pixels
};

A function that accesses a single pixel can then be written as:
inline TARGB32& Pixels(const tpicregion& pic, const long x, const long y)
{
    return ((TARGB32*)((TUInt8*)pic.pdata+pic.byte_width*y))[x];
}
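
As a minimal sketch of why byte_width is stored separately from width, the following builds two views of one buffer: a top-down view whose rows are padded, and a bottom-up view that uses a negative byte_width. The helper names `makeTopDown` and `makeBottomUp`, the 3x2 size and the one-pixel row padding are assumptions made for illustration; they are not part of the original code.

```cpp
#include <cassert>
#include <vector>

typedef unsigned char TUInt8;
struct TARGB32 { TUInt8 B, G, R, A; };

struct tpicregion {
    TARGB32* pdata;      // address of the first pixel of row 0
    long     byte_width; // bytes from one row to the next; may be negative
    long     width;      // width in pixels
    long     height;     // height in pixels
};

inline TARGB32& Pixels(const tpicregion& pic, const long x, const long y) {
    return ((TARGB32*)((TUInt8*)pic.pdata + pic.byte_width * y))[x];
}

// Hypothetical helper: a 3x2 image stored in rows of 4 pixels
// (3 visible pixels plus 1 pixel of padding, so byte_width is 16).
inline tpicregion makeTopDown(std::vector<TARGB32>& buf) {
    tpicregion r = { buf.data(), 16, 3, 2 };
    return r;
}

// Hypothetical helper: the same buffer viewed bottom-up.
// Row 0 of this view is the last stored row, and rows step backwards.
inline tpicregion makeBottomUp(std::vector<TARGB32>& buf) {
    tpicregion r = { buf.data() + 4, -16, 3, 2 };
    return r;
}
```

Because Pixels does its row arithmetic in bytes, both views reach the same memory without any copying.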
 

B: Scaling principle and formulas:

Scaled image: width DW, height DH
Original image: width SW, height SH

(Sx-0)/(SW-0) = (Dx-0)/(DW-0)      (Sy-0)/(SH-0) = (Dy-0)/(DH-0)
=>  Sx = Dx*SW/DW                   Sy = Dy*SH/DH
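
To make the formula concrete, here is a small sketch that maps a destination coordinate back to a source coordinate. The helper name `mapToSource` is hypothetical. The division is done in integers, so the result is truncated, which is what makes this nearest neighbor sampling.

```cpp
#include <cassert>

// Sx = Dx*SW/DW evaluated in integers; the result is truncated toward zero.
inline long mapToSource(long d, long srcSize, long dstSize) {
    return d * srcSize / dstSize;
}
```

For the 800 to 1024 test case, destination column 512 maps to source column 400 and the last destination column, 1023, maps to source column 799.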

C: A reference implementation of the scaling algorithm

Here is the simplest scaling function (the interpolation is nearest neighbor sampling, and I have "tried" to write it slowly :D)
// Src.pdata points to the source data area, Dst.pdata points to the destination data area
// The function scales an image of size Src.width*Src.height into the Dst.width*Dst.height area
void PicZoom0(const tpicregion& Dst, const tpicregion& Src)
{
    if (  (0==Dst.width)||(0==Dst.height)
        ||(0==Src.width)||(0==Src.height)) return;
    for (long x=0;x<Dst.width;++x)
    {
        for (long y=0;y<Dst.height;++y)
        {
            long srcx=(x*Src.width/Dst.width);
            long srcy=(y*Src.height/Dst.height);
            Pixels(Dst,x,y)=Pixels(Src,srcx,srcy);
        }
    }
}

////////////////////////////////////////////////////////////////////////////////
Speed test:
//==============================================================================
PicZoom0 19.4 fps
////////////////////////////////////////////////////////////////////////////////


D: Optimizing the PicZoom0 function

a. The PicZoom0 function does not read and write the color data in the order in which it is laid out in memory (the inner loop increments the row index y).
This causes CPU cache prefetch failures and memory thrashing, which leads to a significant performance loss. (Much hardware has this property,
including caches, memory, video memory and hard disks: sequential access is optimized, while random access can cause a huge loss of performance.)
So first swap the order of the x and y loops:
void PicZoom1(const tpicregion& Dst, const tpicregion& Src)
{
    if (  (0==Dst.width)||(0==Dst.height)
        ||(0==Src.width)||(0==Src.height)) return;
    for (long y=0;y<Dst.height;++y)
    {
        for (long x=0;x<Dst.width;++x)
        {
            long srcx=(x*Src.width/Dst.width);
            long srcy=(y*Src.height/Dst.height);
            Pixels(Dst,x,y)=Pixels(Src,srcx,srcy);
        }
    }
}

////////////////////////////////////////////////////////////////////////////////
Speed test:
//==============================================================================
PicZoom1 30.1 fps
////////////////////////////////////////////////////////////////////////////////
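
The effect of loop order can be observed in isolation with a sketch like the one below, which sums the same buffer once row by row and once column by column and times each pass. The function names are hypothetical, and the measured times depend heavily on the CPU, the cache sizes and the compiler settings, so no particular ratio is implied.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sum a width*height buffer; rowMajor selects which index the inner loop advances.
inline std::uint64_t sumPixels(const std::vector<std::uint32_t>& img,
                               std::size_t width, std::size_t height, bool rowMajor) {
    std::uint64_t sum = 0;
    if (rowMajor) {
        for (std::size_t y = 0; y < height; ++y)
            for (std::size_t x = 0; x < width; ++x)
                sum += img[y * width + x];   // consecutive addresses
    } else {
        for (std::size_t x = 0; x < width; ++x)
            for (std::size_t y = 0; y < height; ++y)
                sum += img[y * width + x];   // jumps one full row per step
    }
    return sum;
}

// Time one traversal in milliseconds and return the sum through sumOut.
inline double timeTraversalMs(const std::vector<std::uint32_t>& img,
                              std::size_t width, std::size_t height,
                              bool rowMajor, std::uint64_t& sumOut) {
    const auto t0 = std::chrono::steady_clock::now();
    sumOut = sumPixels(img, width, height, rowMajor);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Both traversals visit exactly the same pixels and return the same sum; only the order of memory accesses differs.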

b. The expression "(x*Src.width/Dst.width)" contains a division, which is a very slow operation (dozens of times slower than ordinary
addition and subtraction!). It can be optimized with fixed-point numbers:
void PicZoom2(const tpicregion& Dst, const tpicregion& Src)
{
    if (  (0==Dst.width)||(0==Dst.height)
        ||(0==Src.width)||(0==Src.height)) return;
    // the maximum image size this function can handle is 65536*65536
    unsigned long xrintfloat_16=(Src.width<<16)/Dst.width+1;   // 16.16 fixed-point number
    unsigned long yrintfloat_16=(Src.height<<16)/Dst.height+1; // 16.16 fixed-point number
    // it can be proved that (Dst.width-1)*xrintfloat_16 < (Src.width<<16) holds,
    // provided that Dst.width*Dst.width <= (Src.width<<16), so srcx stays inside the source row
    for (unsigned long y=0;y<Dst.height;++y)
    {
        for (unsigned long x=0;x<Dst.width;++x)
        {
            unsigned long srcx=(x*xrintfloat_16)>>16;
            unsigned long srcy=(y*yrintfloat_16)>>16;
            Pixels(Dst,x,y)=Pixels(Src,srcx,srcy);
        }
    }
}

////////////////////////////////////////////////////////////////////////////////
Speed test:
//==============================================================================
PicZoom2 185.8 fps
////////////////////////////////////////////////////////////////////////////////
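
As a rough check of the 16.16 fixed-point mapping, the sketch below compares it with the plain integer division for every destination index. All helper names are hypothetical, and 32-bit unsigned arithmetic is assumed, so the sizes must stay small enough that `srcSize << 16` and `d * step16` do not overflow.

```cpp
#include <cassert>
#include <cstdint>

// 16.16 fixed-point step, computed the same way as in PicZoom2.
inline std::uint32_t fixedStep16(std::uint32_t srcSize, std::uint32_t dstSize) {
    return (srcSize << 16) / dstSize + 1;
}

// Source index for destination index d, using the fixed-point step.
inline std::uint32_t mapFixed(std::uint32_t d, std::uint32_t step16) {
    return (d * step16) >> 16;
}

// Reference mapping using an integer division per index.
inline std::uint32_t mapDivide(std::uint32_t d, std::uint32_t srcSize, std::uint32_t dstSize) {
    return d * srcSize / dstSize;
}

// Number of destination indices for which the two mappings disagree.
inline std::uint32_t countMismatches(std::uint32_t srcSize, std::uint32_t dstSize) {
    const std::uint32_t step16 = fixedStep16(srcSize, dstSize);
    std::uint32_t mismatches = 0;
    for (std::uint32_t d = 0; d < dstSize; ++d)
        if (mapFixed(d, step16) != mapDivide(d, srcSize, dstSize)) ++mismatches;
    return mismatches;
}
```

For 800 to 1024 the step is 51201 and the two mappings agree at every index. By contrast, `mapFixed(1023, fixedStep16(1, 1024))` evaluates to 1, which is past the end of a one-pixel row, so the extra +1 in the step is only safe for moderate enlargement factors.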

c. Inside the x loop, y is constant, so the values that depend only on y can be computed in advance:
1. the value of srcy does not depend on x and can be moved out of the x loop;
2. the Pixels function is expanded inline so that the y-related pointer calculation is done once per row.
void PicZoom3(const tpicregion& Dst, const tpicregion& Src)
{
    if (  (0==Dst.width)||(0==Dst.height)
        ||(0==Src.width)||(0==Src.height)) return;
    unsigned long xrintfloat_16=(Src.width<<16)/Dst.width+1;
    unsigned long yrintfloat_16=(Src.height<<16)/Dst.height+1;

    unsigned long dst_width=Dst.width;
    TARGB32* pdstline=Dst.pdata;
    unsigned long srcy_16=0;
    for (unsigned long y=0;y<Dst.height;++y)
    {
        TARGB32* psrcline=((TARGB32*)((TUInt8*)Src.pdata+Src.byte_width*(srcy_16>>16)));
        unsigned long srcx_16=0;
        for (unsigned long x=0;x<dst_width;++x)
        {
            pdstline[x]=psrcline[srcx_16>>16];
            srcx_16+=xrintfloat_16;
        }
        srcy_16+=yrintfloat_16;
        ((TUInt8*&)pdstline)+=Dst.byte_width;
    }
}

////////////////////////////////////////////////////////////////////////////////
Speed test:
//==============================================================================
PicZoom3 414.4 fps
////////////////////////////////////////////////////////////////////////////////
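
The same row-hoisting idea can be tried on a tiny image with a portable sketch such as the following. `Region32`, `rowPtr` and `zoomNearestFixed` are hypothetical names; pixels are treated as packed 32-bit values and fixed-width integer types are used, which is an assumption that differs from the `unsigned long` code above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the article's region type; pixels are packed 32-bit values.
struct Region32 {
    std::uint32_t* pdata;      // first pixel of row 0
    long           byte_width; // bytes between consecutive rows
    long           width;
    long           height;
};

inline std::uint32_t* rowPtr(const Region32& r, long y) {
    return (std::uint32_t*)((unsigned char*)r.pdata + r.byte_width * y);
}

// Nearest neighbor zoom with 16.16 fixed-point stepping (assumes sizes below 65536).
inline void zoomNearestFixed(const Region32& dst, const Region32& src) {
    if (dst.width == 0 || dst.height == 0 || src.width == 0 || src.height == 0) return;
    const std::uint32_t xStep16 = (std::uint32_t)(((std::uint64_t)src.width << 16) / dst.width + 1);
    const std::uint32_t yStep16 = (std::uint32_t)(((std::uint64_t)src.height << 16) / dst.height + 1);
    std::uint32_t srcY16 = 0;
    for (long y = 0; y < dst.height; ++y) {
        // The source row depends on y only, so it is looked up once per row.
        const std::uint32_t* srcLine = rowPtr(src, (long)(srcY16 >> 16));
        std::uint32_t* dstLine = rowPtr(dst, y);
        std::uint32_t srcX16 = 0;
        for (long x = 0; x < dst.width; ++x) {
            dstLine[x] = srcLine[srcX16 >> 16];
            srcX16 += xStep16; // one addition per pixel, no multiply or divide
        }
        srcY16 += yStep16;
    }
}
```

Scaling a 2x2 image with pixel values 1, 2, 3, 4 to 4x4 with this sketch duplicates every pixel into a 2x2 block.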

d. Fixed-point optimization limits the maximum image size that the function can handle and has a small effect on the scaling result
(imperceptible to the naked eye). Here is a version that uses floating-point arithmetic, which can be used where needed:
void PicZoom3_float(const tpicregion& Dst, const tpicregion& Src)
{
    // note: this function requires FPU support
    if (  (0==Dst.width)||(0==Dst.height)
        ||(0==Src.width)||(0==Src.height)) return;
    double xrfloat=1.000000001/((double)Dst.width/Src.width);
    double yrfloat=1.000000001/((double)Dst.height/Src.height);

    unsigned short rc_old;
    unsigned short rc_edit;
    ASM  // set the FPU rounding mode so that the fist instruction can be used directly
    {
        FNSTCW  rc_old             // save the coprocessor control word, used to restore it later
        FNSTCW  rc_edit            // save the coprocessor control word, used for modification
        FWAIT
        OR      rc_edit, 0x0F00    // set RC=11 so that the FPU rounds toward zero
        FLDCW   rc_edit            // load the control word; the RC field is now modified
    }

    unsigned long dst_width=Dst.width;
    TARGB32* pdstline=Dst.pdata;
    double srcy=0;
    for (unsigned long y=0;y<Dst.height;++y)
    {
        TARGB32* psrcline=((TARGB32*)((TUInt8*)Src.pdata+Src.byte_width*((long)srcy)));
        /*
        double srcx=0;
        for (unsigned long x=0;x<dst_width;++x)
        {
            pdstline[x]=psrcline[(unsigned long)srcx]; // the default floating-point to integer conversion is a very slow
                                                       // operation! That is why inline assembly that drives the FPU directly is used.
            srcx+=xrfloat;
        }*/
        ASM fld       xrfloat            // st0==xrfloat
        ASM fldz                         // st0==0   st1==xrfloat
        unsigned long srcx=0;
        for (unsigned long x=0;x<dst_width;++x)
        {
            ASM fist dword ptr srcx      // srcx=(long)st0
            pdstline[x]=psrcline[srcx];
            ASM fadd  st,st(1)           // st0+=st1   st1==xrfloat
        }
        ASM fstp      st
        ASM fstp      st

        srcy+=yrfloat;
        ((TUInt8*&)pdstline)+=Dst.byte_width;
    }

    ASM  // restore the FPU rounding mode
    {
        FWAIT
        FLDCW   rc_old
    }
}

////////////////////////////////////////////////////////////////////////////////
Speed test:
//==============================================================================
PicZoom3_float 286.2 fps
////////////////////////////////////////////////////////////////////////////////
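
The inline assembly in the floating-point version is tied to 32-bit x86 and the VC compiler. As a portable sketch of the same stepping scheme, the variant below leaves the double-to-integer conversion to the compiler. `Region32`, `rowPtr` and `zoomNearestFloat` are hypothetical names, and whether the plain cast is fast enough is an assumption that depends on the target; on targets with SSE2 a truncating conversion is typically a single instruction, but this should be measured rather than taken for granted.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the article's region type; pixels are packed 32-bit values.
struct Region32 {
    std::uint32_t* pdata;      // first pixel of row 0
    long           byte_width; // bytes between consecutive rows
    long           width;
    long           height;
};

inline std::uint32_t* rowPtr(const Region32& r, long y) {
    return (std::uint32_t*)((unsigned char*)r.pdata + r.byte_width * y);
}

// Nearest neighbor zoom that steps through the source with double-precision increments.
inline void zoomNearestFloat(const Region32& dst, const Region32& src) {
    if (dst.width == 0 || dst.height == 0 || src.width == 0 || src.height == 0) return;
    // The 1.000000001 factor is taken from the article; it nudges each step slightly upward.
    const double xStep = 1.000000001 / ((double)dst.width / src.width);
    const double yStep = 1.000000001 / ((double)dst.height / src.height);
    double srcY = 0;
    for (long y = 0; y < dst.height; ++y) {
        const std::uint32_t* srcLine = rowPtr(src, (long)srcY);
        std::uint32_t* dstLine = rowPtr(dst, y);
        double srcX = 0;
        for (long x = 0; x < dst.width; ++x) {
            dstLine[x] = srcLine[(long)srcX]; // the cast truncates toward zero
            srcX += xStep;
        }
        srcY += yStep;
    }
}
```

For an 800-pixel row scaled to 1024 pixels, the last destination pixel reads source index 799, so the accumulated step stays inside the row for this size.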


e. Note that the scaling ratio is the same for every row, so a scaling mapping table can be built in advance
and used to do the mapping (PicZoom3_table and PicZoom3_float are equivalent implementations):
void PicZoom3_table(const tpicregion& Dst, const tpicregion& Src)
{
    if (  (0==Dst.width)||(0==Dst.height)
        ||(0==Src.width)||(0==Src.height)) return;
    unsigned long dst_width=Dst.width;
    unsigned long* srcx_table=new unsigned long[dst_width];
    for (unsigned long x=0;x<dst_width;++x) // build the table srcx_table
    {
        srcx_table[x]=(x*Src.width/Dst.width);
    }

    TARGB32* pdstline=Dst.pdata;
    for (unsigned long y=0;y<Dst.height;++y)
    {
        unsigned long srcy=(y*Src.height/Dst.height);
        TARGB32* psrcline=((TARGB32*)((TUInt8*)Src.pdata+Src.byte_width*srcy));
        for (unsigned long x=0;x<dst_width;++x)
            pdstline[x]=psrcline[srcx_table[x]];
        ((TUInt8*&)pdstline)+=Dst.byte_width;
    }

    delete [] srcx_table;
}

////////////////////////////////////////////////////////////////////////////////
Speed test:
//==============================================================================
PicZoom3_table 390.1 fps
////////////////////////////////////////////////////////////////////////////////
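
The table idea can also be expressed with std::vector, so that the table is released automatically. This is only a sketch under that assumption, not the article's implementation; `buildIndexTable` and `zoomRowWithTable` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Precompute, for every destination x, the source x that it samples (nearest neighbor).
inline std::vector<std::uint32_t> buildIndexTable(std::uint32_t srcSize, std::uint32_t dstSize) {
    std::vector<std::uint32_t> table(dstSize);
    for (std::uint32_t d = 0; d < dstSize; ++d)
        table[d] = (std::uint32_t)((std::uint64_t)d * srcSize / dstSize); // the only division
    return table;
}

// Fill one destination row from one source row using the precomputed table.
inline void zoomRowWithTable(std::uint32_t* dstLine, const std::uint32_t* srcLine,
                             const std::vector<std::uint32_t>& table) {
    for (std::size_t x = 0; x < table.size(); ++x)
        dstLine[x] = srcLine[table[x]];
}
```

The table is built once per image and reused for every row, so the per-pixel work is reduced to one table lookup and one copy.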

f. To speed up scaling further, the scaling function itself can be generated dynamically according to the scaling ratio, which gives an even faster function.
This is somewhat like how a compiler works; it takes a lot of effort to implement (and the result is rather obscure).
(Dynamic generation is a good idea, but I personally feel it is not worth implementing just for scaling.)
