Image scaling algorithm

Source: Internet
Author: User
This article is reposted from elsewhere. It is very good and worth sharing, but unfortunately the address of the original could not be found.

Abstract: A basic image scaling algorithm is given first, and its speed and scaling quality are then optimized step by step.

The full text of "High-quality, fast image scaling" is divided into:
Part 1: nearest-neighbor interpolation and its speed optimization
Part 2: bilinear interpolation and bicubic convolution interpolation
Part 3: trilinear interpolation and mipmap chains

Body:

To keep the discussion simple, only 32-bit ARGB color is handled here;
The code is written in C++; where assembly optimization is involved, the x86 platform is assumed; the compiler used is VC2005;
To keep the code readable, no exception handling code is included;
The CPUs used for testing are an AMD64x2 4200+ (2.37 GHz) and an Intel Core2 4400 (2.00 GHz);


Speed test description:
Only scaling from memory data to memory data is tested
The test scales an 800*600 picture to 1024*768; fps is the number of frames per second, and a higher value means a faster function
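
The article does not show its timing harness. As a rough sketch of how such an fps figure could be obtained, the hypothetical helper `measure_fps` below (not from the article) repeats one scaling pass until a minimum time has elapsed and reports passes per second. It assumes `std::chrono` is available, which is newer than VC2005.

```cpp
#include <chrono>

// Hypothetical helper (not from the article): calls frame() repeatedly for at
// least min_seconds and returns the average number of calls per second.
template <class Frame>
double measure_fps(Frame frame, double min_seconds = 1.0)
{
    typedef std::chrono::steady_clock clock;
    const clock::time_point start = clock::now();
    long frames = 0;
    double elapsed = 0.0;
    do {
        frame();    // one complete memory-to-memory scaling pass
        ++frames;
        elapsed = std::chrono::duration<double>(clock::now() - start).count();
    } while (elapsed < min_seconds || elapsed <= 0.0);  // never divide by zero
    return frames / elapsed;
}
```

A call such as `measure_fps([&]{ PicZoom0(dst, src); })` would then time one of the functions discussed here.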

////////////////////////////////////////////////////////////////////////////////
Reference speeds of the related Windows GDI functions:
//==============================================================================
BitBlt       544.7 fps  // copy 800*600 to 800*600
BitBlt       331.6 fps  // copy 1024*1024 to 1024*1024
StretchBlt   232.7 fps  // zoom 800*600 to 1024*1024
////////////////////////////////////////////////////////////////////////////////

A: First define the image data structure:
    #define ASM __asm

    typedef unsigned char TUInt8;  // [0..255]
    struct TARGB32                 // 32-bit color
    {
        TUInt8 B,G,R,A;            // A is alpha
    };

    struct TPicRegion  // describes a region of color data, for easy parameter passing
    {
        TARGB32* pdata;       // address of the first color datum
        long     byte_width;  // physical width of one row of data (width in bytes);
                              // abs(byte_width) may be greater than or equal to width*sizeof(TARGB32);
        long     width;       // width in pixels
        long     height;      // height in pixels
    };

A function that accesses a single pixel can then be written as:
    inline TARGB32& Pixels(const TPicRegion& pic,const long x,const long y)
    {
        return ((TARGB32*)((TUInt8*)pic.pdata+pic.byte_width*y))[x];
    }
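
The structure leaves allocation to the caller. The sketch below shows one way a region could be set up over a `std::vector`, and what a negative `byte_width` would mean. The helpers `make_region` and `make_flipped_region` are hypothetical names, and the bottom-up interpretation of a negative `byte_width` is an assumption drawn from the `abs(byte_width)` remark in the structure's comments.

```cpp
#include <vector>

typedef unsigned char TUInt8;
struct TARGB32 { TUInt8 B, G, R, A; };
struct TPicRegion {
    TARGB32* pdata;
    long     byte_width;
    long     width;
    long     height;
};
inline TARGB32& Pixels(const TPicRegion& pic, const long x, const long y)
{
    return ((TARGB32*)((TUInt8*)pic.pdata + pic.byte_width * y))[x];
}

// Hypothetical helper: describe a tightly packed, top-down buffer
// holding width*height pixels.
inline TPicRegion make_region(std::vector<TARGB32>& buf, long width, long height)
{
    TPicRegion r;
    r.pdata      = buf.data();
    r.byte_width = width * (long)sizeof(TARGB32);
    r.width      = width;
    r.height     = height;
    return r;
}

// Hypothetical helper: the same buffer viewed bottom-up (height >= 1).
// pdata points at the last stored row and byte_width is negative,
// so y=0 addresses the last row in memory.
inline TPicRegion make_flipped_region(std::vector<TARGB32>& buf, long width, long height)
{
    TPicRegion r = make_region(buf, width, height);
    r.pdata      = buf.data() + (height - 1) * width;
    r.byte_width = -r.byte_width;
    return r;
}
```

Because `Pixels` multiplies `byte_width` by `y`, the same accessor works for both layouts without any special case.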
 

B: Scaling principle and formula diagram:

      Scaled picture             Original picture
  (width DW, height DH)       (width SW, height SH)

  (Sx-0)/(SW-0) = (Dx-0)/(DW-0)     (Sy-0)/(SH-0) = (Dy-0)/(DH-0)
  =>  Sx = Dx*SW/DW                 Sy = Dy*SH/DH
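
As a small worked instance of the formula, the hypothetical helper below (the name `map_to_source` is not from the article) applies Sx = Dx*SW/DW along one axis with truncating integer division, which is what nearest-neighbor sampling uses in the code that follows.

```cpp
// Hypothetical helper: nearest-neighbor source coordinate along one axis,
// Sx = Dx*SW/DW, using integer division that truncates toward zero.
inline long map_to_source(long d, long src_size, long dst_size)
{
    return d * src_size / dst_size;
}
```

For the 800 to 1024 test case, destination column 512 maps to source column 400, and the last destination column, 1023, maps to source column 799, the last valid source column.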

C: A reference implementation of the scaling algorithm

Here is the simplest scaling function (the interpolation is nearest-neighbor sampling, and I "tried" to write it slowly :D)
    // src.pdata points to the source data area, dst.pdata points to the destination data area
    // The function scales a picture of size src.width*src.height into an area of size dst.width*dst.height
    void PicZoom0(const TPicRegion& dst,const TPicRegion& src)
    {
        if ( (0==dst.width)||(0==dst.height)
           ||(0==src.width)||(0==src.height)) return;
        for (long x=0;x<dst.width;++x)
        {
            for (long y=0;y<dst.height;++y)
            {
                long srcx=(x*src.width/dst.width);
                long srcy=(y*src.height/dst.height);
                Pixels(dst,x,y)=Pixels(src,srcx,srcy);
            }
        }
    }

////////////////////////////////////////////////////////////////////////////////
Speed test:
//==============================================================================
PicZoom0 19.4 fps
////////////////////////////////////////////////////////////////////////////////


D: Optimizing the PicZoom0 function

a. PicZoom0 does not read and write the color data in the order in which it is laid out in memory (the inner loop increments the row index y), so CPU cache prefetching fails and memory access thrashes, causing a significant performance loss. (Much hardware shares this trait, including caches, main memory, video memory, and hard disks: sequential access is optimized, and random access can cause a huge performance loss.)
So first swap the order of the x and y loops:
    void PicZoom1(const TPicRegion& dst,const TPicRegion& src)
    {
        if ( (0==dst.width)||(0==dst.height)
           ||(0==src.width)||(0==src.height)) return;
        for (long y=0;y<dst.height;++y)
        {
            for (long x=0;x<dst.width;++x)
            {
                long srcx=(x*src.width/dst.width);
                long srcy=(y*src.height/dst.height);
                Pixels(dst,x,y)=Pixels(src,srcx,srcy);
            }
        }
    }

////////////////////////////////////////////////////////////////////////////////
Speed test:
//==============================================================================
PicZoom1 30.1 fps
////////////////////////////////////////////////////////////////////////////////
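
The effect of loop order can be tried in isolation from the scaling code. The sketch below, with hypothetical function names, reads the same buffer in both orders. Both functions return the same sum; only the memory access pattern differs. No timings are claimed here, since they depend on the hardware.

```cpp
#include <cstddef>
#include <vector>

// x in the inner loop: consecutive steps touch adjacent elements in memory.
inline unsigned long sum_row_major(const std::vector<unsigned>& buf,
                                   std::size_t width, std::size_t height)
{
    unsigned long sum = 0;
    for (std::size_t y = 0; y < height; ++y)
        for (std::size_t x = 0; x < width; ++x)
            sum += buf[y * width + x];
    return sum;
}

// y in the inner loop (the PicZoom0 order): every step jumps a whole row.
inline unsigned long sum_column_major(const std::vector<unsigned>& buf,
                                      std::size_t width, std::size_t height)
{
    unsigned long sum = 0;
    for (std::size_t x = 0; x < width; ++x)
        for (std::size_t y = 0; y < height; ++y)
            sum += buf[y * width + x];
    return sum;
}
```

Timing the two functions on a buffer much larger than the CPU cache is one way to observe the difference that the PicZoom0 and PicZoom1 figures reflect.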

b. The expression "(x*src.width/dst.width)" contains a division, which is a very slow operation (dozens of times slower than ordinary addition and subtraction!). Optimize it with fixed-point arithmetic:
    void PicZoom2(const TPicRegion& dst,const TPicRegion& src)
    {
        if ( (0==dst.width)||(0==dst.height)
           ||(0==src.width)||(0==src.height)) return;
        // the maximum picture size the function can handle is 65536*65536
        unsigned long xrintfloat_16=(src.width<<16)/dst.width+1;   // 16.16 fixed-point number
        unsigned long yrintfloat_16=(src.height<<16)/dst.height+1; // 16.16 fixed-point number
        // it can be proven that (dst.width-1)*xrintfloat_16 < src.width holds
        // (with xrintfloat_16 read as a 16.16 value)
        for (unsigned long y=0;y<dst.height;++y)
        {
            for (unsigned long x=0;x<dst.width;++x)
            {
                unsigned long srcx=(x*xrintfloat_16)>>16;
                unsigned long srcy=(y*yrintfloat_16)>>16;
                Pixels(dst,x,y)=Pixels(src,srcx,srcy);
            }
        }
    }

////////////////////////////////////////////////////////////////////////////////
Speed test:
//==============================================================================
PicZoom2 185.8 fps
////////////////////////////////////////////////////////////////////////////////
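
To see what the fixed-point step does for a concrete size, the sketch below mirrors the arithmetic of PicZoom2 along one axis and counts the destination coordinates whose fixed-point index differs from the division-based one. The helper names are hypothetical, and `std::uint32_t` is used on the assumption that `unsigned long` is 32 bits wide, as it is under VC2005.

```cpp
#include <cstdint>

// Hypothetical helper: the 16.16 fixed-point step used by PicZoom2.
inline std::uint32_t fixed_step_16(std::uint32_t src_size, std::uint32_t dst_size)
{
    return (src_size << 16) / dst_size + 1;
}

// Hypothetical helper: source index for destination coordinate d.
inline std::uint32_t fixed_index(std::uint32_t d, std::uint32_t step_16)
{
    return (d * step_16) >> 16;
}

// Counts destination coordinates for which the fixed-point index differs
// from the division-based index d*src_size/dst_size.
inline long count_mismatches(std::uint32_t src_size, std::uint32_t dst_size)
{
    const std::uint32_t step_16 = fixed_step_16(src_size, dst_size);
    long mismatches = 0;
    for (std::uint32_t d = 0; d < dst_size; ++d)
        if (fixed_index(d, step_16) != d * src_size / dst_size)
            ++mismatches;
    return mismatches;
}
```

For the 800 to 1024 case the step is 51201, the last destination column maps to source column 799, and no coordinate differs from the division result. With 32-bit arithmetic the product `d * step_16` must stay below 2^32, which is where the size limit noted in the code comes from.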

c. Inside the x loop, y is constant, so the values that depend on y can be computed in advance:
1. the value of srcy does not depend on x, so it can be moved out of the x loop;
2. expand the Pixels function and hoist the pointer calculation that depends on y;
    void PicZoom3(const TPicRegion& dst,const TPicRegion& src)
    {
        if ( (0==dst.width)||(0==dst.height)
           ||(0==src.width)||(0==src.height)) return;
        unsigned long xrintfloat_16=(src.width<<16)/dst.width+1;
        unsigned long yrintfloat_16=(src.height<<16)/dst.height+1;
        unsigned long dst_width=dst.width;
        TARGB32* pdstline=dst.pdata;
        unsigned long srcy_16=0;
        for (unsigned long y=0;y<dst.height;++y)
        {
            // the source row pointer depends only on y, so compute it once per row
            TARGB32* psrcline=((TARGB32*)((TUInt8*)src.pdata+src.byte_width*(srcy_16>>16)));
            unsigned long srcx_16=0;
            for (unsigned long x=0;x<dst_width;++x)
            {
                pdstline[x]=psrcline[srcx_16>>16];
                srcx_16+=xrintfloat_16;
            }
            srcy_16+=yrintfloat_16;
            ((TUInt8*&)pdstline)+=dst.byte_width;
        }
    }

////////////////////////////////////////////////////////////////////////////////
Speed test:
//==============================================================================
PicZoom3 414.4 fps
////////////////////////////////////////////////////////////////////////////////
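
The loop structure of PicZoom3 can also be tried outside the VC2005/x86 setting. The sketch below is a hypothetical portable variant, not the author's code: pixels are packed 32-bit values in row-major `std::vector` buffers with no row padding, and the function name `zoom_nearest_16_16` is made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical portable variant of the PicZoom3 loop structure.
// Assumes src.size() == src_w*src_h and that both source sizes are below 65536.
inline std::vector<std::uint32_t> zoom_nearest_16_16(
    const std::vector<std::uint32_t>& src, std::uint32_t src_w, std::uint32_t src_h,
    std::uint32_t dst_w, std::uint32_t dst_h)
{
    std::vector<std::uint32_t> dst(static_cast<std::size_t>(dst_w) * dst_h);
    if (src_w == 0 || src_h == 0 || dst_w == 0 || dst_h == 0) return dst;
    const std::uint32_t x_step_16 = (src_w << 16) / dst_w + 1;  // 16.16 step
    const std::uint32_t y_step_16 = (src_h << 16) / dst_h + 1;  // 16.16 step
    std::uint32_t src_y_16 = 0;
    for (std::uint32_t y = 0; y < dst_h; ++y) {
        // Row pointers depend only on y, so they are computed once per row.
        const std::uint32_t* src_line =
            &src[static_cast<std::size_t>(src_y_16 >> 16) * src_w];
        std::uint32_t* dst_line = &dst[static_cast<std::size_t>(y) * dst_w];
        std::uint32_t src_x_16 = 0;
        for (std::uint32_t x = 0; x < dst_w; ++x) {
            dst_line[x] = src_line[src_x_16 >> 16];
            src_x_16 += x_step_16;  // an addition replaces the multiplication
        }
        src_y_16 += y_step_16;
    }
    return dst;
}
```

Scaling a 2*2 picture to 4*4 with this sketch duplicates every source pixel into a 2*2 block, which is the expected nearest-neighbor result.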

d. The fixed-point optimization places some limits on the maximum picture size the function can handle and slightly affects the scaling result (the error is imperceptible to the naked eye). Here is a version that uses floating-point arithmetic, for use where this matters:
    void PicZoom3_float(const TPicRegion& dst,const TPicRegion& src)
    {
        // note: this function requires FPU support
        if ( (0==dst.width)||(0==dst.height)
           ||(0==src.width)||(0==src.height)) return;
        double xrfloat=1.000000001/((double)dst.width/src.width);
        double yrfloat=1.000000001/((double)dst.height/src.height);

        unsigned short rc_old;
        unsigned short rc_edit;
        ASM  // set the FPU rounding mode so that the fist floating-point instruction can be used directly
        {
            FNSTCW  rc_old            // save the coprocessor control word, used to restore it later
            FNSTCW  rc_edit           // save the coprocessor control word, used to modify it
            FWAIT
            OR      rc_edit, 0x0F00   // set RC=11 so that the FPU rounds toward zero
            FLDCW   rc_edit           // load the coprocessor control word; the RC field has been modified
        }

        unsigned long dst_width=dst.width;
        TARGB32* pdstline=dst.pdata;
        double srcy=0;
        for (unsigned long y=0;y<dst.height;++y)
        {
            TARGB32* psrcline=((TARGB32*)((TUInt8*)src.pdata+src.byte_width*((long)srcy)));
            /*
            double srcx=0;
            for (unsigned long x=0;x<dst_width;++x)
            {
                pdstline[x]=psrcline[(unsigned long)srcx]; // the default float-to-integer conversion is a very slow
                                                           // operation! That is why inline assembly that drives the FPU directly is used
                srcx+=xrfloat;
            }*/
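
The source listing is cut off at this point. As a portable point of comparison for the commented-out loop above, the sketch below steps a `double` accumulator with the same 1.000000001 factor and converts it with `static_cast`, which truncates toward zero by language rule. The helper name is hypothetical. On targets with SSE2, compilers commonly emit a single truncating conversion instruction for this cast, so the control-word change matters mainly for x87 code generation such as the author's setup.

```cpp
// Hypothetical helper: counts destination coordinates for which stepping a
// double accumulator (as in the floating-point variant) selects a different
// source index than the integer formula d*src_size/dst_size.
inline long count_float_mismatches(long src_size, long dst_size)
{
    const double step = 1.000000001 / ((double)dst_size / src_size);
    double pos = 0;
    long mismatches = 0;
    for (long d = 0; d < dst_size; ++d) {
        const long index = static_cast<long>(pos);  // truncates toward zero
        if (index != d * src_size / dst_size) ++mismatches;
        pos += step;
    }
    return mismatches;
}
```

For the 800 to 1024 case this reports no mismatches: the small upward bias from the 1.000000001 factor keeps exact positions such as 25.0 from landing just below the integer because of rounding error.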
