Implementation of kdrive's xvideo Acceleration


I tried to post this to my it168 blog just now, but it has a 10,000-word limit and the post would not go through, so it had to be split into two parts. I still think this is good, so I am posting the article here.

Kdrive's xvideo acceleration implementation.

First of all, xvideo is implemented in the driver, so in practice we have to write a driver first.
Implementing a kdrive hardware-acceleration driver really means implementing KAA. KAA is split between the xserver side and the driver side. The xserver side provides a hook mechanism:
during GC operations it checks whether a callback has been registered; if one has, the callback is used, otherwise the software implementation is used.
Registering those callbacks is the job of the kdrive driver. Because kdrive cannot load drivers dynamically, drivers are compiled directly into the xserver;
in other words, the resulting Xfbdev binary carries the driver with it.
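To make that concrete, here is a minimal sketch of the driver side of KAA, modeled on the ATI driver's draw-init path. The structure and function names (KaaScreenInfoRec, kaaDrawInit, PrepareSolid/Solid/DoneSolid) are written from my memory of hw/kdrive/src/kaa.h, so treat the exact fields as approximate and check that header:

/* Hypothetical KAA driver skeleton -- names from memory of kaa.h,
 * verify against hw/kdrive/src/kaa.h in your tree. */
#include "kaa.h"

static Bool
MyPrepareSolid(PixmapPtr pPix, int alu, Pixel planemask, Pixel fg)
{
    /* Program the 2D engine for a solid fill; returning FALSE makes
     * the server fall back to the software path. */
    return FALSE;
}

static void
MySolid(PixmapPtr pPix, int x1, int y1, int x2, int y2)
{
    /* Emit the fill rectangle to the hardware. */
}

static void
MyDoneSolid(PixmapPtr pPix)
{
    /* Flush or idle the engine. */
}

static KaaScreenInfoRec myKaa = {
    .PrepareSolid = MyPrepareSolid,
    .Solid        = MySolid,
    .DoneSolid    = MyDoneSolid,
    /* PrepareCopy/Copy/DoneCopy would be filled in the same way. */
};

Bool
MyDrawInit(ScreenPtr pScreen)
{
    /* kaaDrawInit() wraps the GC ops so accelerated paths are tried
     * first, with software rendering as the fallback. */
    return kaaDrawInit(pScreen, &myKaa);
}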

Let's first look at the code layout. This was originally done on xserver 1.5.x, so I'll use 1.5.3 here.
As mentioned before regarding the relationship between kdrive and xserver: kdrive is one kind of xserver. The Xorg code tree contains several xserver implementations, but the common code is shared;
basic event handling and the clipping code, for example, are all shared. The hw directory holds the different xserver implementations, and Xorg is only one of them. Since kdrive and Xorg share so much
code, and the Xorg side moves quickly while kdrive moves slowly, quite a few things in kdrive have become buggy as a result.

We really ought to cover the KAA implementation first, but since the xvideo part is relatively simple, let's start with that.

Let's first look at the xvideo workflow, that is, how xvideo works.

Xvideo is designed to push video frames to the screen as directly as possible: as little copying as possible, and as much hardware acceleration as possible, with Xorg managing the final output (that is the current situation).
That requires an extension in the xserver, and this is the XV extension.

(An image belongs here: the data flow from a video file to the screen.)

As the picture would show, the most primitive data is just a video file; let's assume it is an MPEG4 file. After demuxing there are separate streams: audio, video, and others.
Here we only care about the video. The video stream gets decoded into frame-by-frame images, just like playing a movie, and then each image has to be displayed
on the screen. The final output functions are generally XPutImage, XvPutImage, XvShmPutImage and functions like them.
Of course, using the SHM variants is much faster than not using them. XShm is another xserver extension, and its main purpose is to reduce data copying.
A well-written player checks whether the xserver supports the SHM extension and, if so, uses the SHM functions; mplayer and gstreamer (xvimagesink) both do this.
The xserver framework is client/server, which is easy enough to state, but many people still don't see why. It comes from the original architecture:
X was designed so it could be used over the network. We rarely use that mode nowadays, except for things like ssh -X abc@192.168.1.100;
on a local machine we generally go through a UNIX socket. Either way, every request actually travels the same path:

when the application calls XPutImage, that function is actually provided by libX11 (the XShm functions come from libXext, and XvShmPutImage from libXv, if I remember correctly).
The library code packs the call into an X request and sends it down the wire. The xserver then dispatches these messages to the corresponding extension or core handler, because each extension, when initialized,
tells the xserver that messages of its type are to be handled by that extension.
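For example, this is roughly how the MIT-SHM extension hooks itself into dispatch in Xext/shm.c. I am paraphrasing ShmExtensionInit from memory of the 1.5.x tree, so check the real file for the exact arguments:

extEntry = AddExtension(SHMNAME, ShmNumberEvents, ShmNumberErrors,
                        ProcShmDispatch,    /* requests in native byte order */
                        SProcShmDispatch,   /* requests from swapped clients  */
                        ShmResetProc, StandardMinorOpcode);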

Now think about it: pushing a large frame of data through a socket is slow, and that is why XShm exists. The SHM here is the usual shared memory: the user program and
the xserver share one memory region. When the user program fills in a frame, it is effectively writing straight into the xserver's address space (another image would help here), and the xserver can display it directly, which saves a great deal of copying.
The XImage structure has a data pointer; in the SHM version that pointer refers into the shared segment. I am writing this from memory, so some details may be off.
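Here is a minimal client-side sketch of that SHM path, essentially what mplayer or xvimagesink ends up doing. Error handling and cleanup (XShmDetach/shmdt) are omitted, and dpy, win, gc and the Xv port are assumed to have been set up already (e.g. via XvQueryAdaptors/XvGrabPort):

/* Minimal client-side sketch of the XShm + Xv output path. */
#include <sys/ipc.h>
#include <sys/shm.h>
#include <X11/Xlib.h>
#include <X11/extensions/XShm.h>
#include <X11/extensions/Xvlib.h>

#define FOURCC_YV12 0x32315659  /* 'YV12' packed little-endian */

void put_frame(Display *dpy, Window win, GC gc, XvPortID port,
               int w, int h)
{
    XShmSegmentInfo shminfo;
    XvImage *im;

    if (!XShmQueryExtension(dpy))
        return;                      /* fall back to plain XvPutImage */

    im = XvShmCreateImage(dpy, port, FOURCC_YV12, NULL, w, h, &shminfo);
    shminfo.shmid = shmget(IPC_PRIVATE, im->data_size, IPC_CREAT | 0777);
    shminfo.shmaddr = im->data = shmat(shminfo.shmid, NULL, 0);
    shminfo.readOnly = False;
    XShmAttach(dpy, &shminfo);       /* server maps the same segment */

    /* ... the decoder writes the YV12 frame directly into im->data ... */

    XvShmPutImage(dpy, port, win, gc, im,
                  0, 0, w, h,        /* source rectangle      */
                  0, 0, w, h,        /* destination rectangle */
                  False);
    XSync(dpy, False);
}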

In other words, here is the corresponding processing on the xserver side.
The xserver receives the XShmPutImage request; this is implemented in xorg-server-1.5.3/Xext/shm.c:

static int
SProcShmPutImage(client)
    ClientPtr client;
{
    register int n;
    REQUEST(xShmPutImageReq);
    swaps(&stuff->length, n);
    REQUEST_SIZE_MATCH(xShmPutImageReq);
    swapl(&stuff->drawable, n);
    swapl(&stuff->gc, n);
    swaps(&stuff->totalWidth, n);
    swaps(&stuff->totalHeight, n);
    swaps(&stuff->srcX, n);
    swaps(&stuff->srcY, n);
    swaps(&stuff->srcWidth, n);
    swaps(&stuff->srcHeight, n);
    swaps(&stuff->dstX, n);
    swaps(&stuff->dstY, n);
    swapl(&stuff->shmseg, n);
    swapl(&stuff->offset, n);
    return ProcShmPutImage(client);
}

ProcShmPutImage then ends up calling the PutImage callback registered on the GC:
(*pGC->ops->PutImage)(pDraw, pGC, stuff->depth,
                      stuff->dstX, stuff->dstY,
                      stuff->totalWidth, stuff->srcHeight,
                      stuff->srcX, stuff->format,
                      shmdesc->addr + stuff->offset +
                      (stuff->srcY * length));

For Xv itself, start with the dispatch code in Xext/xvdisp.c:

#define XVCALL(name) Xv##name

ProcXvPutImage is the dispatch entry, and its final call is:
return XVCALL(diPutImage)(client, pDraw, pPort, pGC,
                          stuff->src_x, stuff->src_y,
                          stuff->src_w, stuff->src_h,
                          stuff->drw_x, stuff->drw_y,
                          stuff->drw_w, stuff->drw_h,
                          pImage, (unsigned char *)(&stuff[1]), FALSE,
                          stuff->width, stuff->height);

which, expanded, is the function XvdiPutImage.
Xext/xvmain.c is the device-independent implementation of xvideo; Xorg and kdrive actually share the code here.
XvdiPutImage in turn calls the following callback:
return (*pPort->pAdaptor->ddPutImage)(client, pDraw, pPort, pGC,
                                      src_x, src_y, src_w, src_h,
                                      drw_x, drw_y, drw_w, drw_h,
                                      image, data, sync, width, height);

ddPutImage is what we need to implement.
Under the kdrive framework this registration is actually done by KAA's kxv layer;
the concrete implementation, i.e. our driver, then registers its own functions with kxv.

All the kdrive code lives under xorg-server-1.5.3/hw/kdrive:
ailantian@VAX:/mnt/sdb1/ubd/soft/xorg/temp/xorg-server-1.5.3/hw/kdrive$ ls
ati    ephyr  fake   i810   mach64       Makefile.in  neomagic  pm2   sdl     smi  vesa
chips  epson  fbdev  linux  Makefile.am  mga          nvidia    r128  sis300  src
ailantian@VAX:/mnt/sdb1/ubd/soft/xorg/temp/xorg-server-1.5.3/hw/kdrive$ pwd
/mnt/sdb1/ubd/soft/xorg/temp/xorg-server-1.5.3/hw/kdrive
Each directory here is basically one driver: i810 for the Intel 810, ati, and many others.
The KAA framework itself is implemented in hw/kdrive/src/kaa.c, and the xvideo framework in kxv.c in the same directory.

Through a certain amount of plumbing, our driver ends up going through the following function:
KdXVInitAdaptors

pa->pScreen = pScreen;
pa->ddAllocatePort = KdXVAllocatePort;
pa->ddFreePort = KdXVFreePort;
pa->ddPutVideo = KdXVPutVideo;
pa->ddPutStill = KdXVPutStill;
pa->ddGetVideo = KdXVGetVideo;
pa->ddGetStill = KdXVGetStill;
pa->ddStopVideo = KdXVStopVideo;
pa->ddPutImage = KdXVPutImage;
pa->ddSetPortAttribute = KdXVSetPortAttribute;
pa->ddGetPortAttribute = KdXVGetPortAttribute;
pa->ddQueryBestSize = KdXVQueryBestSize;
pa->ddQueryImageAttributes = KdXVQueryImageAttributes;

Of these we really only care about:

pa->ddPutImage = KdXVPutImage;

This is the function the xserver's xvideo extension ends up calling.
In other words, kxv is itself a framework, and somebody has to fill in the callbacks that KdXVPutImage forwards to. That is exactly what we do in the KAA driver.

In practice you can crib directly from i810 and the others: copy a directory, use it as the basis of your own driver development, and fix up the makefiles.
I'll explain using ATI as the example; my own code is messy and may have other problems, so it is not convenient to post.
By convention xvideo is implemented in xxx_video.c; for ATI that is ati_video.c.
The amount of code is actually quite small.
The output entry point is again a single function,
ATIPutImage, which typically calls something like a DisplayVideo helper, but that is just developer habit.
Now let's look at the framework.
Sticking with ATI: where does callback registration start?
Grep the code and you will find it:
ATISetupImageVideo

ATIInitVideo calls ATISetupImageVideo.
Which callbacks you register depends on your needs.
The key code follows; the whole so-called framework is right here.
static KdVideoAdaptorPtr
ATISetupImageVideo(ScreenPtr pScreen)
{
    KdScreenPriv(pScreen);
    ATIScreenInfo(pScreenPriv);
    KdVideoAdaptorPtr adapt;
    ATIPortPrivPtr pPortPriv;
    int i;

    atis->num_texture_ports = 16;

    adapt = xcalloc(1, sizeof(KdVideoAdaptorRec) + atis->num_texture_ports *
                    (sizeof(ATIPortPrivRec) + sizeof(DevUnion)));
    if (adapt == NULL)
        return NULL;

    adapt->type = XvWindowMask | XvInputMask | XvImageMask;
    adapt->flags = VIDEO_CLIP_TO_VIEWPORT;
    adapt->name = "ATI Texture Video";
    adapt->nEncodings = 1;
    adapt->pEncodings = DummyEncoding;
    adapt->nFormats = NUM_FORMATS;
    adapt->pFormats = Formats;
    adapt->nPorts = atis->num_texture_ports;
    adapt->pPortPrivates = (DevUnion *)(&adapt[1]);

    pPortPriv =
        (ATIPortPrivPtr)(&adapt->pPortPrivates[atis->num_texture_ports]);

    for (i = 0; i < atis->num_texture_ports; i++)
        adapt->pPortPrivates[i].ptr = &pPortPriv[i];

    adapt->nAttributes = NUM_ATTRIBUTES;
    adapt->pAttributes = Attributes;
    adapt->pImages = Images;
    adapt->nImages = NUM_IMAGES;
    adapt->PutVideo = NULL;
    adapt->PutStill = NULL;
    adapt->GetVideo = NULL;
    adapt->GetStill = NULL;
    adapt->StopVideo = ATIStopVideo;
    adapt->SetPortAttribute = ATISetPortAttribute;
    adapt->GetPortAttribute = ATIGetPortAttribute;
    adapt->QueryBestSize = ATIQueryBestSize;
    adapt->PutImage = ATIPutImage;
    adapt->ReputImage = ATIReputImage;
    adapt->QueryImageAttributes = ATIQueryImageAttributes;

    /* gotta uninit this someplace */
    REGION_INIT(pScreen, &pPortPriv->clip, NullBox, 0);

    atis->pAdaptor = adapt;

    xvBrightness = MAKE_ATOM("XV_BRIGHTNESS");
    xvSaturation = MAKE_ATOM("XV_SATURATION");

    return adapt;
}

The most important line above is adapt->PutImage = ATIPutImage;
that is the function that ultimately gets called.
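The adaptor still has to be handed over to the kxv layer. Roughly, ATIInitVideo does the following; this is paraphrased from memory of ati_video.c (the real function also merges in generic adaptors), so treat it as a sketch:

Bool
ATIInitVideo(ScreenPtr pScreen)
{
    KdVideoAdaptorPtr adaptor;

    adaptor = ATISetupImageVideo(pScreen);
    if (!adaptor)
        return FALSE;

    /* Hand our adaptor to kxv. KdXVScreenInit() calls
     * KdXVInitAdaptors(), which copies our callbacks across. */
    return KdXVScreenInit(pScreen, &adaptor, 1);
}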

You may ask how this adapt connects to the KdXVPutImage we saw above. The link is made along
KdXVScreenInit -> KdXVInitAdaptors:

adaptorPriv->flags = adaptorPtr->flags;
adaptorPriv->PutVideo = adaptorPtr->PutVideo;
adaptorPriv->PutStill = adaptorPtr->PutStill;
adaptorPriv->GetVideo = adaptorPtr->GetVideo;
adaptorPriv->GetStill = adaptorPtr->GetStill;
adaptorPriv->StopVideo = adaptorPtr->StopVideo;
adaptorPriv->SetPortAttribute = adaptorPtr->SetPortAttribute;
adaptorPriv->GetPortAttribute = adaptorPtr->GetPortAttribute;
adaptorPriv->QueryBestSize = adaptorPtr->QueryBestSize;
adaptorPriv->QueryImageAttributes = adaptorPtr->QueryImageAttributes;
adaptorPriv->PutImage = adaptorPtr->PutImage;
adaptorPriv->ReputImage = adaptorPtr->ReputImage;

portPriv->AdaptorRec = adaptorPriv;

The function pointers end up in AdaptorRec, which the KdXVxxxx functions call through.
For example,
KdXVPutImage
invokes the registered callback when it is time to output:
ret = (*portPriv->AdaptorRec->PutImage)(portPriv->screen, pDraw,
          src_x, src_y, WinBox.x1, WinBox.y1,
          src_w, src_h, drw_w, drw_h, format->id, data, width, height,
          sync, &clipRegion, portPriv->DevPriv.ptr);

That callback is exactly the one registered in ATISetupImageVideo; after a series of hops, control arrives here.

Back to the main topic: let's look only at the most important part, how the hardware accelerates the output. When reading the code, grepping for registers or MMIO will definitely find it, because the old framework uses MMIO.

ATIPutImage also contains XRandR handling. We don't need to worry about it: it exists mainly for screen rotation, which has to be handled, but if we never rotate we can ignore that code.

The most critical code is below. The decoded image sits in system memory, but we actually have to display it out of video memory. I won't go into KAA memory management here; it strays from the topic.
if (pPortPriv->off_screen == NULL) {
    pPortPriv->off_screen = KdOffscreenAlloc(screen->pScreen,
        size * 2, 64, TRUE, ATIVideoSave, pPortPriv);
    if (pPortPriv->off_screen == NULL)
        return BadAlloc;
}
So we allocate an area of video memory to hold the image.
Next comes the pixmap checking;
in short, we must ensure the destination is in video memory, offscreen or not.
Then comes the following code:
switch (id) {
case FOURCC_YV12:
case FOURCC_I420:
    top &= ~1;
    nlines = ((((rot_y2 + 0xffff) >> 16) + 1) & ~1) - top;
    KdXVCopyPlanarData(screen, buf, pPortPriv->src_addr, randr,
        srcPitch, srcPitch2, dstPitch, rot_src_w, rot_src_h,
        height, top, left, nlines, npixels, id);
    break;
case FOURCC_UYVY:
case FOURCC_YUY2:
default:
    nlines = ((rot_y2 + 0xffff) >> 16) - top;
    KdXVCopyPackedData(screen, buf, pPortPriv->src_addr, randr,
        srcPitch, dstPitch, rot_src_w, rot_src_h, top, left,
        nlines, npixels);
    break;
}

This code converts the data format. The set of formats a video card can accelerate is limited, and if the incoming format is not supported by the hardware, we need a software
format conversion. The xserver is a little rigid here: it only supports two formats for the final output, so all formats get converted to one of those two; if I remember correctly
they are yvyu and vyuy. That is, the current xserver framework packs the data into these two formats and then sends it to the hardware. Of course the driver implements this itself, and hardware generally supports many formats.
You can also see in the code above the four formats accepted from the user application. That is because our driver registers only four formats, i.e. we only support these four.
A player like mplayer can query the formats the current driver supports with a single function call, and you can do the same with the xvinfo tool.
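On the client side that query looks like this; a small sketch using libXv (dpy and port assumed to be set up), which is essentially what xvinfo prints:

#include <stdio.h>
#include <X11/Xlib.h>
#include <X11/extensions/Xvlib.h>

void list_formats(Display *dpy, XvPortID port)
{
    int i, num;
    XvImageFormatValues *fmts = XvListImageFormats(dpy, port, &num);

    /* Each id is a FOURCC packed little-endian into an int. */
    for (i = 0; i < num; i++)
        printf("%c%c%c%c\n",
               fmts[i].id & 0xff, (fmts[i].id >> 8) & 0xff,
               (fmts[i].id >> 16) & 0xff, (fmts[i].id >> 24) & 0xff);
    XFree(fmts);
}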
The format registration happens here. If your hardware supports more, such as NV12 and NV21 (embedded processors with 2D engines generally seem to), you add them to this table:
static KdImageRec Images[NUM_IMAGES] =
{
    XVIMAGE_YUY2,
    XVIMAGE_YV12,
    XVIMAGE_I420,
    XVIMAGE_UYVY
};

By default the xserver defines only these four FOURCCs. If you want more, you can copy the definitions from other drivers, or from mplayer or pixman, because many formats were never standardized.
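For intuition, here is roughly what the planar-to-packed repack amounts to. This is my own simplified sketch, not the actual KdXVCopyPlanarData, which also deals with rotation, odd offsets and pitch alignment:

/* Simplified sketch: repack a planar YV12 frame (Y plane, then V,
 * then U at quarter size) into packed YUY2 (Y0 U Y1 V per pair of
 * pixels) for the hardware. Assumes even width/height and no pitch
 * padding. */
static void
yv12_to_yuy2(const unsigned char *src, unsigned char *dst,
             int width, int height)
{
    const unsigned char *y = src;
    const unsigned char *v = src + width * height;
    const unsigned char *u = v + (width / 2) * (height / 2);
    int row, col;

    for (row = 0; row < height; row++) {
        for (col = 0; col < width; col += 2) {
            /* One YUY2 macropixel (4 bytes) covers two pixels. */
            *dst++ = y[row * width + col];
            *dst++ = u[(row / 2) * (width / 2) + col / 2];
            *dst++ = y[row * width + col + 1];
            *dst++ = v[(row / 2) * (width / 2) + col / 2];
        }
    }
}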

Once the data is packed, it is sent to the hardware for display.

This part is relatively simple: we have a buffer that is already in video memory, and we want to show it somewhere on the screen.
Mind the difference between offscreen and onscreen memory. If a card has, say, 500 MB of video memory, only a small part of that space corresponds to the screen.
In general the card operates on its video memory quickly; there are all kinds of implementations and every hardware family differs, but it is faster than system memory, or at worst about the same, and bulk operations are faster still.
Moving from offscreen to onscreen, even on an embedded system where "video memory" is just a carved-out block of ordinary RAM, generally needs no CPU involvement.

Okay, let's look at the display step.
A driver usually writes one function to handle this.
There are really only two things to pay attention to: clip handling and composite handling.
Although xvideo's design is somewhat at odds with composite, xvideo can output directly to a texture or to a pixmap; either way, it is all just video memory.
Take R128DisplayVideo as the implementation to read:
BoxPtr pBox = REGION_RECTS(&pPortPriv->clip);
int nBox = REGION_NUM_RECTS(&pPortPriv->clip);

Here the clip region comes in. For example, while playing a video a menu may cover part of the window, cutting our output area into several boxes, two or so;
then we can only output the pieces separately, but it is still hardware accelerated. The loop makes it clear:
while (nBox--) {
    int srcX, srcY, dstX, dstY, srcw, srch, dstw, dsth;

    dstX = pBox->x1 + dstxoff;
    dstY = pBox->y1 + dstyoff;
    dstw = pBox->x2 - pBox->x1;
    dsth = pBox->y2 - pBox->y1;
    srcX = (pBox->x1 - pPortPriv->dst_x1) *
        pPortPriv->src_w / pPortPriv->dst_w;
    srcY = (pBox->y1 - pPortPriv->dst_y1) *
        pPortPriv->src_h / pPortPriv->dst_h;
    srcw = pPortPriv->src_w - srcX;
    srch = pPortPriv->src_h - srcY;

    BEGIN_DMA(6);
    OUT_RING(DMA_PACKET0(R128_REG_SCALE_SRC_HEIGHT_WIDTH, 2));
    OUT_RING_REG(R128_REG_SCALE_SRC_HEIGHT_WIDTH,
        (srch << 16) | srcw);
    OUT_RING_REG(R128_REG_SCALE_OFFSET_0, pPortPriv->src_offset +
        srcY * pPortPriv->src_pitch + srcX * 2);

    OUT_RING(DMA_PACKET0(R128_REG_SCALE_DST_X_Y, 2));
    OUT_RING_REG(R128_REG_SCALE_DST_X_Y, (dstX << 16) | dstY);
    OUT_RING_REG(R128_REG_SCALE_DST_HEIGHT_WIDTH,
        (dsth << 16) | dstw);
    END_DMA();
    pBox++;
}

We won't dig into OUT_RING here; simply put, it is the hardware output path.
The hardware differs, but the purpose is the same: emit each output rectangle, where nBox is the number of regions.
We output the buffer to the specified region at the specified offset, and that is enough.

As for composite, we only need to target the right output area: if the area is not onscreen, we output to the pixmap,
and then composite, i.e. some program such as a window manager or xcompmgr, manages the final output.
DamageDamageRegion(pPortPriv->pDraw, &pPortPriv->clip);

This call is required: for XV framework output we must report that we have written this region, so that anyone tracking it (via Damage) finds out.

The XV framework really touches quite a few extensions:
XV, XComposite, XDamage, XFixes.

That's basically it. The remaining functions are auxiliary; just copy them over.

One more thing to note: the various flags,
VIDEO_CLIP_TO_VIEWPORT and so on. These are set according to the display mode.
Overlay and texture paths differ. Overlay seems rarely used now, but since the hardware does it, it may be faster than texture: a single-plane operation
with hardware alpha blending is much faster, while texture output depends entirely on the video card. Today's cards look strong, yet faced with alpha blending over this much data,
whether in memory size, data transfer, or compute, they still fall short. That is true on PCs and on embedded systems alike.
Think about a common resolution like 1440x900, a fairly small screen, at 24-bit depth: one full-screen buffer is already 1440 x 900 x 4 bytes, about 5 MB, and with composite the demand multiplies, since every window first needs its own
offscreen buffer to render into; only then does the alpha blending start, and it is very slow. I don't know what composite is really buying us; looking cool has its costs.
Alas, typing all this with fcitx is still a bit slow. The above took an hour and a half to write, and there is still plenty more.
I don't know when I'll get around to writing up KAA.
kdrive + KAA + kxv
Xorg + EXA + XV
The two frameworks are similar; Xorg is more extensible and can load drivers dynamically.
Xorg's input auto-detection is also fairly simple; it took evdev only a few days to conquer the world.
Dizzy; the input part is honestly nothing special, I could do it too :)
Apparently I write prose quickly; can one really produce ten thousand words in an hour and a half?

One point I missed: why use xvideo at all?
As mentioned in the previous article, the data decoded from MPEG4 or H.264 is actually YUV, for example YUV422.
But the screen is actually RGB, and it differs per system: a PC is generally RGB888, while embedded systems may use RGB565, so a conversion is required.
This YUV-to-RGB conversion costs a lot of computation, and so does hardware scaling: if the decoded image is VGA or WVGA and is displayed full screen, it has to be scaled, which also drives CPU usage very high.
If I remember correctly it tends to be floating-point work; Cortex-A8 class cores have VFP and similar instructions, but in practice the performance is far from enough. Xvideo, in contrast, is designed for fast output,
so the framework is stripped down and built for fast display.
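To give a sense of the per-pixel cost, here is a BT.601 full-range YUV-to-RGB conversion for one pixel in 16.16 fixed point (my own illustration, not code from the tree). At VGA 640x480 and 30 fps this runs about 9.2 million times per second:

/* BT.601 full-range YUV -> RGB for one pixel, 16.16 fixed point --
 * the per-pixel work the CPU must do when there is no Xv path. */
static void
yuv_to_rgb(int y, int u, int v,
           unsigned char *r, unsigned char *g, unsigned char *b)
{
    int d = u - 128, e = v - 128;
    int rr = y + ((91881 * e) >> 16);              /* 1.402 * e          */
    int gg = y - ((22554 * d + 46802 * e) >> 16);  /* 0.344*d + 0.714*e  */
    int bb = y + ((116130 * d) >> 16);             /* 1.772 * d          */

    *r = rr < 0 ? 0 : rr > 255 ? 255 : rr;
    *g = gg < 0 ? 0 : gg > 255 ? 255 : gg;
    *b = bb < 0 ? 0 : bb > 255 ? 255 : bb;
}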

Note that profiling matters most. We usually profile to see where the xvideo implementation spends its time; since the hardware does the output, CPU usage is very small.
Later, when I looked at the data with oprofile, the biggest CPU consumer was actually the data-packing code, i.e. where we repack the data handed over by the user program into a hardware-supported format.
That is where assembly optimization pays off, and the improvement is obvious; after all, it is the "hottest" place in the driver.

However, once we migrated to Xorg, the kdrive implementation, including the KAA driver and this xvideo work, was no longer used:
on Xorg a chip vendor was already developing the DRI + EXA driver, so I stopped doing this.
