From: http://linux.chinaunix.net/bbs/thread-1125926-1-1.html
Experience with ARM VFP
Keywords: VFP ARM1136JF-S MCIMX31 GCC Linux
References: <ARM1136JF-S and ARM1136J-S Technical Reference Manual r1p3>, <ARM Architecture Reference Manual>, <VFP11 Vector Floating-point Coprocessor for ARM1136JF-S processor r1p3 Technical Reference Manual>
Preface:
The MCIMX31 is an ARM1136JF-S-based multimedia processor. It is suitable for smart handheld devices such as smartphones, handheld game consoles, and multimedia players. For the detailed specifications of this CPU, refer to the Freescale datasheet.
I have recently learned a few things while working with the VFP on the MCIMX31 and would like to share them.
Debugging environment:
1. MX31 development board from Chengdu Leide Technology Co., Ltd. (http://www.nidetech.com/)
1. Features of VFP
In my view, besides supporting the basic floating-point operations (addition, subtraction, multiplication, division, square root, comparison, and negation), the most distinctive feature of VFP is its vector capability: a single instruction can operate on up to 8 single-precision or 4 double-precision values. For details, see Chapter C5, "VFP addressing modes", of <ARM Architecture Reference Manual>. Let's look at an example program. It was compiled with arm-none-linux-gnueabi-gcc 4.1.2 and run on the MCIMX31 under Linux 2.6.24.5.
#include <unistd.h>
#include <stdio.h>

/* Load s0-s31 from a 32-element array. */
void vfp_regs_load(float arrays[32])
{
    asm volatile ("fldmias %0, {s0-s31}\n"
                  :
                  : "r" (arrays)
                  : "memory");
}

/* Store s0-s31 into a 32-element array. */
void vfp_regs_save(float arrays[32])
{
    asm volatile ("fstmias %0, {s0-s31}"
                  :
                  : "r" (arrays)
                  : "memory");
}

/* Print the 32 values, one bank of 8 registers per row. */
void print_array(float array[32])
{
    int i;
    for (i = 0; i < 32; i++)
    {
        if (i % 8 == 0)
            printf("\n");
        printf("%f ", array[i]);
    }
    printf("\n");
}

int main()
{
    unsigned int fpscr;
    float f1 = 1.0, f2 = 1.0;
    float farrays[32], farrays2[32];
    int i;

    /* Program the vector length and stride fields of FPSCR. */
    fpscr = 0x130000;
    asm volatile ("fmxr fpscr, %0\n"
                  :
                  : "r" (fpscr));
    asm volatile ("fmrx %0, fpscr\n"
                  : "=r" (fpscr));
    vfp_regs_save(farrays2);
    for (i = 0; i < 32; i++)
        farrays[i] = f1 + f2 * (float)i;   /* s0 = 1.0 ... s31 = 32.0 */
    vfp_regs_load(farrays);
    vfp_regs_save(farrays2);

    printf("\n1: ScalarA op ScalarB -> ScalarD");
    vfp_regs_load(farrays);
    asm volatile ("fadds s0, s1, s2");
    vfp_regs_save(farrays2);
    print_array(farrays2);

    printf("\n2: VectorA[?] op ScalarB -> VectorD[?]");
    vfp_regs_load(farrays);
    asm volatile ("fadds s8, s24, s0");
    vfp_regs_save(farrays2);
    print_array(farrays2);

    printf("\n3: VectorA[?] op VectorB[?] -> VectorD[?]");
    vfp_regs_load(farrays);
    asm volatile ("fadds s8, s16, s24");
    vfp_regs_save(farrays2);
    print_array(farrays2);
    return 0;
}
Running result:
1: ScalarA op ScalarB -> ScalarD
5.000000 2.000000 3.000000 4.000000 5.000000 6.000000 7.000000
9.000000 10.000000 11.000000 12.000000 13.000000 14.000000 15.000000
17.000000 18.000000 19.000000 20.000000 21.000000 22.000000 23.000000
25.000000 26.000000 27.000000 28.000000 29.000000 30.000000 31.000000
2: VectorA[?] op ScalarB -> VectorD[?]
1.000000 2.000000 3.000000 4.000000 5.000000 6.000000 7.000000
26.000000 10.000000 28.000000 12.000000 30.000000 14.000000 32.000000
17.000000 18.000000 19.000000 20.000000 21.000000 22.000000 23.000000
25.000000 26.000000 27.000000 28.000000 29.000000 30.000000 31.000000
3: VectorA[?] op VectorB[?] -> VectorD[?]
1.000000 2.000000 3.000000 4.000000 5.000000 6.000000 7.000000
42.000000 10.000000 46.000000 12.000000 50.000000 14.000000 54.000000
17.000000 18.000000 19.000000 20.000000 21.000000 22.000000 23.000000
25.000000 26.000000 27.000000 28.000000 29.000000 30.000000 31.000000
The first case is the simplest: the sum of two floating-point numbers (fadds s0, s1, s2). We can see that s0 (5.00) = s1 (2.00) + s2 (3.00).
The second case adds a vector and a scalar (fadds s8, s24, s0). The result is:
s8 (26.00) = s24 (25.00) + s0 (1.00)
s10 (28.00) = s26 (27.00) + s0 (1.00)
s12 (30.00) = s28 (29.00) + s0 (1.00)
s14 (32.00) = s30 (31.00) + s0 (1.00)
The third case adds two vectors (fadds s8, s16, s24). The result is:
s8 (42.00) = s16 (17.00) + s24 (25.00)
s10 (46.00) = s18 (19.00) + s26 (27.00)
s12 (50.00) = s20 (21.00) + s28 (29.00)
s14 (54.00) = s22 (23.00) + s30 (31.00)
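The register selection seen in these three cases can be modelled in portable C. The following is only a minimal sketch, not a description of the hardware: it assumes the short-vector rules from the ARM Architecture Reference Manual (four banks of eight single-precision registers, bank 0 treated as scalar, register indices wrapping inside their bank), and the names vfp_model_fadds, len, and stride are made up for illustration.

```c
#include <assert.h>

/* Step a register index by `stride`, wrapping inside its bank of 8. */
static int next_in_bank(int reg, int stride)
{
    int bank_base = reg & ~7;
    return bank_base + ((reg - bank_base + stride) & 7);
}

/*
 * Software model of "fadds s<d>, s<n>, s<m>" on a 32-entry register file.
 * len is the vector length (1..8), stride the register stride (1 or 2).
 */
void vfp_model_fadds(float s[32], int d, int n, int m, int len, int stride)
{
    int m_scalar = (m < 8);   /* second operand in bank 0: used as a scalar */
    int i;

    assert(len >= 1 && len <= 8);
    assert(stride == 1 || stride == 2);

    if (d < 8) {              /* destination in bank 0: plain scalar add */
        s[d] = s[n] + s[m];
        return;
    }
    for (i = 0; i < len; i++) {
        s[d] = s[n] + s[m];
        d = next_in_bank(d, stride);
        n = next_in_bank(n, stride);
        if (!m_scalar)
            m = next_in_bank(m, stride);
    }
}
```

With s[i] = i + 1 as in the test program, vfp_model_fadds(s, 8, 24, 0, 4, 2) leaves 26, 28, 30, and 32 in s[8], s[10], s[12], and s[14], which matches the second case above.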
As for why there are four results and why adjacent results are one register apart, see the description of the FPSCR register (the vector length and stride fields) in <VFP11 Vector Floating-point Coprocessor for ARM1136JF-S processor r1p3 Technical Reference Manual>.
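As a small aid when reading that description, the two fields can be extracted from an FPSCR value with plain bit operations. This is a sketch that assumes the field positions given in the ARM Architecture Reference Manual (LEN in bits 18:16, holding the vector length minus one, and STRIDE in bits 21:20). The helper names are hypothetical, and the stride field is returned raw, so check the manual for how each encoding is interpreted.

```c
#include <stdint.h>

/* Vector length encoded in FPSCR.LEN (bits 18:16 hold length - 1). */
unsigned fpscr_vector_len(uint32_t fpscr)
{
    return ((fpscr >> 16) & 0x7u) + 1u;
}

/* Raw value of the FPSCR.STRIDE field (bits 21:20), not decoded further. */
unsigned fpscr_stride_field(uint32_t fpscr)
{
    return (fpscr >> 20) & 0x3u;
}
```

For the value 0x130000 used in the test program, fpscr_vector_len() gives 4, which is why each vector instruction produced four results.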
2. Hardware Support
The ARM1136JF-S implements VFP through two coprocessors, cp10 and cp11. cp10 handles single-precision floating-point operations and cp11 handles double-precision floating-point operations. All VFP instructions are therefore really coprocessor instructions. For example, fadds is actually a CDP instruction and flds is an LDC instruction. In theory, any ARM1136JF-S CPU should be able to support VFP.
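This can be checked against the instruction words that appear in the disassembly listings in this post. The sketch below assumes the generic ARM coprocessor encodings (coprocessor number in bits 11:8; bits 27:24 equal to 1110 for CDP and MCR/MRC; bits 27:25 equal to 110 for LDC/STC, with MCRR/MRRC carved out of that space). The function and enum names are made up for illustration, and the condition field is ignored.

```c
#include <stdint.h>

enum cp_class { CP_NONE, CP_CDP, CP_MCR_MRC, CP_LDC, CP_STC, CP_MCRR_MRRC };

/* Coprocessor number field, bits 11:8 of a coprocessor instruction. */
unsigned cp_number(uint32_t insn)
{
    return (insn >> 8) & 0xfu;
}

/* Classify an ARM-state instruction word by its coprocessor encoding. */
enum cp_class cp_classify(uint32_t insn)
{
    uint32_t op = (insn >> 24) & 0xfu;          /* bits 27:24 */

    if (op == 0xeu)                             /* 1110: CDP or MCR/MRC */
        return (insn & (1u << 4)) ? CP_MCR_MRC : CP_CDP;
    if (((insn >> 21) & 0x7fu) == 0x62u)        /* 1100010: MCRR/MRRC */
        return CP_MCRR_MRRC;
    if ((op & 0xeu) == 0xcu)                    /* 110x: LDC or STC */
        return (insn & (1u << 20)) ? CP_LDC : CP_STC;
    return CP_NONE;
}
```

For example, the fmuls word 0xee677a27 classifies as CDP on coprocessor 10, the flds word 0xed1b7a05 as LDC on coprocessor 10, and the fmrrd word 0xec532b17 as MCRR/MRRC on coprocessor 11. The FPA ldfs word 0xed1b1104 is also an LDC, but on coprocessor 1.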
3. Compiler support for VFP
Whether a floating-point operation is finally translated into VFP instructions, FPA instructions, or soft-float library calls is decided by the compiler. Example:
[sjl@sjl vfp]$ cat f.c
int main()
{
    float f1 = 1.2, f2 = 1.3;
    f1 = f2 * f1;
}
[sjl@sjl vfp]$ arm-linux-gcc -v
....
gcc version 3.4.4
[sjl@sjl vfp]$ arm-linux-gcc -c f.c
[sjl@sjl vfp]$ arm-linux-objdump -d f.o
f.o:     file format elf32-littlearm
Disassembly of section .text:
00000000 <main>:
   0:   e1a0c00d    mov     ip, sp
   4:   e92dd800    stmdb   sp!, {fp, ip, lr, pc}
   8:   e24cb004    sub     fp, ip, #4      ; 0x4
   c:   e24dd008    sub     sp, sp, #8      ; 0x8
  10:   e59f3024    ldr     r3, [pc, #36]   ; 3c <.text+0x3c>
  14:   e50b3010    str     r3, [fp, #-16]
  18:   e59f3020    ldr     r3, [pc, #32]   ; 40 <.text+0x40>
  1c:   e50b3014    str     r3, [fp, #-20]
  20:   ed1b1104    ldfs    f1, [fp, #-16]
  24:   ed1b0105    ldfs    f0, [fp, #-20]
  28:   ee910100    fmls    f0, f1, f0
  2c:   ed0b0104    stfs    f0, [fp, #-16]
  30:   e1a00003    mov     r0, r3
  34:   e24bd00c    sub     sp, fp, #12     ; 0xc
  38:   e89da800    ldmia   sp, {fp, sp, pc}
  3c:   3f99999a    swicc   0x0099999a
  40:   3fa66666    swicc   0x00a66666
By default, arm-linux-gcc 3.4.4 clearly generates FPA instructions (ldfs, fmls, stfs), not VFP instructions.
[sjl@sjl vfp]$ arm-none-linux-gnueabi-gcc -v
....
gcc version 4.1.2
[sjl@sjl vfp]$ arm-none-linux-gnueabi-gcc -c f.c
[sjl@sjl vfp]$ arm-none-linux-gnueabi-objdump -d f.o
f.o:     file format elf32-littlearm
Disassembly of section .text:
00000000 <main>:
   0:   e1a0c00d    mov     ip, sp
   4:   e92dd800    stmdb   sp!, {fp, ip, lr, pc}
   8:   e24cb004    sub     fp, ip, #4      ; 0x4
   c:   e24dd008    sub     sp, sp, #8      ; 0x8
  10:   e59f3024    ldr     r3, [pc, #36]   ; 3c <.text+0x3c>
  14:   e50b3014    str     r3, [fp, #-20]
  18:   e59f3020    ldr     r3, [pc, #32]   ; 40 <.text+0x40>
  1c:   e50b3010    str     r3, [fp, #-16]
  20:   e51b0014    ldr     r0, [fp, #-20]
  24:   e51b1010    ldr     r1, [fp, #-16]
  28:   ebfffffe    bl      0 <__aeabi_fmul>
  2c:   e1a03000    mov     r3, r0
  30:   e50b3014    str     r3, [fp, #-20]
  34:   e24bd00c    sub     sp, fp, #12     ; 0xc
  38:   e89da800    ldmia   sp, {fp, sp, pc}
  3c:   3f99999a    svccc   0x0099999a
  40:   3fa66666    svccc   0x00a66666
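By default, arm-none-linux-gnueabi-gcc 4.1.2 generates soft-float code: the multiplication becomes a call to the library routine __aeabi_fmul, and no VFP instructions are emitted.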
[sjl@sjl vfp]$ arm-none-linux-gnueabi-gcc -mfpu=vfp -mfloat-abi=softfp -c f.c
[sjl@sjl vfp]$ arm-none-linux-gnueabi-objdump -d f.o
f.o:     file format elf32-littlearm
Disassembly of section .text:
00000000 <main>:
   0:   e1a0c00d    mov     ip, sp
   4:   e92dd800    stmdb   sp!, {fp, ip, lr, pc}
   8:   e24cb004    sub     fp, ip, #4      ; 0x4
   c:   e24dd008    sub     sp, sp, #8      ; 0x8
  10:   e59f3020    ldr     r3, [pc, #32]   ; 38 <.text+0x38>
  14:   e50b3014    str     r3, [fp, #-20]
  18:   e59f301c    ldr     r3, [pc, #28]   ; 3c <.text+0x3c>
  1c:   e50b3010    str     r3, [fp, #-16]
  20:   ed1b7a05    flds    s14, [fp, #-20]
  24:   ed5b7a04    flds    s15, [fp, #-16]
  28:   ee677a27    fmuls   s15, s14, s15
  2c:   ed4b7a05    fsts    s15, [fp, #-20]
  30:   e24bd00c    sub     sp, fp, #12     ; 0xc
  34:   e89da800    ldmia   sp, {fp, sp, pc}
  38:   3f99999a    svccc   0x0099999a
  3c:   3fa66666    svccc   0x00a66666
When the -mfpu=vfp -mfloat-abi=softfp options are given, arm-none-linux-gnueabi-gcc 4.1.2 generates VFP instructions.
It seems that VFP is supported only from GCC 4 onward. If anyone wants to use VFP with the older GCC 3, let's work it out together.
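Besides reading the disassembly, the floating-point model can also be inspected from inside a program. The sketch below assumes the predefined macros that GCC provides for ARM targets (__SOFTFP__ and __VFP_FP__; __ARM_PCS_VFP only exists in later GCC releases that support a hard-float ABI). Whether a particular toolchain defines them is best confirmed with its "-dM -E" output, and the function name float_model is made up for illustration.

```c
/* Report which floating-point model the compiler was invoked with. */
const char *float_model(void)
{
#if defined(__SOFTFP__)
    return "soft-float library calls";
#elif defined(__ARM_PCS_VFP)
    return "VFP instructions, hard-float ABI";
#elif defined(__VFP_FP__)
    return "VFP instructions, softfp ABI";
#elif defined(__arm__)
    return "other ARM floating-point model (for example FPA)";
#else
    return "not an ARM target";
#endif
}
```

The order of the checks matters: a soft-float EABI toolchain may define __VFP_FP__ as well, because that macro describes the floating-point number format rather than the instructions used, so __SOFTFP__ is tested first.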
4. Operating system support for VFP
For an application to use VFP instructions, the operating system must cooperate.
Several coprocessor registers in the ARM1136JF-S are relevant to VFP.
The first is the CP15 c1 Coprocessor Access Control Register, which specifies the access permissions of user mode and privileged mode to each coprocessor. To use VFP we must of course allow user mode to access cp10 and cp11.
The other is bit 30 of the VFP FPEXC register, which is the enable bit of the VFP.
In fact, once the operating system has done these two things, user programs can use VFP.
Example:
Build the kernel with VFP support disabled. Then build a kernel driver that contains the following code:
void enable_vfp(void)
{
    unsigned int value;

    /* Read the Coprocessor Access Control Register. */
    asm volatile ("mrc p15, 0, %0, c1, c0, 2"
                  : "=r" (value));
    value |= 0xf00000; /* enable cp10, cp11 user access */
    asm volatile ("mcr p15, 0, %0, c1, c0, 2"
                  :
                  : "r" (value));

    /* Set FPEXC bit 30 to enable the VFP. */
    asm volatile ("fmrx %0, fpexc"
                  : "=r" (value));
    value |= (1 << 30);
    asm volatile ("fmxr fpexc, %0"
                  :
                  : "r" (value));
}
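The two constants used in the driver can be derived rather than hard-coded. The following sketch assumes the register layout described in the ARM1136 manual (a two-bit access field per coprocessor in the Coprocessor Access Control Register, at bits 2n+1:2n for coprocessor n, where 0b11 means full access). The helper names and the FPEXC_EN_BIT macro are hypothetical.

```c
#include <stdint.h>

#define FPEXC_EN_BIT (1u << 30)   /* FPEXC bit 30: VFP enable */

/* Two-bit access field for coprocessor `cp`, set to 0b11 (full access). */
uint32_t cpacr_full_access(unsigned cp)
{
    return 0x3u << (2 * cp);
}

/* Mask that grants user and privileged access to cp10 and cp11. */
uint32_t cpacr_vfp_mask(void)
{
    return cpacr_full_access(10) | cpacr_full_access(11);
}
```

cpacr_vfp_mask() evaluates to 0xf00000, the value OR-ed into the register in enable_vfp() above.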
Build an application:
#include <stdio.h>
int main()
{
    float f1 = 1.2, f2 = 1.3;
    f1 = f2 * f1;
    printf("%f\n", f1);
}
adsdebian:/dev/shm# ./a.out
1.560000
As you can see, the result is correct.
However, the kernel's VFP support code does a lot more than this.
Consider the following:
    8374:   ed1b7a05    flds    s14, [fp, #-20]
    8378:   ed5b7a04    flds    s15, [fp, #-16]
    837c:   ee677a27    fmuls   s15, s14, s15
    8380:   ed4b7a05    fsts    s15, [fp, #-20]
    8384:   ed5b7a05    flds    s15, [fp, #-20]
We know that s14 and s15 are coprocessor registers, a resource shared by all processes. Suppose the code above has executed up to 0x8378 and the kernel then switches to another process that also uses s14. When the first process resumes, s14 no longer holds its value, and the result of the multiplication is wrong.
Example:
Boot the VFP-disabled Linux kernel mentioned above and load the driver that modifies the coprocessor registers so that user programs can execute VFP instructions.
[sjl@sjl vfp]$ cat fork.c
#include <unistd.h>
#include <stdio.h>
int main()
{
    float f1 = 1.0, f2 = 1.0;
    pid_t pid;
    fork();
    pid = getpid();
    while (1)
    {
        f1 += f2;
        printf("%d %f\n", pid, f1);
    }
}
[sjl@sjl vfp]$ arm-none-linux-gnueabi-gcc -O2 -mfpu=vfp -mfloat-abi=softfp fork.c
[sjl@sjl vfp]$ arm-none-linux-gnueabi-objdump -d a.out | less
...
000083b4 <main>:
    83b4:   e92d4010    stmdb   sp!, {r4, lr}
    83b8:   ed2d8b03    fstmdbx sp!, {d8}
    83bc:   e24dd004    sub     sp, sp, #4      ; 0x4
    83c0:   ebffffc7    bl      82e4 <.text-0x18>
    83c4:   ebffffc3    bl      82d8 <.text-0x24>
    83c8:   ed9f8a08    flds    s16, [pc, #32]
    83cc:   eef08a48    fcpys   s17, s16
    83d0:   e1a04000    mov     r4, r0
    83d4:   ee388a28    fadds   s16, s16, s17
    83d8:   eeb77ac8    fcvtds  d7, s16
    83dc:   e1a01004    mov     r1, r4
    83e0:   ec532b17    fmrrd   r2, r3, d7
    83e4:   e59f0008    ldr     r0, [pc, #8]    ; 83f4 <.text+0xf8>
    83e8:   ebffffc0    bl      82f0 <.text-0xc>
    83ec:   eafffff8    b       83d4 <main+0x20>
    83f0:   3f800000    svccc   0x00800000
    83f4:   0000847c    andeq   r8, r0, ip, ror r4
...
Execution result:
.....
1382 915.000000
1382 916.000000
1381 2.000000
1381 917.000000
1381 918.000000
....
We can see that the output of process 1381 is not what the program intended.
This example shows that for applications to run correctly, the operating system has to save the VFP context of each process. Of course, the operating system must do many other things as well to support VFP.
At the very least, it seems the operating system should save the application's VFP register values on a process switch. Let's see how Linux does it.
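The failure seen above can be reproduced in a toy model without any ARM hardware. The sketch below is purely illustrative: one global variable stands in for s16, two "tasks" take turns running the f1 += f2 loop, and a flag selects whether the register is saved and restored around each time slice. All names (run_slice, simulate, STEPS_PER_SLICE) are made up for this example.

```c
#include <stdbool.h>

#define STEPS_PER_SLICE 3

struct task {
    bool started;   /* has the task loaded its initial value yet? */
    float saved;    /* per-task copy of the register, used only when saving */
    float last;     /* last value the task observed after its own add */
};

static float hw_s16;  /* stands in for the single hardware register */

/* Run one time slice of "f1 += 1.0" for task t. */
static void run_slice(struct task *t, bool save_restore)
{
    int i;

    if (!t->started) {
        hw_s16 = 1.0f;            /* initial load, like "flds s16, [pc, #32]" */
        t->started = true;
    } else if (save_restore) {
        hw_s16 = t->saved;        /* restore this task's register value */
    }
    for (i = 0; i < STEPS_PER_SLICE; i++) {
        hw_s16 += 1.0f;
        t->last = hw_s16;
    }
    if (save_restore)
        t->saved = hw_s16;        /* save before the next task runs */
}

/* Alternate two tasks for `rounds` rounds; return task A's final value. */
float simulate(int rounds, bool save_restore)
{
    struct task a = { false, 0.0f, 0.0f };
    struct task b = { false, 0.0f, 0.0f };
    int r;

    for (r = 0; r < rounds; r++) {
        run_slice(&a, save_restore);
        run_slice(&b, save_restore);
    }
    return a.last;
}
```

With saving enabled, simulate(3, true) returns 10.0, which is what nine increments starting from 1.0 should give. With saving disabled, simulate(3, false) returns 13.0, because task B's initial load and increments leak into task A through the shared register.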
5. Walkthrough of the kernel VFP code
During a process switch, the kernel call chain is as follows (ARM architecture):
schedule() -> context_switch() -> switch_to() -> __switch_to()
__switch_to() is implemented in arch/arm/kernel/entry-armv.S. The code is not long; the part I care about here is:
...
    mov     r5, r0
    add     r4, r2, #TI_CPU_SAVE
    ldr     r0, =thread_notify_head
    mov     r1, #THREAD_NOTIFY_SWITCH
    bl      atomic_notifier_call_chain
    mov     r0, r5
...
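Translated to C, this is roughly: atomic_notifier_call_chain(&thread_notify_head, THREAD_NOTIFY_SWITCH, next); where the third argument is whatever r2 holds on entry, which describes the thread being switched to.
Now turn to the VFP code: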
static struct notifier_block vfp_notifier_block = {
    .notifier_call = vfp_notifier,
};
vfp_init()
{
    ...
    thread_register_notifier(&vfp_notifier_block);
    ...
}
Clearly, vfp_notifier() is executed during a process switch. I studied the vfp_notifier() code carefully and found no code that saves the VFP registers. This line, however, looks suspicious:
...
    fmxr(FPEXC, fpexc & ~FPEXC_EN);
...
FPEXC_EN in the FPEXC register is the enable bit of the VFP. If it is not set, the CPU raises an "undefined instruction" exception when it executes a VFP instruction. So FPEXC_EN is cleared every time a new process is switched onto the CPU, and the first VFP instruction that process executes then triggers an "undefined instruction" exception.
From there, execution goes through __und_usr() -> call_fpe() -> do_vfp(). This path is actually quite involved; if you are interested, read the code carefully.
do_vfp:
    ...
    ldr     r4, .LCvfp
    ldr     r11, [r10, #TI_CPU]         @ CPU number
    add     r10, r10, #TI_VFPSTATE      @ r10 = workspace
    ldr     pc, [r4]                    @ Call VFP entry point
    ...
.LCvfp:
    .word   vfp_vector
This jumps through vfp_vector. The vfp_vector function pointer is assigned in vfp_init(): vfp_vector = vfp_support_entry;
The key is in vfp_support_entry. It is responsible not only for saving and restoring the VFP register values of a process, but also for handling VFP exceptions (such as division by zero).
You can read through the implementation for the details.
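The overall idea described in this section (disable the VFP on every switch, and only save and restore registers when a process actually traps on a VFP instruction) can be illustrated with a small single-CPU model. This is a simplified sketch and not the kernel's code: the names model_switch_to, model_vfp_access, hw_owner, and the rest are invented here, and a boolean stands in for FPEXC_EN.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NREGS 32

struct task {
    float vfp_state[NREGS];     /* saved copy of the task's VFP registers */
};

static float hw_regs[NREGS];    /* stands in for s0-s31 */
static bool hw_enabled;         /* stands in for FPEXC_EN */
static struct task *hw_owner;   /* task whose values are live in hw_regs */
static int reload_count;        /* how many times state was really swapped */

/* Called on every context switch: only disable the unit, copy nothing. */
void model_switch_to(struct task *next)
{
    (void)next;
    hw_enabled = false;
}

/* Models the trap taken on the first VFP instruction after a switch. */
void model_vfp_access(struct task *t)
{
    if (hw_enabled)
        return;                              /* unit enabled: no trap */
    if (hw_owner != t) {
        if (hw_owner != NULL)                /* save the previous owner */
            memcpy(hw_owner->vfp_state, hw_regs, sizeof hw_regs);
        memcpy(hw_regs, t->vfp_state, sizeof hw_regs);
        hw_owner = t;
        reload_count++;
    }
    hw_enabled = true;                       /* then retry the instruction */
}

/* Register accessors that a task uses in place of real VFP instructions. */
void model_write(struct task *t, int reg, float v)
{
    model_vfp_access(t);
    hw_regs[reg] = v;
}

float model_read(struct task *t, int reg)
{
    model_vfp_access(t);
    return hw_regs[reg];
}

int model_reload_count(void)
{
    return reload_count;
}
```

In this model, a process that is switched out and back in without any other process touching the VFP in between costs no register copying at all, which is the point of deferring the work to the trap handler.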