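I. Vector processing methods
1) Horizontal processing 2) Vertical processing 3) Combined horizontal-vertical processing
The expression D = A × (B + C) is used below to illustrate each method.
(1) Horizontal processing
The elements of the result vector are computed one at a time: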
d1 = a1 × (b1 + c1)
d2 = a2 × (b2 + c2)
...
dn = an × (bn + cn)
Each of the n elements of D is computed in two steps: first the addition k ← bi + ci, where k is a temporary storage unit, and then the multiplication di ← k × ai.
Every element therefore incurs one data dependency, because the multiplication must wait for the result of the addition. On a static pipeline it also requires two function switches, between add and multiply. In total there are n data dependencies and 2n function switches. Horizontal processing is therefore suited only to scalar loop processing, not to vector pipelining.
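The element-by-element scheme above can be sketched as a scalar loop. This is only an illustration, not a model of any particular machine: the function name `horizontal` and the two counters are assumptions introduced here to tally the dependencies and function switches that the text describes.

```python
def horizontal(a, b, c):
    """Compute d[i] = a[i] * (b[i] + c[i]) one element at a time."""
    d = []
    dependencies = 0  # each multiply must wait for the add that produced k
    switches = 0      # a static pipeline is reconfigured before each add and each multiply
    for ai, bi, ci in zip(a, b, c):
        switches += 1         # pipeline set to "add"
        k = bi + ci           # k is the temporary storage unit
        dependencies += 1     # the multiply reads k right after it is written
        switches += 1         # pipeline set to "multiply"
        d.append(ai * k)
    return d, dependencies, switches


d, deps, sw = horizontal([1, 2, 3], [4, 5, 6], [7, 8, 9])
print(d, deps, sw)  # [11, 26, 45] 3 6
```

For three elements the counters come out as 3 dependencies and 6 switches, which matches the n and 2n totals given above.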
(2) Vertical processing
Vertical processing applies one operation to all elements first, and then applies the next operation to all elements.
First compute:
k1 = b1 + c1
k2 = b2 + c2
...
kn = bn + cn
Then compute:
d1 = a1 × k1
d2 = a2 × k2
...
dn = an × kn
There is only one data dependency, between the two vector instructions, and the pipeline switches function only once.
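A minimal sketch of the same computation done operation by operation. The function name `vertical` is an assumption, and the two list comprehensions merely stand in for one vector add instruction and one vector multiply instruction.

```python
def vertical(a, b, c):
    """Compute D = A * (B + C) as two whole-vector operations."""
    k = [bi + ci for bi, ci in zip(b, c)]  # stands in for one vector add
    d = [ai * ki for ai, ki in zip(a, k)]  # stands in for one vector multiply
    dependencies = 1  # the multiply instruction reads the vector k written by the add
    switches = 1      # the pipeline changes from add to multiply once
    return d, dependencies, switches


d, deps, sw = vertical([1, 2, 3], [4, 5, 6], [7, 8, 9])
print(d, deps, sw)  # [11, 26, 45] 1 1
```

The result is the same as in the element-by-element scheme, but the dependency and switch counts no longer grow with the vector length.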
(3) Combined horizontal-vertical processing
The vector elements are divided into groups. Vertical processing is used within each group, and horizontal processing is used across the groups.
Grouping is needed because the vector register length is limited: when the vector length exceeds the maximum length n that a vector register can hold, the vector must be processed in groups.
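Let the vector length be N. Then N = s × n + r,
where n is the vector register length, s is the number of full groups, and r is the remainder (r < n).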
First group:
k[1..n] = b[1..n] + c[1..n]
d[1..n] = a[1..n] × k[1..n]
Second group:
k[n+1..2n] = b[n+1..2n] + c[n+1..2n]
d[n+1..2n] = a[n+1..2n] × k[n+1..2n]
...
Each group uses two vector instructions with one data dependency between them and requires two pipeline function switches. A vector register of n elements is needed to hold the intermediate results.
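The grouped scheme can be sketched as follows. This is an illustration under assumptions: `grouped` is a hypothetical name, the parameter `n` plays the role of the vector register length, and the last pass handles the remainder group when the vector length is not a multiple of `n`.

```python
def grouped(a, b, c, n):
    """Compute D = A * (B + C) in groups of at most n elements."""
    total = len(a)       # the full vector length
    d = []
    dependencies = 0
    switches = 0
    for start in range(0, total, n):
        end = min(start + n, total)                   # the last group may be shorter
        k = [b[i] + c[i] for i in range(start, end)]  # vector add for this group
        switches += 1                                 # pipeline set to "add"
        dependencies += 1                             # the multiply below reads k
        d.extend(a[start + j] * k[j] for j in range(end - start))  # vector multiply
        switches += 1                                 # pipeline set to "multiply"
    return d, dependencies, switches


a = list(range(1, 8))  # 7 elements, register length 3: two full groups plus a remainder of 1
d, deps, sw = grouped(a, [1] * 7, [2] * 7, 3)
print(d, deps, sw)  # [3, 6, 9, 12, 15, 18, 21] 3 6
```

With three groups the sketch reports one dependency and two switches per group, so the cost grows with the number of groups rather than with the number of elements.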
II. Structure of the vector processor
(1) Functional units
There are 12 single-function pipelines that can work in parallel, performing address, vector, and scalar operations in pipelined fashion.
These include six single-function pipelined units: integer add, logical operations, shift, floating-point add, floating-point multiply, and floating-point iterative reciprocal.
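(2) Vector register group V
Each vector register can supply one data element per clock cycle to a functional unit, or receive one result element per clock cycle from a functional unit.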
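(3) Scalar registers S and fast buffer registers T
There are 8 scalar registers, S0-S7, each 64 bits wide. The fast buffer registers T provide buffering between the scalar registers and memory.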
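(4) Vector mask register VM
The vector mask register VM is 64 bits wide, with each bit corresponding to one element of a vector register. It is used for vector merge, compress, expand, and test operations. It can also be used to operate on selected elements of a vector individually.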
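The mask-controlled operations can be illustrated with plain lists. The four helper names and their exact semantics below are assumptions chosen for illustration. They are not the instruction set of any particular machine, and a Python list of 0/1 values stands in for the bits of VM.

```python
def compress(v, mask):
    """Keep only the elements whose mask bit is 1."""
    return [x for x, m in zip(v, mask) if m]


def expand(packed, mask, fill=0):
    """Put packed elements back at the positions whose mask bit is 1."""
    it = iter(packed)
    return [next(it) if m else fill for m in mask]


def merge(x, y, mask):
    """Take x[i] where the mask bit is 1 and y[i] elsewhere."""
    return [xi if m else yi for xi, yi, m in zip(x, y, mask)]


def masked_add(x, y, mask):
    """Add y[i] to x[i] only at the positions selected by the mask."""
    return [xi + yi if m else xi for xi, yi, m in zip(x, y, mask)]


mask = [1, 0, 1, 0]
print(compress([10, 20, 30, 40], mask))             # [10, 30]
print(expand([10, 30], mask))                       # [10, 0, 30, 0]
print(merge([1, 2, 3, 4], [9, 9, 9, 9], mask))      # [1, 9, 3, 9]
print(masked_add([1, 2, 3, 4], [10] * 4, mask))     # [11, 2, 13, 4]
```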
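III. Methods for improving vector processor performance
(1) Provide multiple functional units that work in parallel
To improve performance, a vector processor usually has several independent functional units. These units can work in parallel, each in pipelined fashion, so that multiple arithmetic pipelines operate in parallel.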
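(2) Use chaining to speed up the execution of a sequence of vector instructions
Chaining applies to two instructions with a read-after-write (RAW) dependency. Provided there is no functional-unit conflict and no source-vector conflict, their functional units can be linked together for pipelined processing, which speeds up execution. Chaining can be seen as the application of pipeline forwarding to vector processors. When the first vector functional unit produces its first result and delivers it to the input of the vector register, that result is sent immediately to the input of the next functional unit, which begins the subsequent vector operation. Every intermediate result produced after that is handled the same way. Vector instructions that are to be chained must have equal vector lengths; otherwise they cannot be chained.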
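The benefit of chaining can be estimated with a simplified cycle count. The sketch below assumes that a functional unit delivers its first result after a flow-through time of `e` cycles and one result per cycle after that. It ignores startup time and any cycles spent moving data into and out of vector registers. The function names and the sample latencies are hypothetical.

```python
def unchained_cycles(e_first, e_second, n):
    """Second instruction starts only after the first has produced all n results."""
    return (e_first + n - 1) + (e_second + n - 1)


def chained_cycles(e_first, e_second, n):
    """Each result of the first unit is forwarded at once to the second unit."""
    return e_first + e_second + n - 1


# Hypothetical flow-through times: 6 cycles for the add unit, 7 for the multiply unit.
print(unchained_cycles(6, 7, 64))  # 139
print(chained_cycles(6, 7, 64))    # 76
```

Under these assumptions the chained pair pays the element count only once, so the saving approaches a factor of two as the vector length grows.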
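(3) Use strip mining to speed up loop processing
When the vector length exceeds the vector register length, the long vector must be split into fixed-length segments that are processed in a loop, one segment per iteration. This technique is called strip mining.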
(4) Use a multiprocessor system to further improve performance
IV. Parameters and methods for evaluating vector processing performance
4.1 Processing time of a vector instruction T_vp
On a vector processor, the time T_vp needed to execute a vector instruction with vector length n can be expressed as:
T_vp = T_s + T_vf + (n - 1) T_c
T_s is the startup time of the vector pipeline, which includes setting the vector start address, incrementing the counter, executing the conditional branch instruction, and so on.
T_vf is the flow-through time of the vector pipeline, that is, the time from the start of instruction decoding until the first result emerges from the pipeline.
T_c is the execution time of the pipeline's bottleneck stage.
If there is no bottleneck stage, so that every stage takes one clock cycle, the expression can also be written as:
T_vp = [s + e + (n - 1)] τ
where s is the number of clock cycles needed for pipeline startup, e is the number of clock cycles needed for flow-through, n is the vector length, and τ is the clock cycle time.
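Both forms of the timing expression are simple enough to evaluate directly. The sketch below is illustrative: the function names and the sample values (2 startup cycles, 6 flow-through cycles, 64 elements) are assumptions, not figures from the text.

```python
def t_vp(t_s, t_vf, t_c, n):
    """General form: startup time + flow-through time + (n - 1) bottleneck times."""
    if n < 1:
        raise ValueError("vector length must be at least 1")
    return t_s + t_vf + (n - 1) * t_c


def t_vp_cycles(s, e, n, tau=1):
    """Form without a bottleneck stage: every stage takes one clock cycle tau."""
    if n < 1:
        raise ValueError("vector length must be at least 1")
    return (s + e + (n - 1)) * tau


# Hypothetical values: s = 2 cycles, e = 6 cycles, n = 64 elements, tau = 1.
print(t_vp_cycles(2, 6, 64))  # 71
print(t_vp(2, 6, 1, 64))      # 71
```

The two functions agree whenever T_s = s τ, T_vf = e τ, and T_c = τ.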
The execution time of a group of vector operations depends mainly on three factors: the vector length, whether there are functional-unit conflicts between the vector operations, and data dependencies.
1) A set of vector instructions that can begin execution together within one clock cycle is called a convoy. (The instructions in a convoy must have no functional-unit conflicts and no data dependencies among them.)
2) Instructions that conflict with or depend on one another are placed in different convoys.
3) The execution time of a convoy is that of the longest-running instruction in it.
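The three rules can be sketched as a greedy pass over the instructions in program order. Everything named here is an assumption made for illustration: the `VInstr` record, the unit names, the cycle counts, and the sample program are all hypothetical. For brevity only functional-unit conflicts and read-after-write dependencies are checked.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class VInstr:
    name: str
    unit: str              # functional unit the instruction occupies
    dest: Optional[str]    # vector register written, if any
    srcs: Tuple[str, ...]  # vector registers read
    cycles: int            # hypothetical execution time in clock cycles


def form_convoys(instrs):
    """Pack instructions into convoys, opening a new one on a conflict or dependency."""
    convoys, current = [], []
    for ins in instrs:
        blocked = any(
            ins.unit == prev.unit                                   # functional-unit conflict
            or (prev.dest is not None and prev.dest in ins.srcs)    # reads a result of this convoy
            for prev in current
        )
        if blocked:
            convoys.append(current)
            current = []
        current.append(ins)
    if current:
        convoys.append(current)
    return convoys


def convoy_time(convoy):
    """A convoy takes as long as its longest-running instruction."""
    return max(ins.cycles for ins in convoy)


program = [
    VInstr("load v1", "load/store", "v1", (), 12),
    VInstr("mul v2", "multiply", "v2", ("v1",), 7),
    VInstr("load v3", "load/store", "v3", (), 12),
    VInstr("add v4", "add", "v4", ("v2", "v3"), 6),
    VInstr("store v4", "load/store", None, ("v4",), 12),
]
convoys = form_convoys(program)
for convoy in convoys:
    print([ins.name for ins in convoy], convoy_time(convoy))
```

For this sample program the pass produces four convoys. The second load shares a convoy with the multiply because it uses a different functional unit and does not depend on it.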
When the vector length is greater than the vector register length, strip mining is required. The overhead of one strip-mined pass consists of the overhead T_loop of executing the scalar loop code and the vector startup overhead T_start of each convoy. The total execution time of an operation (program segment) on a vector of length n is therefore:
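A commonly cited textbook form of this total is T_n = ceil(n / MVL) × (T_loop + T_start) + m × n, measured in clock cycles. The symbols MVL (the maximum vector length a register can hold) and m (the number of convoys) are assumptions introduced here and are not defined in this text. T_start is taken as the summed startup overhead of all convoys. A minimal sketch with hypothetical numbers:

```python
def strip_mined_time(n, mvl, m, t_loop, t_start):
    """Clock cycles for a length-n operation split into segments of at most mvl elements."""
    if n < 0 or mvl < 1:
        raise ValueError("n must be non-negative and mvl must be positive")
    passes = (n + mvl - 1) // mvl  # ceil(n / mvl) using integer arithmetic
    return passes * (t_loop + t_start) + m * n


# Hypothetical values: n = 200, MVL = 64, m = 3 convoys, T_loop = 15, T_start = 49.
print(strip_mined_time(200, 64, 3, 15, 49))  # 856
```

With these numbers there are four passes, so the overhead term is 4 × 64 = 256 cycles and the convoy term is 3 × 200 = 600 cycles.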
4.2 Maximum performance R∞
R∞ is the maximum performance of a vector pipeline as the vector length tends to infinity. It is often used to evaluate peak performance and is measured in MFLOPS. It can be expressed as:
4.3 Half-performance length n½
n½ is the vector length required to reach half of R∞. It measures how strongly the pipeline startup time affects performance, reflecting the operations lost while the pipeline is being set up.
When the vector length n = n½, only half of the total vector pipeline time is spent on useful computation, while the other half is wasted. A smaller n½ is generally desirable.
∵ T_vp = T_o + n T_c = (s + n) T_c, where T_o = s T_c is the startup overhead
∴ when n = n½ = T_o / T_c = s, half of the time is spent on useful computation and half on overhead.
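The relation between the two parameters can be checked numerically. The sketch assumes that performance is measured as results per unit time, n / T_vp, under the linear model T_vp = (s + n) T_c used above, so that the rate tends to 1 / T_c as n grows. The function names and sample values are hypothetical.

```python
def rate(n, s, t_c):
    """Results per unit time for vector length n, with T_vp = (s + n) * t_c."""
    return n / ((s + n) * t_c)


def r_inf(t_c):
    """Limit of rate(n, s, t_c) as n tends to infinity."""
    return 1.0 / t_c


s, t_c = 10, 2.0  # hypothetical: 10 cycles of overhead, bottleneck time 2.0
print(rate(s, s, t_c), r_inf(t_c) / 2)  # 0.25 0.25
print(rate(10**6, s, t_c))              # close to r_inf(t_c) = 0.5
```

At n = s the rate is exactly half of the limiting rate, which is the sense in which n½ = s.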
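4.4 Critical vector length n_v
n_v is the vector length beyond which pipelined vector execution is faster than scalar serial execution. This parameter reflects both the startup time and the effect of the scalar-to-vector speed ratio on performance.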
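As a rough illustration, the critical length can be found by comparing two simple timing models. Both models are assumptions made here: vector execution takes (s + n) × t_c, and scalar serial execution takes n × t_scalar, where `t_scalar` is a hypothetical time per element that the text does not define.

```python
def critical_length(s, t_c, t_scalar):
    """Smallest n for which (s + n) * t_c is strictly less than n * t_scalar."""
    if t_scalar <= t_c:
        return None  # under this model the vector mode never overtakes the scalar mode
    n = 1
    while (s + n) * t_c >= n * t_scalar:
        n += 1
    return n


# Hypothetical values: 10 cycles of overhead, 1 cycle per vector element, 3 per scalar element.
print(critical_length(10, 1, 3))  # 6
```

With these numbers the two modes tie at n = 5 (15 cycles each), so the vector mode is strictly faster from n = 6 onward. A larger startup overhead or a smaller scalar-to-vector speed ratio pushes the critical length up.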