
CSI: Motivation
To meet the high performance requirements of multimedia applications,
most vendors of general-purpose CPUs have enhanced the instruction set
architectures (ISAs) of their processors with special media extensions.
Examples of such media extensions are
the Multimedia Extension (MMX) and Internet Streaming SIMD Extensions (SSE) of Intel,
the Visual Instruction Set (VIS) of Sun Microsystems,
and the AltiVec extension of Motorola.
All these extensions are load-store vector architectures. Their instructions operate
on vectors held in vector registers (also called multimedia registers) and process all vector elements
in parallel, as shown in the following picture:
Figure 1: The AltiVec addition instruction vaddubm.
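The element-wise behaviour of such an instruction can be sketched in plain C. The following is a scalar emulation, not AltiVec code: it assumes a 128-bit register holding 16 unsigned bytes, and the function name `vaddubm_emul` is made up for illustration. On real hardware all 16 lanes are processed by a single instruction.

```c
#include <stdint.h>

#define VEC_BYTES 16  /* a 128-bit vector register holds 16 bytes */

/* Scalar emulation of vaddubm (vector add unsigned byte modulo):
 * every byte lane is added independently and wraps modulo 256.
 * The loop only models the semantics; hardware does all lanes at once. */
void vaddubm_emul(uint8_t d[VEC_BYTES],
                  const uint8_t a[VEC_BYTES],
                  const uint8_t b[VEC_BYTES])
{
    for (int i = 0; i < VEC_BYTES; i++) {
        d[i] = (uint8_t)(a[i] + b[i]);  /* the cast gives the modulo-256 wrap */
    }
}
```

Because the addition is modular, a lane holding 250 plus a lane holding 10 yields 4 rather than saturating at 255.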
MMX-like extensions have been shown to improve the performance of multimedia applications executed on
general-purpose CPUs equipped with them.
However, the performance of such processors
is limited by the number of media instructions that must be executed.
We illustrate this statement with the following example.
To avoid this limitation,
the number of instructions should be reduced.
Of the 95 instructions per iteration of the loop presented in the
example, 19 are associated with sectioning and packing/unpacking overhead, and the remaining 76 implement the useful computation.
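To make the two kinds of overhead concrete, here is a minimal sketch in scalar C. It is not the loop from the example: the kernel (`out[i] = saturate(a[i] + b[i] - c[i])`), the function name, and the 16-byte section size are assumptions chosen only to show where sectioning and packing/unpacking work arises when a long byte stream is processed through fixed-size registers.

```c
#include <stddef.h>
#include <stdint.h>

#define SECTION 16  /* assumed register size: 16 bytes per section */

/* Hypothetical kernel: out[i] = saturate_u8(a[i] + b[i] - c[i]). */
void add_sub_sat_sectioned(uint8_t *out, const uint8_t *a,
                           const uint8_t *b, const uint8_t *c, size_t n)
{
    /* Sectioning overhead: loop control and address updates needed
     * only because the stream is longer than one register. */
    for (size_t base = 0; base < n; base += SECTION) {
        size_t len = (n - base < SECTION) ? (n - base) : SECTION;
        int16_t wa[SECTION], wb[SECTION], wc[SECTION], wr[SECTION];

        /* Unpacking overhead: widen 8-bit data to 16 bits so that
         * intermediate results cannot overflow. */
        for (size_t i = 0; i < len; i++) {
            wa[i] = a[base + i];
            wb[i] = b[base + i];
            wc[i] = c[base + i];
        }

        /* Useful computation. */
        for (size_t i = 0; i < len; i++) {
            wr[i] = (int16_t)(wa[i] + wb[i] - wc[i]);
        }

        /* Packing overhead: narrow back to 8 bits with saturation. */
        for (size_t i = 0; i < len; i++) {
            int16_t v = wr[i];
            out[base + i] = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
        }
    }
}
```

Only the middle loop contributes to the result; the outer loop and the widening and narrowing steps correspond to the sectioning and packing/unpacking instructions counted as overhead.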
The CSI architecture reduces the number of executed instructions by:
Eliminating sectioning instructions
Eliminating packing/unpacking instructions
Including new complex media instructions
To implement the first two techniques, CSI includes instructions that operate on
memory-located data streams of arbitrary length and perform packing and unpacking alongside
the main computation. The third technique allows an additional reduction in
instruction count. It is employed
when the multiple operations required for the main computation can be
grouped and compounded into a single complex operation that
does not require more cycles than the simple ones and is implementable in
current technology. In such a case,
a new instruction that specifies this complex operation can be included in CSI.
For example, the Paeth Prediction kernel
lends itself to such a technique [HV01].
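For reference, the operations that make up Paeth prediction can be written as a scalar C sketch. The predictor below follows the definition in the PNG specification; the assumption is that the kernel discussed here uses that definition. The function names, the `bpp` (bytes per pixel) parameter, and the treatment of missing left neighbours as zero are illustrative choices, not taken from [HV01].

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Paeth predictor as defined in the PNG specification:
 * a = left, b = above, c = upper-left neighbour. */
uint8_t paeth_predictor(uint8_t a, uint8_t b, uint8_t c)
{
    int p  = (int)a + (int)b - (int)c;  /* initial estimate */
    int pa = abs(p - (int)a);           /* distances to a, b, c */
    int pb = abs(p - (int)b);
    int pc = abs(p - (int)c);
    if (pa <= pb && pa <= pc) return a; /* ties are broken in the order a, b, c */
    if (pb <= pc) return b;
    return c;
}

/* Hypothetical row driver: writes the predicted value for each byte of
 * the current row, treating neighbours left of the row start as 0. */
void paeth_predict_row_ref(uint8_t *pred, const uint8_t *cur,
                           const uint8_t *prev, size_t n, size_t bpp)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t a = (i >= bpp) ? cur[i - bpp]  : 0;
        uint8_t b = prev[i];
        uint8_t c = (i >= bpp) ? prev[i - bpp] : 0;
        pred[i] = paeth_predictor(a, b, c);
    }
}
```

Each predicted byte needs an add, a subtract, three absolute differences, and two compare-and-select steps. This group of simple operations is the kind of sequence that a single complex instruction could replace.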
When all three methods are applied, the whole kernel, which requires multiple iterations
and 95 AltiVec instructions per iteration (see Figure 3), is reduced to several setup instructions and a single processing
instruction:
Figure 4: Structure of CSI code for the Paeth_predict_row kernel.