CSI: Motivation

To meet the high performance requirements imposed by multimedia applications, most vendors of general-purpose CPUs have enhanced the instruction set architectures (ISAs) of their processors with special media extensions. Examples of such media extensions are the Multimedia Extension (MMX) and Internet Streaming SIMD Extension (SSE) of Intel, the Visual Instruction Set (VIS) of Sun Microsystems, and the AltiVec extension of Motorola.

All these extensions are load-store vector architectures. Their instructions operate on vectors held in vector registers (also called multimedia registers) and process all vector elements in parallel, as shown in the following picture:
Figure 1: The AltiVec addition instruction vaddubm.
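The semantics of vaddubm (vector add unsigned byte modulo) can also be modelled in scalar C. The following is a minimal sketch, not AltiVec code: the type `vec_u8` and the function `vaddubm_model` are hypothetical names introduced here for illustration. The real instruction performs all 16 byte additions of a 128-bit register in parallel, whereas the loop below only models the result.

```c
#include <stdint.h>

#define VEC_BYTES 16  /* a 128-bit AltiVec register holds 16 bytes */

/* Hypothetical stand-in for a 128-bit vector register (illustration only). */
typedef struct {
    uint8_t e[VEC_BYTES];
} vec_u8;

/* Scalar model of vaddubm: element-wise addition of unsigned bytes,
 * modulo 256 (results wrap around, no saturation). */
vec_u8 vaddubm_model(vec_u8 a, vec_u8 b)
{
    vec_u8 d;
    for (int i = 0; i < VEC_BYTES; i++) {
        d.e[i] = (uint8_t)(a.e[i] + b.e[i]);  /* wraps on overflow */
    }
    return d;
}
```

The modulo behaviour is one reason media code often has to unpack bytes to wider elements before computing: a sum such as 250 + 6 wraps to 0 instead of reaching 256.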
The MMX-like extensions have been shown to improve the performance of multimedia applications executed on general-purpose CPUs equipped with them. However, the performance of such processors is limited by the number of media instructions that must be executed. We illustrate this statement with the following example.

To avoid this limitation, the number of instructions should be reduced. Of the 95 instructions per iteration of the loop presented in the example, 19 are associated with sectioning and packing/unpacking overhead, and the remaining 76 implement the useful computation. The CSI architecture reduces the number of executed instructions by:
  • Eliminating sectioning instructions

  • Eliminating packing/unpacking instructions

  • Including new complex media instructions
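The overhead targeted by the first two items can be sketched in scalar C. The example below is an illustration under stated assumptions, not AltiVec or CSI code: the function names, the section size of 16 elements, and the choice of a saturating byte addition are hypothetical. It shows where the extra work comes from when a stream of arbitrary length is processed through fixed-size registers, and its instruction counts do not correspond to the figures quoted above.

```c
#include <stddef.h>
#include <stdint.h>

#define SECTION 16  /* elements per vector register (assumed: 128 bits of bytes) */

/* Packing step: narrow a 16-bit intermediate to an unsigned byte with saturation. */
static uint8_t pack_sat_u8(int16_t v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}

/* dst[i] = saturate(a[i] + b[i]) for a stream of arbitrary length n. */
void add_stream_sectioned(uint8_t *dst, const uint8_t *a,
                          const uint8_t *b, size_t n)
{
    size_t i = 0;

    /* Sectioning overhead: loop control and address updates once per section. */
    for (; i + SECTION <= n; i += SECTION) {
        int16_t wide[SECTION];

        /* Unpacking overhead: widen bytes to 16 bits so the sum cannot overflow. */
        for (size_t k = 0; k < SECTION; k++) {
            wide[k] = (int16_t)((int16_t)a[i + k] + (int16_t)b[i + k]);
        }

        /* Packing overhead: narrow the 16-bit results back to bytes. */
        for (size_t k = 0; k < SECTION; k++) {
            dst[i + k] = pack_sat_u8(wide[k]);
        }
    }

    /* Remainder shorter than one section, handled element by element. */
    for (; i < n; i++) {
        dst[i] = pack_sat_u8((int16_t)(a[i] + b[i]));
    }
}
```

Only the addition itself is useful computation. The section loop, the widening, the narrowing, and the remainder handling are the kind of bookkeeping that an instruction operating directly on a memory stream of arbitrary length would make unnecessary.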

To implement the first two techniques, CSI includes instructions that operate on memory-located data streams of arbitrary length and perform packing and unpacking alongside the main computation. The third technique allows an additional instruction count reduction. It is employed when the multiple operations required for the main computation can be grouped and compounded into a single complex operation that requires no more cycles than the simple ones and is implementable in current technology. In such a case, a new instruction that specifies this complex operation can be included in CSI. For example, the Paeth Prediction kernel lends itself to such a technique [HV01]. When all three methods are applied, the whole kernel, which requires multiple iterations and 95 AltiVec instructions per iteration (see Figure 3), is reduced to several setup instructions and a single processing instruction:
Figure 4: Structure of the CSI code for the Paeth_predict_row kernel.
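For reference, the per-element computation behind Paeth prediction can be written as a short scalar C routine. This is a minimal sketch based on the Paeth predictor as defined in the PNG specification, not the CSI or AltiVec code of the figures. The function names, the signature, the one-byte-per-pixel layout, and the treatment of the missing left neighbours as zero are assumptions made for illustration, and the original kernel may differ in these details.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Paeth predictor (PNG specification): a = left, b = above, c = upper left.
 * Returns whichever neighbour is closest to the estimate a + b - c. */
static uint8_t paeth_predictor(uint8_t a, uint8_t b, uint8_t c)
{
    int p  = (int)a + (int)b - (int)c;  /* initial estimate */
    int pa = abs(p - (int)a);           /* distances to a, b, c */
    int pb = abs(p - (int)b);
    int pc = abs(p - (int)c);

    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc)             return b;
    return c;
}

/* Hypothetical row routine: one byte per pixel, with the left and upper-left
 * neighbours of the first pixel taken as 0. */
void paeth_predict_row_scalar(uint8_t *pred, const uint8_t *prev_row,
                              const uint8_t *curr_row, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t a = (i > 0) ? curr_row[i - 1] : 0;
        uint8_t b = prev_row[i];
        uint8_t c = (i > 0) ? prev_row[i - 1] : 0;
        pred[i] = paeth_predictor(a, b, c);
    }
}
```

The predictor is a chain of subtractions, absolute values, comparisons, and selections on byte data. This is the group of simple operations that the third technique compounds into a single complex operation.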