Implementation of CSI

CSI introduces complex instructions. Introducing such complex instructions is meaningful only if they are implementable at acceptable cost in contemporary technology. Their complexity can be attributed to the following two features:
  • Complexity of the main operation
  • Complexity of the miscellaneous operations: address generation, data access, packing/unpacking, etc.
The implementability of the complex arithmetic operations, such as SAD and Paeth prediction, was addressed, for example, in the paper published in the Journal of VLSI Signal Processing (vol. 28, 2001). The implementability of the miscellaneous operations was addressed in the paper presented at the Euromicro 2002 DSD Symposium. There, we described the general organization of a unit capable of executing CSI instructions (the stream unit) and studied the complexity of address generation. Below, we give a short sketch of that paper.
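To make the two arithmetic operations named above concrete, here is a minimal software sketch of what each computes per element. The Paeth predictor follows the definition in the PNG specification; the function names `paeth_predict` and `sad` are illustrative only, and the hardware formulation in the cited paper may be organized differently.

```python
def paeth_predict(a: int, b: int, c: int) -> int:
    """Paeth predictor as defined for PNG filtering.

    a = left neighbour, b = neighbour above, c = upper-left neighbour.
    Returns whichever of a, b, c is closest to the estimate a + b - c,
    breaking ties in the order a, b, c.
    """
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    if pb <= pc:
        return b
    return c


def sad(xs, ys) -> int:
    """Sum of absolute differences of two equally long element sequences."""
    return sum(abs(x - y) for x, y in zip(xs, ys))
```

Both operations need only additions, subtractions, absolute values, and comparisons per element, which is why they are natural candidates for an adder-based or multiplier-adjacent SIMD functional unit.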

A CSI instruction such as csi_add loads the source streams from memory, unpacks (if necessary) the stream elements from storage to computational format, performs a certain operation on corresponding elements, packs (again if necessary) the results, and stores the resulting output stream back to memory. Since these operations are independent, they can be pipelined. The CSI execution unit is, therefore, organized as a pipeline in which stream data flows through a sequence of stages that perform these operations. The datapath of the streaming execution unit is depicted in Figure 7. For clarity, some parts (for example, floating-point hardware) have been omitted.
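The stage sequence described above can be sketched in software as a chain of generators, where each element flows through load, unpack, add, pack, and store in turn. This is only an illustration of the dataflow, not the hardware design: the 8-bit storage format, the saturating pack step, the unit stride, and all function names (`csi_add_sketch` and its helpers) are assumptions made for the example.

```python
def load(memory, base, length):
    """Load stage: read `length` consecutive storage-format elements."""
    for i in range(length):
        yield memory[base + i]


def unpack(stream):
    """Unpack stage: widen 8-bit storage values to plain integers (assumed format)."""
    for byte in stream:
        yield int(byte)


def add(stream_a, stream_b):
    """Operate stage: add corresponding elements of the two source streams."""
    for x, y in zip(stream_a, stream_b):
        yield x + y


def pack(stream):
    """Pack stage: saturate results back to the 8-bit storage range (assumed)."""
    for value in stream:
        yield max(0, min(255, value))


def store(memory, base, stream):
    """Store stage: write the output stream back to consecutive locations."""
    for i, value in enumerate(stream):
        memory[base + i] = value


def csi_add_sketch(memory, src1, src2, dst, length):
    """Chain the stages; source and destination ranges are assumed not to overlap."""
    a = unpack(load(memory, src1, length))
    b = unpack(load(memory, src2, length))
    store(memory, dst, pack(add(a, b)))
```

For example, with `memory = bytearray([100, 200, 250, 7, 100, 100, 10, 1, 0, 0, 0, 0])`, calling `csi_add_sketch(memory, 0, 4, 8, 4)` leaves the saturated sums in `memory[8:12]`. Because each generator pulls one element at a time from the stage before it, the structure mirrors the way independent stages can overlap in a hardware pipeline.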

Figure 7: Datapath of the streaming execution unit
The memory interface unit is responsible for transferring data between the memory hierarchy and the stream input buffers. In addition, if the source stream elements are not stored consecutively, it must also extract them and store them consecutively in the stream buffers. If the destination stream elements are not stored consecutively, the unit must perform the reverse operation, scattering the data to the appropriate memory locations. Each unpack unit converts stream data from storage format to computational format (if required). The functional units medADD and medMUL perform SIMD parallel operations. The medADD unit performs the usual addition, subtraction, and bitwise logical operations, as well as addition-related operations such as the Paeth operation [see HV01 paper]. The medMUL unit performs the packed multiply operation and could also incorporate more complex media operations such as the sum of absolute differences (SAD). From the output register, data flows to the stream output buffer via the pack unit. The pack unit converts the data, if necessary, from computational format to storage format under control of the Misc register of the destination stream. When no conversion is needed, the data passes through the unit unchanged.
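The extract and scatter duties of the memory interface unit can be illustrated with a small sketch for a stream whose elements lie a fixed stride apart. The names `gather` and `scatter`, the flat list standing in for memory, and the restriction to a single constant stride are assumptions for this example; the actual unit works on cache lines and stream buffers.

```python
def gather(memory, base, stride, length):
    """Collect `length` elements spaced `stride` apart into a consecutive buffer."""
    return [memory[base + i * stride] for i in range(length)]


def scatter(memory, base, stride, values):
    """Reverse operation: write consecutive buffer elements `stride` apart."""
    for i, value in enumerate(values):
        memory[base + i * stride] = value
```

With a 16-element memory, `gather(memory, 1, 4, 4)` reads the elements at addresses 1, 5, 9, and 13, and `scatter(memory, 2, 4, buf)` writes four results to addresses 2, 6, 10, and 14 while leaving every other location untouched.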

The design of the CSI execution unit is strongly influenced by the memory hierarchy level it is connected to. It can be connected to the first-level (L1) cache, or it can bypass the L1 cache and go directly to the L2 cache or even main memory. We studied the complexity of address generation for a CSI unit interfaced to the L1 cache. The decision to interface the CSI unit to this cache was motivated by two observations: with realistic L1 cache sizes, most multimedia applications achieve high hit rates, and, since the L1 cache is on-chip, the path between the cache and the streaming execution unit can be widened. We have shown that the address-generation computations required for such a design are implementable. For details, please refer to the Euromicro-2002 paper.
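As a rough illustration of what address generation against a cache involves, the sketch below computes the element addresses of a strided stream and groups them by the cache line they fall in, so that each line is requested once and the wanted bytes are picked out by their offsets. This is not the scheme from the Euromicro paper: the 32-byte line size, the one-byte elements, and the function name `line_requests` are all assumptions chosen for the example.

```python
def line_requests(base, stride, length, line_size=32):
    """Group the addresses of a strided stream by cache line.

    Returns a dict mapping each line's start address to the list of byte
    offsets within that line that hold stream elements, in stream order.
    """
    requests = {}
    for i in range(length):
        addr = base + i * stride            # address of the i-th stream element
        line = addr - addr % line_size      # start address of the enclosing line
        requests.setdefault(line, []).append(addr - line)
    return requests
```

For a stream starting at address 24 with stride 8 and four elements, the element addresses are 24, 32, 40, and 48, so the first line supplies one element and the second supplies three. A wider path to the cache pays off most when many elements share a line, as they do for small strides.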