
Implementation of CSI
CSI introduces complex instructions.
Introducing such instructions is meaningful only
if they can be implemented
at acceptable cost in contemporary technology.
Their complexity can be attributed to the
following two features:
- Complexity of the main operation
- Complexity of the miscellaneous operations: address generation,
data access, packing/unpacking, etc.
The implementability of the complex arithmetic operations, such as SAD and
Paeth prediction, was addressed, for example, in the
paper published in the Journal of VLSI Signal Processing (vol. 28, 2001).
The implementability of the miscellaneous operations was addressed in the
paper
presented at the Euromicro 2002 DSD Symposium. There, we described the
general organization of a unit capable of executing CSI instructions (the
stream unit) and studied the complexity of address generation.
Below, we give a short sketch of that paper.
A CSI instruction such as csi_add loads the source streams from
memory, unpacks (if necessary) the stream elements from storage
to computational format, performs a certain operation on corresponding
elements, packs (again if necessary) the results,
and stores the resulting output stream back to memory.
Since these operations can work on different stream elements at the same time, they can be pipelined.
The CSI execution unit is, therefore, organized as a pipeline
in which stream data flows through a sequence of stages that
perform these operations.
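The flow of data through these stages can be illustrated with a minimal Python sketch. It is only a functional model, not the hardware design: the function names, the unsigned 8-bit storage format, the consecutive element layout, and the saturating pack step are all assumptions made for illustration. Each stage is a generator, so an element moves through all stages before the next one is loaded, which loosely mirrors a pipeline.

```python
# Functional sketch of a csi_add-like instruction (all names are hypothetical).
# Assumptions: unsigned 8-bit storage format, consecutive elements,
# and a pack stage that saturates to the 0..255 range.

def load(memory, base, length):
    """Stage 1: fetch `length` consecutive stream elements from memory."""
    for i in range(length):
        yield memory[base + i]

def unpack(elements):
    """Stage 2: convert storage format to a wider computational format."""
    for e in elements:
        yield int(e)  # plain Python ints stand in for wider lanes

def add(xs, ys):
    """Stage 3: operate on corresponding elements of both source streams."""
    for x, y in zip(xs, ys):
        yield x + y

def pack(elements):
    """Stage 4: convert back to storage format (assumed: saturation)."""
    for e in elements:
        yield max(0, min(255, e))

def store(memory, base, elements):
    """Stage 5: write the output stream back to memory."""
    for i, e in enumerate(elements):
        memory[base + i] = e

def csi_add_sketch(memory, src1, src2, dst, length):
    a = unpack(load(memory, src1, length))
    b = unpack(load(memory, src2, length))
    store(memory, dst, pack(add(a, b)))

# Two 4-element source streams followed by a 4-element destination stream.
memory = bytearray([10, 200, 30, 250,   5, 100, 40, 10,   0, 0, 0, 0])
csi_add_sketch(memory, src1=0, src2=4, dst=8, length=4)
print(list(memory[8:12]))
```

The sketch models only what each stage does to the data; it says nothing about timing, buffering, or how many elements the real unit processes per cycle.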
The datapath of the streaming execution unit is depicted in
Figure 7.
For clarity, some parts (for example, floating-point hardware)
have been omitted.
Figure 7: Datapath of the streaming execution unit
The memory interface unit is responsible for transferring data
between the memory hierarchy and the stream input buffers.
In addition, if the source stream elements are not stored consecutively,
it must also extract and store them consecutively in the stream buffers.
If the destination stream elements are not stored consecutively,
the unit must perform the reverse operation, scattering data into
appropriate memory locations.
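The gather and scatter behaviour described above can be sketched in a few lines of Python. This assumes, purely for illustration, that a stream is described by a base address, a constant stride, and a length; the function names are hypothetical and the actual stream descriptors may be richer.

```python
# Sketch of the gather/scatter role of the memory interface unit.
# Assumption: a stream is (base, stride, length) with a constant stride.

def gather(memory, base, stride, length):
    """Collect non-consecutive elements into a consecutive buffer."""
    return [memory[base + i * stride] for i in range(length)]

def scatter(memory, base, stride, values):
    """Reverse operation: spread a consecutive buffer over strided locations."""
    for i, v in enumerate(values):
        memory[base + i * stride] = v

# A 4x4 row-major image: a column is a stream with stride 4.
memory = list(range(16))
column = gather(memory, base=1, stride=4, length=4)   # second column
scatter(memory, base=2, stride=4, values=column)      # copy into third column
print(column)
```

With a stride of 1 both functions reduce to a plain consecutive copy, which corresponds to the case where no extraction is needed.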
Each unpack unit converts stream data from storage format to
computational format (if required).
The functional units medADD and medMUL perform
SIMD parallel operations.
The medADD unit performs the usual addition, subtraction,
and bitwise logical operations, as well as addition-related
operations such as the Paeth operation [see HV01 paper].
The medMUL unit performs the packed multiply operation and could
also incorporate more complex media operations such as the
sum of absolute differences (SAD).
From the output register, data flows
to the stream output buffer via the pack unit.
The pack unit converts, if necessary, the data from computational format
to storage format under control of the Misc register of
the destination stream.
When no conversion is needed, data is passed
through the unit without being changed.
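As an illustration of what such a format conversion might look like, the Python sketch below unpacks a 32-bit word into four unsigned 8-bit elements and packs wider results back with saturation. The element width, the lane order, and the use of saturation are assumptions made for this example; the formats actually selected through the Misc register are not specified here, and the function names are hypothetical.

```python
# Sketch of unpack/pack between a packed storage word and separate lanes.
# Assumptions: 4 unsigned 8-bit elements per 32-bit word, lane 0 in the
# least significant byte, saturation when a result does not fit.

def unpack_u8(word, lanes=4):
    """Split a packed word into `lanes` unsigned 8-bit elements."""
    return [(word >> (8 * i)) & 0xFF for i in range(lanes)]

def pack_u8_saturate(values):
    """Clamp each value to 0..255 and merge the results into one word."""
    word = 0
    for i, v in enumerate(values):
        word |= max(0, min(255, v)) << (8 * i)
    return word

elements = unpack_u8(0x04030201)
packed = pack_u8_saturate([1, 300, -5, 4])   # 300 and -5 are out of range
print(elements, hex(packed))
```

When every value already fits the storage format, packing after unpacking returns the original word, which corresponds to data passing through unchanged.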
The design of the CSI execution unit is strongly influenced by the
memory hierarchy level it is connected to.
It can be connected to the first-level (L1) cache, or it
can bypass the L1 cache and go directly to the
L2 cache or even main memory.
We studied the complexity of address generation for a CSI unit interfaced to
the L1 cache. The decision to interface the CSI unit to this cache was
motivated by two observations: with realistic L1 cache sizes, most multimedia
applications achieve high hit rates, and, since the L1 cache is on-chip,
the path between the cache and the
streaming execution unit can be widened.
We have shown that the address-generation computations required for such a design
are implementable. For details, please refer to the
Euromicro 2002 paper.