Experimental Results

We studied the performance of superscalar processors enhanced with CSI on the following media benchmarks, which represent different application domains: JPEG and MPEG-2 encoders/decoders (image and video coding/decoding subdomain), image-processing kernels from Sun's VIS developer kit (2-D image processing subdomain), and SPEC's viewperf (3-D graphics subdomain). The performance of the CSI-enhanced superscalar processors was compared with that of processors extended with Sun's VIS or Intel's SSE media extensions. These studies were presented in several papers; below, we briefly review some of the results.

The main goal of CSI is to reduce the number of executed instructions. Figure 8 depicts, for several media kernels, the ratio of the dynamic instruction count of a 4-way issue superscalar VIS-enhanced CPU to the dynamic instruction count of the same processor enhanced with the CSI execution hardware.
Figure 8: Instruction count reductions (times), CSI w.r.t. VIS
As expected, CSI significantly reduces instruction counts: for the kernels, the reduction ranges from 6.34 times to 16.07 times, with an average of 12.14 times. For complete applications, CSI reduces the number of executed instructions by a factor of up to 2.05 (djpeg, the JPEG decoder).

These reductions in instruction count translate into significant speedups. For example, a 4-way issue superscalar processor with a 64-entry instruction window, enhanced with a CSI execution unit capable of processing 32 bytes in parallel, outperforms the same processor enhanced with VIS execution units of the same parallel processing capability by a factor of up to 7.8 at the kernel level (the add8 kernel, which adds two images) and by a factor of up to 1.5 at the application level (djpeg, the JPEG decoder). Additionally, we found that the performance of CSI-enhanced processors scales much better than that of VIS-enhanced processors as the amount of parallel processing hardware is increased.
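The two metrics used above can be sketched in a few lines of Python. This is an illustration only, not the methodology of the cited studies: the function names are hypothetical, and the Amdahl's-law estimate is a standard simplification, swapped in here to show why a large kernel-level gain turns into a smaller application-level gain when only part of the run time is accelerated. Any instruction counts or time fractions passed to these functions are the caller's assumptions, not measured values from the experiments.

```python
def reduction_factor(baseline_count: int, enhanced_count: int) -> float:
    """Ratio of dynamic instruction counts (baseline / enhanced).

    This mirrors the metric plotted per kernel: the count on the
    VIS-enhanced CPU divided by the count on the CSI-enhanced CPU.
    """
    if baseline_count < 0 or enhanced_count <= 0:
        raise ValueError("counts must be non-negative, enhanced_count positive")
    return baseline_count / enhanced_count


def amdahl_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """Overall speedup when only a fraction of the run time is accelerated.

    accelerated_fraction: share of baseline run time spent in the
        accelerated kernels (hypothetical input, between 0 and 1).
    kernel_speedup: speedup achieved on that share alone.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must lie in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / kernel_speedup)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the function.
    print(reduction_factor(1_200_000, 100_000))
    # Hypothetical time fraction; 7.8 is the kernel-level factor quoted above.
    print(amdahl_speedup(0.4, 7.8))
```

Under this simplified model, the overall speedup can never exceed `1 / (1 - accelerated_fraction)`, however fast the kernels become, which is consistent with application-level factors being much smaller than kernel-level ones.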