
CSI: Limitations of MMX-like Extensions (Example)



We consider the fragment of C code presented on the left-hand side of Figure 3. It is extracted from an implementation of the Portable Network Graphics (PNG) standard, a popular standard for image compression and decompression. This code fragment computes the Paeth prediction for each pixel of the current row, starting from the second pixel. From the three neighboring pixels a, b, and c that surround the current pixel d, as depicted in Figure 2, the Paeth prediction scheme selects the one that differs the least from the initial prediction p = a + b - c. The selected pixel is called the Paeth prediction for d.
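The selection rule described above can be sketched in scalar C as follows. This is a minimal sketch, not the code shown in Figure 3: it assumes the naming and tie-breaking order of the PNG specification (a is the pixel to the left, b the pixel above, c the pixel above and to the left; ties are resolved in the order a, b, c) and one byte per pixel. The function names `paeth_predict` and `paeth_row` and the separate `pred` output buffer are hypothetical and introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Paeth predictor for one pixel.
 * Assumed naming (PNG specification): a = left, b = above, c = upper left. */
unsigned char paeth_predict(unsigned char a, unsigned char b, unsigned char c)
{
    int p  = a + b - c;      /* initial prediction, may lie in -255..510 */
    int pa = abs(p - a);     /* distance of each neighbor from p */
    int pb = abs(p - b);
    int pc = abs(p - c);

    /* Pick the neighbor closest to p; ties are broken in the order a, b, c. */
    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc) return b;
    return c;
}

/* Predict every pixel of the current row, starting from the second pixel.
 * prev is the previous row, cur the current row (assumed one byte per pixel).
 * pred[0] is left untouched because the first pixel has no left neighbor. */
void paeth_row(const unsigned char *prev, const unsigned char *cur,
               unsigned char *pred, size_t n)
{
    assert(n == 0 || (prev && cur && pred));
    for (size_t i = 1; i < n; i++)
        pred[i] = paeth_predict(cur[i - 1], prev[i], prev[i - 1]);
}
```

For example, with a = 50, b = 60, and c = 55 the initial prediction is p = 55, so c is the closest neighbor and becomes the prediction.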
Figure 2: The current pixel d and its neighboring pixels a, b, and c.
Figure 3: The C-code of the Paeth prediction kernel and the structure of its AltiVec implementation.

The right-hand side of Figure 3 presents the structure of an AltiVec implementation of the kernel. We grouped consecutive instructions according to the operations they collectively perform. Each group is represented by a box. The number in the lower-right corner shows how many instructions belong to the group. The AltiVec implementation is organized as a loop, each iteration of which processes 16-byte sections of the a, b, and c pixel streams and computes the Paeth prediction for 16 consecutive pixels. The Loop initialization group initializes the loop counter and the pointers to the source data streams. Instructions of the Load Data group load 16-byte sections of the a, b, and c streams into the AltiVec media registers. The Unpack group converts (or unpacks) the loaded pixels from the 8-bit storage format to the 16-bit computation format. The 16-bit format is needed to avoid loss of precision. The Compute group consists of 76 instructions and computes the Paeth prediction for a section of 16 consecutive d pixels. The Pack group converts (or packs) the resulting Paeth prediction values back to the 8-bit storage format. The single instruction of the Store group writes the 16 resulting 8-bit predictions to memory. The Pointer Update group consists of three instructions, which increment the pointers so that the subsequent stream sections are accessed during the next iteration. The instructions of the Loop control group update the loop counter, test the loop termination condition, and branch. Finally, the instructions of the Loop postprocessing group may be executed after the loop has terminated in order to process the remaining parts of the streams in scalar (non-AltiVec) mode.
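The loop structure described above can be mimicked in portable scalar C, with comments naming the corresponding instruction groups. This is only an illustrative emulation, not AltiVec code and not the implementation shown in Figure 3: each 16-lane array stands in for vector registers, the names `paeth_lane`, `paeth_blocked`, and `LANES` are hypothetical, and the separate `out` stream is an assumption. The widening to `int16_t` reflects why the 16-bit computation format is needed: the initial prediction a + b - c can range from -255 to 510, which does not fit in 8 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

enum { LANES = 16 };  /* pixels processed per loop iteration */

/* Paeth selection on values already widened to 16 bits. */
int16_t paeth_lane(int16_t a, int16_t b, int16_t c)
{
    int16_t p  = (int16_t)(a + b - c);   /* fits: range is -255..510 */
    int16_t pa = (int16_t)abs(p - a);
    int16_t pb = (int16_t)abs(p - b);
    int16_t pc = (int16_t)abs(p - c);
    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc) return b;
    return c;
}

void paeth_blocked(const uint8_t *a, const uint8_t *b, const uint8_t *c,
                   uint8_t *out, size_t n)
{
    assert(n == 0 || (a && b && c && out));

    size_t blocks = n / LANES;               /* Loop initialization */
    size_t tail   = n % LANES;

    while (blocks-- > 0) {                   /* Loop control */
        int16_t wa[LANES], wb[LANES], wc[LANES], wr[LANES];

        for (int i = 0; i < LANES; i++) {    /* Load Data + Unpack (8 -> 16 bit) */
            wa[i] = a[i];
            wb[i] = b[i];
            wc[i] = c[i];
        }
        for (int i = 0; i < LANES; i++)      /* Compute */
            wr[i] = paeth_lane(wa[i], wb[i], wc[i]);

        for (int i = 0; i < LANES; i++)      /* Pack (16 -> 8 bit) + Store */
            out[i] = (uint8_t)wr[i];

        a += LANES; b += LANES;              /* Pointer Update */
        c += LANES; out += LANES;
    }

    for (size_t i = 0; i < tail; i++)        /* Loop postprocessing (scalar) */
        out[i] = (uint8_t)paeth_lane(a[i], b[i], c[i]);
}
```

Calling `paeth_blocked` on streams of 20 pixels, for instance, runs one 16-pixel iteration and leaves the last 4 pixels to the scalar postprocessing step.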

From Figure 3 we observe that a loop iteration computes 16 Paeth prediction values and requires the execution of 3 (load) + 6 (unpack) + 76 (compute) + 1 (pack) + 1 (store) + 2 (miscellaneous) + 3 (pointer update) + 3 (loop control) = 95 instructions. This implies that, on a 4-way issue processor, the execution time of the loop cannot be reduced below $\frac{95}{4} = 23.75 > 23$ cycles per iteration.
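The instruction count and the issue-width bound can be checked with a few lines of C. This is a sketch under the assumption that the only limit considered is the issue width, ignoring dependencies, latencies, and memory stalls. The per-group counts are those quoted above; the constant and function names are hypothetical. Because 95 is not a multiple of 4, a single iteration taken in isolation needs 24 whole cycles to issue, while 23.75 is the bound on the average when iterations overlap.

```c
#include <assert.h>

/* Instructions per loop iteration, by group, as quoted from Figure 3. */
enum {
    N_LOAD = 3, N_UNPACK = 6, N_COMPUTE = 76, N_PACK = 1,
    N_STORE = 1, N_MISC = 2, N_POINTER_UPDATE = 3, N_LOOP_CONTROL = 3
};

/* Total number of instructions executed in one loop iteration. */
int instructions_per_iteration(void)
{
    return N_LOAD + N_UNPACK + N_COMPUTE + N_PACK +
           N_STORE + N_MISC + N_POINTER_UPDATE + N_LOOP_CONTROL;
}

/* Smallest whole number of cycles in which `n` instructions can be issued
 * when at most `width` instructions issue per cycle (ceiling division). */
int min_issue_cycles(int n, int width)
{
    assert(n >= 0 && width > 0);
    return (n + width - 1) / width;
}
```

Here `instructions_per_iteration()` evaluates to 95 and `min_issue_cycles(95, 4)` to 24. Of the 95 instructions, 76 belong to the Compute group, so the remaining 19 are spent on loading, format conversion, storing, and loop bookkeeping.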