
CSI: Limitations of MMX-like Extensions (Example)
We consider the piece of C-code presented
on the left-hand side of Figure 3.
It is extracted from an implementation of the Portable Network Graphics (PNG) standard,
a popular standard for image compression and
decompression.
This code fragment computes the Paeth prediction for each pixel of the current row, starting from the second pixel.
From the three neighboring pixels a, b, and
c, which surround the current pixel d as depicted in Figure 2,
the Paeth prediction scheme selects the pixel that differs the least from the initial prediction p=a+b-c. The selected pixel is called the Paeth prediction for d.
Figure 2:
The neighboring pixels a, b, and c and the current pixel d.
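The selection rule described above can be sketched in scalar C as follows. This is a minimal illustration, not the code of Figure 3: the function name `paeth_predict` is hypothetical, and the tie-breaking order (a first, then b, then c) is an assumption taken from the PNG specification, since the text does not state it.

```c
#include <stdlib.h>  /* abs */

/* Paeth prediction for one pixel component.
 * a, b, c are the three neighbors of the current pixel d (Figure 2).
 * Ties are resolved in the order a, b, c; this order is an assumption
 * borrowed from the PNG specification. */
static unsigned char paeth_predict(unsigned char a, unsigned char b,
                                   unsigned char c)
{
    int p  = (int)a + (int)b - (int)c;  /* initial prediction p = a + b - c */
    int pa = abs(p - (int)a);           /* distance of each neighbor to p   */
    int pb = abs(p - (int)b);
    int pc = abs(p - (int)c);

    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc) return b;
    return c;
}
```

The intermediate values are held in `int` because p can range from -255 to 510, which does not fit in 8 bits.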
Figure 3:
The C-code of the Paeth prediction kernel and the structure of its AltiVec implementation.
The right-hand side of Figure 3 presents
the structure of an AltiVec implementation of the kernel.
We grouped consecutive instructions according to the operations
they collectively perform.
Each group is represented by a box. The number in the lower-right corner shows
how many instructions belong to the group.
The AltiVec implementation is organized as a loop, each iteration of which
processes 16-byte sections of
the a, b, and c pixel streams
and computes the Paeth
prediction for 16 consecutive pixels.
The Loop initialization group initializes the loop counter,
and the pointers to the source data streams.
Instructions of the Load Data group load 16-byte sections of
the a, b, and c streams
into the AltiVec media registers.
The Unpack group converts (or unpacks) loaded pixels from the 8-bit storage
to the 16-bit computation format. The 16-bit format is needed
to avoid loss of precision.
The Compute group consists of 76 instructions and
computes the Paeth prediction for a section of 16 consecutive d pixels.
The Pack group converts (or packs)
the resulting Paeth prediction values
back to the 8-bit storage format.
The single instruction of the Store group writes the 16 resulting 8-bit
predictions to memory.
The Pointer Update group consists of three instructions, which increment the pointers so
that the subsequent stream sections are accessed during
the next iteration.
The instructions of the Loop control group update the loop counter,
test the loop termination condition, and branch.
Finally, the instructions of the Loop postprocessing group may be executed
after the loop terminates in order to process
the remaining parts of the streams in scalar (non-AltiVec) mode.
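To make the loop organization concrete, the following portable C sketch mirrors the groups described above without using AltiVec instructions. It is only an illustration under stated assumptions: the names `paeth_blocked`, `paeth_scalar`, and `BLOCK` are hypothetical, the three inputs are modeled as separate byte streams of equal length, and the tie-breaking order (a, then b, then c) is taken from the PNG specification rather than from the text.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>  /* abs */

#define BLOCK 16  /* pixels per iteration, one 16-byte section per stream */

/* Scalar rule, used for the postprocessing tail (hypothetical helper). */
static uint8_t paeth_scalar(uint8_t a, uint8_t b, uint8_t c)
{
    int p = a + b - c;
    int pa = abs(p - a), pb = abs(p - b), pc = abs(p - c);
    if (pa <= pb && pa <= pc) return a;
    return (pb <= pc) ? b : c;
}

/* Blocked kernel: a, b, c are input streams of n bytes each,
 * out receives n predictions. */
void paeth_blocked(const uint8_t *a, const uint8_t *b, const uint8_t *c,
                   uint8_t *out, size_t n)
{
    size_t i = 0;                          /* loop initialization        */
    for (; i + BLOCK <= n; i += BLOCK) {   /* loop control and pointer   */
        int16_t wa[BLOCK], wb[BLOCK], wc[BLOCK], res[BLOCK];  /* update  */

        /* Load + unpack: widen 8-bit storage to 16-bit computation format. */
        for (int k = 0; k < BLOCK; k++) {
            wa[k] = a[i + k];
            wb[k] = b[i + k];
            wc[k] = c[i + k];
        }

        /* Compute: p lies in [-255, 510], so 16 bits are sufficient. */
        for (int k = 0; k < BLOCK; k++) {
            int16_t p  = (int16_t)(wa[k] + wb[k] - wc[k]);
            int16_t pa = (int16_t)abs(p - wa[k]);
            int16_t pb = (int16_t)abs(p - wb[k]);
            int16_t pc = (int16_t)abs(p - wc[k]);
            res[k] = (pa <= pb && pa <= pc) ? wa[k]
                   : (pb <= pc)             ? wb[k]
                                            : wc[k];
        }

        /* Pack + store: narrow back to 8 bits and write 16 predictions. */
        for (int k = 0; k < BLOCK; k++)
            out[i + k] = (uint8_t)res[k];
    }

    /* Loop postprocessing: remaining pixels are handled in scalar mode. */
    for (; i < n; i++)
        out[i] = paeth_scalar(a[i], b[i], c[i]);
}
```

In a real AltiVec implementation each inner `for` loop over `k` would correspond to a small number of vector instructions; the sketch only shows where the unpack, compute, pack, and postprocessing work sits relative to the loop.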
From Figure 3 we observe that
a loop iteration computes 16 Paeth prediction values and
requires the execution of
3 (load) + 6 (unpack) + 76 (compute) + 1 (pack)
+ 1 (store) + 2 (miscellaneous) + 3 (pointer update) + 3 (loop control) = 95 instructions.
This implies that, on a 4-way issue processor,
the execution time of the loop cannot be reduced below
95/4 = 23.75 cycles per iteration, that is, 24 whole cycles.
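The bound follows from dividing the instruction count of one iteration by the issue width and rounding up, since at most four instructions can issue per cycle. The short C sketch below reproduces this arithmetic; the function names are hypothetical, and the per-group counts are the ones quoted in the text.

```c
/* Instructions per loop iteration, summed over the groups named in the text. */
unsigned paeth_iteration_instructions(void)
{
    return 3    /* load           */
         + 6    /* unpack         */
         + 76   /* compute        */
         + 1    /* pack           */
         + 1    /* store          */
         + 2    /* miscellaneous  */
         + 3    /* pointer update */
         + 3;   /* loop control   */
}

/* Lower bound on whole cycles per iteration when at most issue_width
 * instructions issue per cycle: ceil(instructions / issue_width). */
unsigned min_cycles(unsigned instructions, unsigned issue_width)
{
    return (instructions + issue_width - 1) / issue_width;
}
```

With 95 instructions and an issue width of 4, `min_cycles` yields 24 cycles for 16 pixels, or 1.5 cycles per pixel at best. This is an idealized bound that ignores dependencies, cache misses, and branch effects.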