Sun's Visual Instruction Set
Sun's UltraSPARC is a superscalar, fourth-generation RISC processor. It can issue
up to four instructions per pipeline cycle. It is also Sun's first 64-bit
machine: the main datapaths are 64 bits wide, which is ideal for graphics or
multimedia data processing. It has four execution units: the Branch, Integer
Execution, Load/Store, and Floating-Point/Graphics Units.
With UltraSPARC, Sun's architects added special graphics/math processing
capabilities to the FPU, now known as the Floating-Point/Graphics Execution
Unit (FPU/GEU). The FPU/GEU can execute up to two floating-point/graphics
operations (FGops) and one floating-point load/store operation in each pipeline
clock cycle. These operations, except for divide and square root, are fully
pipelined. They can complete out of order without stalling the execution of FGops.
Figure 5 shows the UltraSPARC execution units, including the FPU/GEU and its
five sub-units. The FPU/GEU works with its own set of 32 double-precision
floating-point registers. It also has four sets of FP condition code registers
for more parallelism.

Figure 5: UltraSPARC execution units
In processing image data, the code uses the UltraSPARC's 32 integer registers
for addressing the data, and the 32 FPU registers for manipulating images.
Pixel information is typically stored as four 8-bit or 16-bit integer values.
Typically these four values represent the red, green, blue (RGB) and alpha
(α) information for the pixel. Figure 6 shows the UltraSPARC V9 packed data formats.

Figure 6: Partitioned data formats
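As a rough illustration of the layout described above, the following portable C sketch packs four 16-bit components into one 64-bit word and extracts them again. The helper names pack4x16 and lane16 are hypothetical, and placing the first component in the most significant lane is an assumption made for this example, not something taken from the VIS documentation.

```c
#include <stdint.h>

/* Hypothetical helper: pack four 16-bit components into one 64-bit word.
   Component c0 goes into the most significant 16 bits (an assumption). */
static uint64_t pack4x16(uint16_t c0, uint16_t c1, uint16_t c2, uint16_t c3)
{
    return ((uint64_t)c0 << 48) | ((uint64_t)c1 << 32) |
           ((uint64_t)c2 << 16) |  (uint64_t)c3;
}

/* Hypothetical helper: extract lane i (0..3) from a packed word,
   using the same lane order as pack4x16. */
static uint16_t lane16(uint64_t word, int i)
{
    return (uint16_t)(word >> (48 - 16 * i));
}
```

With this layout, one 64-bit register holds a complete pixel when each of the four components is 16 bits wide.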
The UltraSPARC Visual Instruction Set instructions include: pixel expand/pack,
partitioned add/multiply/compare, align, edge handling, array addressing, merge,
pixel distance and logical operations. Figure 7 shows the addition of two
64-bit partitioned words with four 16-bit components.

Figure 7: UltraSPARC partitioned multiply
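The partitioned addition described above can be modelled in portable C. The sketch below is a reference model only, not how the hardware or any library implements it. The function name partitioned_add16 is hypothetical, and the code assumes that each 16-bit lane wraps on overflow rather than saturating.

```c
#include <stdint.h>

/* Portable reference model of a partitioned 16-bit add (hypothetical name).
   Each 16-bit lane is added independently and wraps modulo 2^16;
   no carry crosses from one lane into the next. */
static uint64_t partitioned_add16(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int shift = 0; shift < 64; shift += 16) {
        uint16_t x = (uint16_t)(a >> shift);
        uint16_t y = (uint16_t)(b >> shift);
        uint16_t sum = (uint16_t)(x + y);   /* wraps within the lane */
        result |= (uint64_t)sum << shift;
    }
    return result;
}
```

The difference from an ordinary 64-bit add shows up when a lane overflows: the carry is discarded instead of spilling into the neighbouring component.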
The GNU assembler has a very modular structure. Adding the new Sun UltraSPARC
VIS instructions involved little more than adding a few operands to the
opcode table. The compiler itself does not generate the VIS graphics
instructions. Most programmers use these operations in assembler, typically
in libraries and special utilities. However, GCC's inline assembly capability
enables developers to deploy the VIS multi-field processing features in key
inner loop routines and functions.
The assembler enables developers to write assembly code that uses instructions
such as the following.
fpadd16 %f2,%f4,%f6
This instruction adds the four 16-bit values packed in each of %f2
and %f4 and stores the result in %f6. Inside a C function,
we can use GCC's inline assembly feature as follows.
typedef double pixel4x16;
static inline pixel4x16 fpadd16 (pixel4x16 a, pixel4x16 b)
{
    pixel4x16 result;
    /* The "e" constraint asks GCC for a floating-point register operand. */
    asm ("fpadd16 %1, %2, %0" : "=e" (result) : "e" (a), "e" (b));
    return result;
}
This function adds together two sets of four 16-bit pixel values.
If we have a frame buffer, we can then take advantage of GCC's ability
to inline functions and write the following routine.
void vector_fpadd16 (pixel4x16 *vec1, pixel4x16 *vec2, pixel4x16 *res, int len)
{
    while (len-- > 0) {
        /* Advance both inputs and the output by one group of four pixels. */
        *res++ = fpadd16 (*vec1++, *vec2++);
    }
}
This routine adds together two vectors, or equivalently, two regions
of memory, composed of pixel values. The code generated by the compiler
for this routine looks like the following.
vector_fpadd16:
        cmp     %o3,0             ! Check that len is not initially zero or negative.
        ble     %icc,L4           ! Exit early if so.
        nop                       ! (SPARC branch delay slot).
L5:     ldd     [%o1],%f50        ! Get four 16-bit pixel values from vec2.
        add     %o1,8,%o1         ! Advance vec2 to the next group of pixels.
        ldd     [%o0],%f48        ! Get four 16-bit pixel values from vec1.
        add     %o0,8,%o0         ! Advance vec1 to the next group of pixels.
        fpadd16 %f48,%f50,%f52    ! Add the two sets of pixels together. (Performed inline).
        std     %f52,[%o2]        ! Store the resulting pixel values to memory.
        add     %o3,-1,%o3        ! Decrement len.
        cmp     %o3,0             ! Test if we are finished.
        bg      %icc,L5           ! Repeat for next set of pixels.
        add     %o2,8,%o2         ! Advance the result location. (SPARC branch delay slot).
L4:     retl                      ! Done. Return from routine.
        nop                       ! (SPARC branch delay slot).
As the resulting machine code shows, a well-designed compiler is able
to provide end users with high performance access to instruction set extensions,
such as the VIS instruction set. Equally important, the end user is able
to access such features entirely from within a C or C++ program.
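One way to keep such code buildable on machines without VIS is to guard the inline assembly and fall back to plain C. The following is a minimal sketch, not part of the original routine. The name fpadd16_portable is hypothetical, the fallback assumes that lanes wrap on overflow, and the guard assumes that GCC defines __VIS__ when VIS code generation is enabled (for example with -mvis).

```c
#include <stdint.h>
#include <string.h>

typedef double pixel4x16;

/* Hypothetical wrapper: use the VIS instruction when available,
   otherwise emulate the partitioned add lane by lane. */
static inline pixel4x16 fpadd16_portable(pixel4x16 a, pixel4x16 b)
{
#if defined(__sparc__) && defined(__VIS__)   /* assumed feature macros */
    pixel4x16 result;
    __asm__("fpadd16 %1, %2, %0" : "=e" (result) : "e" (a), "e" (b));
    return result;
#else
    uint64_t x, y, r = 0;
    pixel4x16 result;
    /* Reinterpret the 64 bits of each operand without changing them. */
    memcpy(&x, &a, sizeof x);
    memcpy(&y, &b, sizeof y);
    for (int shift = 0; shift < 64; shift += 16) {
        uint16_t sum = (uint16_t)((uint16_t)(x >> shift) + (uint16_t)(y >> shift));
        r |= (uint64_t)sum << shift;         /* each lane wraps on its own */
    }
    memcpy(&result, &r, sizeof result);
    return result;
#endif
}
```

Callers such as vector_fpadd16 could then use the wrapper unchanged on either kind of target, although only the VIS path gains the single-instruction speedup.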