Evaluating Hardware/Software Tradeoffs - Hardware extensions and the compilers that make them real



Sun's Visual Instruction Set

Sun's UltraSPARC is a superscalar, fourth-generation RISC processor that can issue up to four instructions per pipeline cycle. It is also Sun's first 64-bit machine: the main datapaths are 64 bits wide, which suits graphics and multimedia data processing. It has four execution units: the Branch Unit, the Integer Execution Unit, the Load/Store Unit, and the Floating-Point/Graphics Unit.

With UltraSPARC, Sun's architects added special graphics and math processing capabilities to the FPU, now known as the Floating-Point/Graphics Execution Unit (FPU/GEU). The FPU/GEU can execute up to two floating-point/graphics operations (FGops) and one floating-point load/store operation in each pipeline clock cycle. These operations, except for divide and square root, are fully pipelined. They can complete out of order without stalling the execution of FGops. Figure 5 shows the UltraSPARC execution units, including the FPU/GEU and its five sub-units. The FPU/GEU works with its own set of 32 double-precision floating-point registers. It also has four sets of floating-point condition code registers for more parallelism.

Figure 5:  UltraSPARC execution units

When processing image data, the code uses the 32 UltraSPARC integer registers to address the data and the 32 FPU registers to manipulate the images. Pixel information is typically stored as four 8-bit or 16-bit integer values. These four values usually represent the red, green, blue (RGB) and alpha (α) information for the pixel. Figure 6 shows the UltraSPARC V9 packed data formats.

Figure 6:  Partitioned data formats
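To make the packed layout concrete, here is a minimal sketch in portable C of four 16-bit components held in one 64-bit word. The names `pack4x16` and `lane4x16` are hypothetical, and the lane order (lane 0 in the most significant 16 bits) is an assumption chosen for illustration, not something taken from Figure 6.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four 16-bit components (for example R, G, B and alpha) into one
 * 64-bit word. Assumption: lane 0 occupies the most significant bits. */
static uint64_t pack4x16(uint16_t c0, uint16_t c1, uint16_t c2, uint16_t c3)
{
    return ((uint64_t)c0 << 48) | ((uint64_t)c1 << 32) |
           ((uint64_t)c2 << 16) | (uint64_t)c3;
}

/* Extract one 16-bit component; lane must be in the range 0..3. */
static uint16_t lane4x16(uint64_t word, unsigned lane)
{
    assert(lane < 4);
    return (uint16_t)(word >> (48 - 16 * lane));
}
```

For instance, `lane4x16(pack4x16(r, g, b, a), 3)` would return the alpha component under this assumed ordering.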

The UltraSPARC Visual Instruction Set (VIS) includes the following kinds of instructions: pixel expand/pack, partitioned add/multiply/compare, align, edge handling, array addressing, merge, pixel distance and logical operations. Figure 7 shows the addition of two 64-bit partitioned words with four 16-bit components.

Figure 7:  UltraSPARC partitioned multiply
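The partitioned addition described above can be imitated in portable C, which may help when reasoning about what the hardware does in a single instruction. The sketch below is an emulation, not the VIS implementation: `partitioned_add16` and `LANE_MSB` are hypothetical names, and it assumes each 16-bit lane wraps modulo 2^16 with no carry passing into the neighbouring lane.

```c
#include <assert.h>
#include <stdint.h>

/* Mask selecting the top bit of each of the four 16-bit lanes. */
#define LANE_MSB UINT64_C(0x8000800080008000)

/* Add four 16-bit lanes at once. The low 15 bits of each lane are added
 * normally, so any carry stops at bit 15 of the same lane; the top bit of
 * each lane is then fixed up with XOR, which discards the carry out. */
static uint64_t partitioned_add16(uint64_t a, uint64_t b)
{
    uint64_t low = (a & ~LANE_MSB) + (b & ~LANE_MSB);
    return low ^ ((a ^ b) & LANE_MSB);
}

/* Small self-check: lanes add independently and wrap modulo 2^16. */
static void partitioned_add16_check(void)
{
    assert(partitioned_add16(UINT64_C(0x0001000200030004),
                             UINT64_C(0x0010002000300040))
           == UINT64_C(0x0011002200330044));
    assert(partitioned_add16(UINT64_C(0x000000000000FFFF),
                             UINT64_C(0x0000000000000001)) == 0);
}
```

The emulation needs several integer operations for one 64-bit word, which gives a rough sense of what a single partitioned instruction saves.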

The GNU assembler has a very modular structure. Adding the new Sun UltraSPARC VIS instructions involved little more than adding a few operands to the opcode table. The VIS graphics instructions are not represented in the compiler itself. Most programmers use these operations in assembler, typically in libraries and special utilities. However, GCC's inline assembly capability lets developers use the VIS multi-field processing features in key inner-loop routines and functions.

The assembler lets developers write assembly code that uses instructions such as the following.

             fpadd16    %f2,%f4,%f6

This instruction adds the four 16-bit integer values packed in %f2 to the corresponding four values in %f4 and stores the result in %f6. Inside a C function, we can use GCC's inline assembly feature as follows.

    typedef double pixel4x16;
    static inline pixel4x16 fpadd16 (pixel4x16 a, pixel4x16 b)
    {
        pixel4x16 result;
        /* "e" asks GCC for a SPARC floating-point register operand. */
        asm("fpadd16 %1, %2, %0" : "=e" (result) : "e" (a), "e" (b));
        return result;
    }

This function adds two sets of four 16-bit pixel values. Given a frame buffer, we can then take advantage of GCC's ability to inline functions and write the following.

    void vector_fpadd16 (pixel4x16 *vec1, pixel4x16 *vec2, pixel4x16 *res, int len)
    {
        /* Each iteration adds one group of four 16-bit pixel values. */
        while (len-- > 0) {
            *res++ = fpadd16 (*vec1++, *vec2++);
        }
    }

This routine adds two vectors or, equivalently, two regions of memory composed of pixel values. The compiler generates the following code for this routine.

  vector_fpadd16:
      cmp      %o3,0           ! Check that len is not initially zero or negative.
      ble      %icc,L4         ! Exit early if so.
      nop                      ! (SPARC branch delay slot).
  L5: ldd      [%o1],%f50      ! Get first set of four 16-bit pixel values.
      add      %o1,8,%o1       ! Advance first set to next group of pixels.
      ldd      [%o0],%f48      ! Get second set of pixel values.
      add      %o0,8,%o0       ! Advance second set to next group of pixels.
      fpadd16  %f48,%f50,%f52  ! Add the two sets of pixels together. (Performed inline).
      std      %f52,[%o2]      ! Store the resulting pixel values to memory.
      add      %o3,-1,%o3      ! Decrement len.
      cmp      %o3,0           ! Test if we are finished.
      bg       %icc,L5         ! Repeat for next set of pixels.
      add      %o2,8,%o2       ! Advance the result location. (SPARC branch delay slot).
  L4: retl                     ! Done. Return from routine.
      nop                      ! (SPARC branch delay slot).

As the resulting machine code shows, a well-designed compiler can give end users high-performance access to instruction set extensions such as VIS. Equally important, the end user can access such features entirely from within a C or C++ program.
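When code like this is developed on a host without VIS hardware, a plain scalar version of the same loop can serve as a reference for checking results. The following is a sketch under stated assumptions: `vector_add16_ref` is a hypothetical helper, it treats each pixel group as four `uint16_t` values, and it assumes the packed add wraps modulo 2^16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference for the packed vector add: `groups` counts groups of
 * four 16-bit values, matching `len` in vector_fpadd16. Each sum is
 * truncated to 16 bits (assumed wraparound, no saturation). */
static void vector_add16_ref(const uint16_t *vec1, const uint16_t *vec2,
                             uint16_t *res, size_t groups)
{
    assert(vec1 != NULL && vec2 != NULL && res != NULL);
    for (size_t i = 0; i < groups * 4; i++) {
        res[i] = (uint16_t)(vec1[i] + vec2[i]);
    }
}
```

This version performs four separate additions per pixel group, where the generated SPARC code above uses one fpadd16.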

