
Re: dgemm benchmarks for axp Linux and Dec Unix



Here is an updated comparison of three different dgemm
(double precision matrix-matrix multiply) packages
on axp Linux and Dec Unix.  A few problems in the benchmark
code have been fixed, particularly the roundoff
problem imposed by a 1 second resolution in the Fortran
time() call.  This version implements the dtime(tarray)
call with a resolution of 1/60 second, good enough
for a five second test.  The most notable
result of this fix is a lowering of the DEC DXML numbers. 
Matrices are 1012x1012 for the
public domain (netlib) dgemm, and 3000x3000 for the assembler
dgemm.
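
As a rough illustration of why the timer resolution matters, the sketch
below converts a timed dgemm call into MFlops and bounds the relative
error that a coarse clock tick can introduce.  It assumes the conventional
2*M*N*K operation count for dgemm; the post does not say which count the
timing program uses, and the function names are made up for this example.

```c
#include <assert.h>

/* Floating-point operations for C = alpha*A*B + beta*C, with A m-by-k and
   B k-by-n, using the conventional 2*m*n*k count (an assumption; the
   timing program's own count is not stated in the post). */
double dgemm_flops(int m, int n, int k)
{
    return 2.0 * (double)m * (double)n * (double)k;
}

/* MFlops achieved by one dgemm call that took `seconds` to run. */
double mflops(int m, int n, int k, double seconds)
{
    assert(seconds > 0.0);
    return dgemm_flops(m, n, k) / seconds / 1.0e6;
}

/* Approximate worst-case relative error of an elapsed time measured with
   a clock that advances every `tick` seconds. */
double timer_rel_error(double tick, double elapsed)
{
    assert(elapsed > 0.0);
    return tick / elapsed;
}
```

With these definitions, a 1 second tick on a five second test can be off
by as much as 20%, while a 1/60 second tick keeps the error near 0.3%.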

Results for two versions of 
Kazushige Goto's assembler dgemm code
are shown.  This code is 15-30% faster
than DEC's own DXML dgemm for aligned problems.
It's remarkable.  
Kazushige Goto's address: <goto@statabo.rim.or.jp>

The 625 MHz alpha is a 6 cpu Dec 8400 (21164A) running Dec Unix.
On chip: 8 Kbyte instruction cache, 8 Kbyte data cache, 96-Kbyte 
write-back second level cache.  On board: 4 Mbyte third level
cache for each cpu.  Tests were run in the foreground with
interactive priority while the machine was in full use running
lower priority batch jobs.

The 433 MHz machine is an Alpine Durango PC164 (21164A) running Linux.
On chip: same as for 8400.  On board: 1 Mbyte third level cache.
This machine was otherwise quiescent during the tests.

------------------------------------------------------------------------------
Tests with maximum matrix dimension = 2048 and N=M=K=1012, average of 4 runs;
standard deviations were all about 2 MFlops.
______________________________________________________________________________
MFlops - CPU MHz -- OS --- Compilers ----- BLAS
 
657. --- 625 --- Dec Unix -- Dec -------- Kazushige Goto's fast dgemm
631. --- 625 --- Dec Unix -- Dec -------- Kazushige Goto's V.1 dgemm 
575. --- 625 --- Dec Unix -- Dec -------- Dec DXML dgemm
460. --- 433 --- Linux ----- Dec -------- Kazushige Goto's fast dgemm
453. --- 433 --- Linux --- egcs-980221 -- Kazushige Goto's fast dgemm
407. --- 433 --- Linux --- egcs-980221 -- Kazushige Goto's V.1 dgemm
390. --- 433 --- Linux ----- Dec -------- Kazushige Goto's V.1 dgemm
 27. --- 625 --- Dec Unix -- Dec -------- Public Domain dgemm (netlib)
 23. --- 433 --- Dec Unix -- Dec -------- Public Domain dgemm (netlib)
 24. --- 433 --- Linux --- egcs-980221 -- Public Domain dgemm (netlib)
______________________________________________________________________________

Tests with maximum matrix dimension = 2048 and N=M=K=2000, average of 3 runs;
standard deviations are in parentheses.
______________________________________________________________________________
MFlops - CPU MHz -- OS --- Compilers ----- BLAS
 
706(5)-- 625 --- Dec Unix -- Dec -------- Kazushige Goto's fast dgemm
668(4)-- 625 --- Dec Unix -- Dec -------- Kazushige Goto's V.1 dgemm 
548(7)-- 625 --- Dec Unix -- Dec -------- Dec DXML dgemm
456(5)-- 433 --- Linux ----- Dec -------- Kazushige Goto's fast dgemm
434(7)-- 433 --- Linux --- egcs-980221 -- Kazushige Goto's fast dgemm
429(1)-- 433 --- Linux ----- Dec -------- Kazushige Goto's V.1 dgemm
390(1)-- 433 --- Linux --- egcs-980221 -- Kazushige Goto's V.1 dgemm

------------------------------------------------------------------------------
Tests with maximum matrix dimension = 4000 and N=M=K=1012, average of 4 runs;
standard deviations are shown in parentheses.  Paging is a factor here on the
PC164.  Performance appears to be the same on the 8400 with N=M=2000.
______________________________________________________________________________
MF(std)- CPU MHz -- OS --- Compilers ----- BLAS
 
652(21)- 625 --- Dec Unix -- Dec -------- Kazushige Goto's fast dgemm
630(6)-- 625 --- Dec Unix -- Dec -------- Kazushige Goto's V.1 dgemm 
531(11)- 625 --- Dec Unix -- Dec -------- Dec DXML dgemm
   . --- 433 --- Linux ----- Dec -------- Kazushige Goto's fast dgemm
327(42)- 433 --- Linux --- egcs-980221 -- Kazushige Goto's fast dgemm
331(33)- 433 --- Linux --- egcs-980221 -- Kazushige Goto's V.1 dgemm
   . --- 433 --- Linux ----- Dec -------- Kazushige Goto's V.1 dgemm
______________________________________________________________________________
I used a modified Level 3 BLAS timing routine:
>       PROGRAM DB3TIM
> *  -- Written on 1-July-1988.
> *     Jeremy Du Croz and Mick Pont, NAG Central Office.
It was changed to use the dtime(tarray) call on 13 March 1998.
Only the transa = transb = 'N' cases are shown.

> The assembler dgemm, written specifically for the
> 21164A, requires that the matrix size
> be aligned (a multiple of four).  It drops back
> to a somewhat tuned C language routine running at
> about half the speed of the assembler when the size
> is not aligned.  The assembler routine is specific
> to the transa = transb = 'N' case.  Cases requiring
> a transpose use the tuned C routines.
> The speed in MFlops varied significantly with the
> size of the problem.
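
The fallback behaviour described above can be sketched as a small dispatch
test.  This is only an illustration, not Kazushige Goto's actual code: the
enum and function names are hypothetical, and it assumes the
multiple-of-four requirement applies to all three dimensions M, N and K,
which the description does not spell out.

```c
#include <assert.h>

/* Hypothetical labels for the two code paths described in the post. */
enum dgemm_path { PATH_ASSEMBLER, PATH_TUNED_C };

/* Nonzero when a dimension is a multiple of four. */
int is_aligned4(int n)
{
    return n % 4 == 0;
}

/* Pick a code path: the assembler kernel only when neither operand is
   transposed and every dimension is a multiple of four (assumed to apply
   to m, n and k alike); otherwise the tuned C routine. */
enum dgemm_path choose_dgemm_path(char transa, char transb,
                                  int m, int n, int k)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    int no_trans = (transa == 'N' || transa == 'n') &&
                   (transb == 'N' || transb == 'n');
    if (no_trans && is_aligned4(m) && is_aligned4(n) && is_aligned4(k))
        return PATH_ASSEMBLER;
    return PATH_TUNED_C;
}
```

Under these assumptions the benchmark sizes 1012 and 2000 both take the
assembler path, while a size such as 1013, or any transposed case, would
fall back to the C routine.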



-- 
Bob Williams, http://bob.usuf2.usuhs.mil/


