P4s, Athlons and bandwidth

Wed Aug 13 22:43:45 UTC 2003

On Wed, 2003-08-13 at 22:50, Jakub Jelinek wrote:
> On Wed, Aug 13, 2003 at 10:23:01PM +0200, Jean Francois Martinez wrote:
> > Given that most/all of the recent boxes (ie the ones doing the real
> > work) are P4s and Athlons it is time RedHat stopped compiling
> > with -mcpu=i686 and started optimizing for the P4: -mcpu=p4
> 
> RHL glibc is compiled with -march=i686 actually, and there are not
> many instructions other than those enabled by -mfpmath=sse
> which the compiler would generate for normal code with -march=pentiumiii
> and not -march=i686 (the only difference is scheduling and to my knowledge
> the difference is not very big between i686 and PIII).
> -mfpmath=sse is not usable for libm, because glibc on IA-32 relies
> on extended precision in several places.
> Scheduling difference between P4 and i686 is bigger, but I don't think
> that code runs that well on Athlons.

I think I was unclear: for normal packages I advocate the use of
-mcpu=p4 (GCC spells it -mcpu=pentium4), given that the PII/PIII boxes
are being phased out, and all the more so on boxes where the user
requires high performance.

I will do some benchmarks on an Athlon to see which code runs better
there, but given that P4s outnumber Athlons I would tend to say
"to hell with Athlons" despite having one.

> 
> > Another point is that there is no such thing like low-level glibc
> > functions for the P4 and the Athlon.  The highest targetted
> > processor is the PIII.  However documents in AMD's web site show
> > that moving data (ie memcpy and friends) can be made several times
> > faster if using 3DNow instructions and data prefetching, I gave only
> > a cursory glance to the assembler parts of glibc but it didn't look
> > like those parts (targetting the PIII) would be even remotely ideal
> > for the Athlon.  Same thing about the P4.
> 
> Where have you seen PIII optimized assembly in glibc? AFAIK there is none.
> P4/Athlon/PIII optimized stringops are certainly welcome (patches to
> libc-alpha at sources.redhat.com), but bear in mind that any use of floating
> point regs (SSE/SSE2/whatever) has quite a big price in lazy FPU saving
> environment. Another thing to keep in mind is what are typical arguments
> to these functions.
> 

I should have said i686 instead of PIII.  There are stringops specific
to the PPro family, aka i686, in glibc, but from distant memory they
target the "minimal" processor in the family, i.e. the PPro, and
therefore don't use MMX or SSE.

As for the arguments to these functions: strcpy and friends tend to act
on small volumes, but I am not so sure this is also true of the memcpy
family.
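
One way to settle the question of typical argument sizes is to count
them.  The following is only a minimal sketch, assuming a wrapper is
called in place of memcpy in the program under study; the names
counted_memcpy, bucket_for and size_histogram are hypothetical and
nothing like them exists in glibc.

```c
#include <stddef.h>
#include <string.h>

#define NBUCKETS 32

/* size_histogram[b] counts the copies whose length n satisfies
   2^b <= n < 2^(b+1); lengths 0 and 1 both land in bucket 0. */
unsigned long size_histogram[NBUCKETS];

/* Return floor(log2(n)), clamped to the last bucket. */
unsigned bucket_for(size_t n)
{
    unsigned b = 0;

    while (n > 1 && b < NBUCKETS - 1) {
        n >>= 1;
        b++;
    }
    return b;
}

/* Record the size of the request, then do the copy with plain memcpy. */
void *counted_memcpy(void *dst, const void *src, size_t n)
{
    size_histogram[bucket_for(n)]++;
    return memcpy(dst, src, n);
}
```

If most of the counts end up in the low buckets, the setup cost of a
fancier copy routine matters more than its peak bandwidth.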

The following is from distant memory, so I may be incorrect:
1) AMD's documentation says that switching between 3DNow and FP
instructions carries a near-zero penalty on the Athlon (contrary to the
K6, where it was very slow).  I suppose the same is true for the P4
versus the PII/PIII (a much smaller penalty on the newer family).
2) There is a nice performance bonus to be reaped from an Athlon
instruction that prefetches data into the cache.
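
To make point 2 concrete, here is a rough sketch of a copy loop that
asks for data ahead of time.  It is not AMD's routine and not glibc
code: it uses GCC's __builtin_prefetch instead of hand-written 3DNow
assembly, and the chunk size and prefetch distance are guesses of mine
that would need tuning on real hardware.

```c
#include <stddef.h>
#include <string.h>

/* Assumed tuning values, not taken from any vendor documentation. */
#define CHUNK_BYTES    64u    /* assumed cache-line size */
#define PREFETCH_AHEAD 256u   /* assumed prefetch distance in bytes */

/* Copy n bytes from src to dst (non-overlapping), hinting the CPU to
   start loading source data a few cache lines ahead of the copy. */
void *copy_with_prefetch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= CHUNK_BYTES) {
        /* Only prefetch addresses that still lie inside the source. */
        if (n > PREFETCH_AHEAD)
            __builtin_prefetch(s + PREFETCH_AHEAD, 0, 0);
        memcpy(d, s, CHUNK_BYTES);
        d += CHUNK_BYTES;
        s += CHUNK_BYTES;
        n -= CHUNK_BYTES;
    }
    memcpy(d, s, n);          /* remaining tail, possibly zero bytes */
    return dst;
}
```

Whether this beats the existing i686 stringops would have to be
measured; for small copies the extra branch is likely pure overhead.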

Now, Red Hat has more than one way to get P4-optimized stringops into
glibc.  One is to have the idea suggested by one of the Red Hat people
who are involved in glibc maintenance.  Another is to go to AMD and
tell them how bad it is that Athlon64 performance is being hampered by
a few relatively small but critically important functions.  When you
get them, go to Intel and tell them how bad it is that the Athlon64 is
running circles around the P4 due to the lack of optimized versions of
these functions.  :-)

-- 
Jean Francois Martinez <jfm512 at free.fr>