cmov vs cmp+jxx across x86 CPU implementations
John Reiser
jreiser at BitWagon.com
Wed Feb 4 17:13:58 UTC 2009
Ulrich Drepper wrote:
> Dominik 'Rathann' Mierzejewski wrote:
>> I'd like to see a case (not involving Pentium 4) where using cmov is slower
>> than not using it. It definitely is faster for decoding H.264 in FFmpeg
>> for example.
>
> I don't have a specific test case. But I do talk to the CPU
> architects at Intel regularly. They always say that cmov should be
> avoided. Especially with the introduction of the fused micro-ops the
> various cmp+jcc pairs are likely to be faster.
Always demand measurements. See below for seven different chips which
span a decade of implementation. Cmov is faster when the jxx branch predictor
would fail [Pentium4 NetBurst can be an exception], and cmov wins by a very
large margin on CoreDuo and Core2Duo.
> And from the code generation perspective using cmp+jcc is also more
> flexible. With cmov you have to tie up two registers. This is
> particularly bad with the x86 ABI.
The frequent case of computing minimum or maximum requires only one register:
        mov   m(%ebp),%eax
        cmp   n(%ebp),%eax
        cmova n(%ebp),%eax
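
For readers who think in C, here is a minimal sketch of the same unsigned minimum in two source forms. The function names are hypothetical and are not part of the measured programs. A compiler may turn either form into cmov or into a branch depending on target and options, so the only way to know is to inspect the generated assembly.

```c
#include <stdint.h>

/* Hypothetical helper: unsigned minimum as a conditional expression.
 * Compilers often, but not always, lower this to cmp + cmov. */
uint32_t min_u32_select(uint32_t m, uint32_t n)
{
    return (m > n) ? n : m;
}

/* Hypothetical helper: the same result written with an explicit branch,
 * mirroring the "load m, overwrite with n if m is above n" shape. */
uint32_t min_u32_branch(uint32_t m, uint32_t n)
{
    uint32_t r = m;
    if (m > n)
        r = n;
    return r;
}
```

Both functions compare as unsigned, matching the "above" condition used by cmova.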
> There are certainly cases where cmov can be faster. Perhaps exclusively
> on older micro architectures (P4s, early Core2, maybe AMD, haven't
> checked). But in general it's no win.
Please give measurements. Mine show that the newer the chip,
the more cmov wins when the jxx branch predictor would fail.
[Core i7 untested.]
-----
User CPU time in seconds (smaller is better.)
"for i in 1 2 3 4 5; do time ./XXXXX; done"
[dual processor often reflects alternating core assignment!]
 cmov2   cmp-jmp2   CPU
------   --------   ---

Family 6 Model 23 (Core2 Duo E8400; 3000MHz)
 2.873     6.096
 2.873     6.029
 2.868     6.135
 2.875     6.038
 2.868     6.079

Family 15 Model 107 (Athlon64x2 4800+; 2500MHz)
 3.182     4.433
 3.529     4.433
 3.184     4.432
 3.543     4.437
 3.182     4.428

Family 15 Model 47 (Athlon64 3200+; 2000MHz)
 3.914     5.530
 3.913     5.529
 3.913     5.532
 3.911     5.533
 3.915     5.530

Family 6 Model 14 (CoreDuo 1300 [not Core2]; 1666MHz)
 4.746    10.638
 4.716    10.658
 4.723    10.630
 4.705    10.666
 4.705    10.657

Family 15 Model 2 (Pentium4 Northwood; 1600MHz)
12.081    11.129
12.089    11.137
12.081    11.133
12.081    11.225
12.081    11.165

Family 6 Model 7 (AMD Duron 1200MHz)
11.894    13.370
11.939    13.322
11.912    13.358
11.814    13.320
11.913    13.379

Family 6 Model 8 (PentiumIII Coppermine; 700MHz)
16.300    16.383
16.058    16.061
16.054    16.054
16.058    16.055
16.052    16.052
-----
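
To put a number on "how much", the sketch below divides the cmp-jmp2 time by the cmov2 time using the first run listed for each chip in the table above. The struct, array, and function names are hypothetical; only the timings come from the measurements. A ratio above 1.0 means cmov2 was faster on that run.

```c
#include <stddef.h>

/* One row per chip: first-run user CPU seconds copied from the table. */
struct timing {
    const char *cpu;
    double cmov2;
    double cmp_jmp2;
};

static const struct timing first_run[] = {
    { "Core2 Duo E8400",        2.873,  6.096 },
    { "Athlon64x2 4800+",       3.182,  4.433 },
    { "Athlon64 3200+",         3.914,  5.530 },
    { "CoreDuo",                4.746, 10.638 },
    { "Pentium4 Northwood",    12.081, 11.129 },
    { "AMD Duron",             11.894, 13.370 },
    { "PentiumIII Coppermine", 16.300, 16.383 },
};

static const size_t first_run_count =
    sizeof first_run / sizeof first_run[0];

/* Ratio of cmp-jmp2 time to cmov2 time; > 1.0 means cmov2 won. */
double cmov_advantage(const struct timing *t)
{
    return t->cmp_jmp2 / t->cmov2;
}
```

On these figures the ratio is a little over 2 for the Core2 Duo and the CoreDuo, and below 1 only for the Pentium4.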
----- cmov2.S; gcc -o cmov2 -nostartfiles -nostdlib cmov2.S
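        .balign 64
sub1:                                   /* %eax = unsigned min of the two stack slots */
        mov   -4(%ebp),%eax
        cmp   -8(%ebp),%eax
        cmova -8(%ebp),%eax             /* branch-free select */
        ret

_start: .globl _start
        nop
        and $~0<<6,%esp                 /* align the stack to 64 bytes */
        mov %esp,%ebp
        sub $4*4,%esp
        mov $0x10000000 -1,%ecx         /* loop body runs 0x10000000 times */
        mov $1,%esi
        mov $2,%edi
        jmp top

        .balign 64
top:                                    /* swap the two operands between calls */
        mov %esi,-4(%ebp); mov %edi,-8(%ebp); call sub1
        mov %esi,-8(%ebp); mov %edi,-4(%ebp); call sub1
        mov %esi,-4(%ebp); mov %edi,-8(%ebp); call sub1; call sub1
        mov %esi,-8(%ebp); mov %edi,-4(%ebp); call sub1; call sub1
        sub $1,%ecx; jnc top            /* exit when the subtract borrows */

        sub %ebx,%ebx                   /* exit status 0 */
        mov $1,%eax                     /* __NR_exit on i386 Linux */
        int $0x80
/* EOF */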
-----
----- cmp-jmp2.S; gcc -o cmp-jmp2 -nostartfiles -nostdlib cmp-jmp2.S
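        .balign 64
sub1:                                   /* %eax = unsigned min of the two stack slots */
        mov -4(%ebp),%eax
        cmp -8(%ebp),%eax; jbe 0f       /* skip the reload when -4(%ebp) <= -8(%ebp) */
        mov -8(%ebp),%eax
0:
        ret

_start: .globl _start
        nop
        and $~0<<6,%esp                 /* align the stack to 64 bytes */
        mov %esp,%ebp
        sub $4*4,%esp
        mov $0x10000000 -1,%ecx         /* loop body runs 0x10000000 times */
        mov $1,%esi
        mov $2,%edi
        jmp top

        .balign 64
top:                                    /* swap the two operands between calls */
        mov %esi,-4(%ebp); mov %edi,-8(%ebp); call sub1
        mov %esi,-8(%ebp); mov %edi,-4(%ebp); call sub1
        mov %esi,-4(%ebp); mov %edi,-8(%ebp); call sub1; call sub1
        mov %esi,-8(%ebp); mov %edi,-4(%ebp); call sub1; call sub1
        sub $1,%ecx; jnc top            /* exit when the subtract borrows */

        sub %ebx,%ebx                   /* exit status 0 */
        mov $1,%eax                     /* __NR_exit on i386 Linux */
        int $0x80
/* EOF */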
-----
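
The two programs differ only in sub1; the driving loop is identical. As a rough model of that loop in C (hypothetical names, and not a substitute for the assembly, since a C compiler chooses between cmov and a branch on its own), the sketch below replays the six calls per iteration and counts how often the second operand is selected. Half of the calls go each way, in an order that repeats every six calls.

```c
#include <stdint.h>

/* Model of one sub1 call: returns 1 when m is above n (unsigned),
 * i.e. when the cmova fires or the jbe falls through. */
static unsigned second_operand_selected(uint32_t m, uint32_t n)
{
    return m > n;
}

/* Hypothetical model of the `top:` loop. Each iteration makes six calls
 * with the operands 1 and 2 swapped in the same order as the assembly. */
uint64_t count_selected(uint32_t iterations)
{
    const uint32_t a = 1, b = 2;        /* %esi and %edi */
    uint64_t selected = 0;

    for (uint32_t i = 0; i < iterations; i++) {
        selected += second_operand_selected(a, b);  /* m=1, n=2: keep m   */
        selected += second_operand_selected(b, a);  /* m=2, n=1: select n */
        selected += second_operand_selected(a, b);  /* keep m, twice      */
        selected += second_operand_selected(a, b);
        selected += second_operand_selected(b, a);  /* select n, twice    */
        selected += second_operand_selected(b, a);
    }
    return selected;
}
```

Calling count_selected with N iterations returns 3*N, out of 6*N calls in total.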
--
John Reiser, jreiser at BitWagon.com