Timings in nanoseconds (user-time) per operation:

One operation | code snippet (float/double) | code snippet (integer) | R12000 (400 MHz) float | R12000 (400 MHz) double | R12000 (400 MHz) int | Pentium-III (1 GHz, icc 7) float | Pentium-III (1 GHz, icc 7) double | Pentium-III (1 GHz, icc 7) int | UltraSPARC-II (296 MHz) float | UltraSPARC-II (296 MHz) double | UltraSPARC-II (296 MHz) int | Intel Core 2 Duo (2.2 GHz) float | Intel Core 2 Duo (2.2 GHz) double | Intel Core 2 Duo (2.2 GHz) int
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
comparison | f[i]>0.5 | a[i]!=0 | 16 | 24 | 16 | 26 | 44 | 5.5 | 30.0 | 60.0 | 60.0 | 4.2 | 4.5 | 2.5 | |||
fetch&store | f[i]=f2[i] | a[i]=a2[i] | 11 | 22 | 10 | ? | ? | ? | 20.0 | 40.0 | 20.0 | 2.8 | 5.4 | 2.8 | |||
multiplication | acc*=f[i] | acc*=a[i] | 11 | 15 | 24 | 5 | 9 | 5 | 40.0 | 50.0 | 160.0 | 2.2 | 2.8 | 1.4 | |||
square | acc+=f[i]*f[i] | acc+=a[i]*a[i] | 1.5 | 1.7 | 0.9 | ||||||||||||
division | acc+=f[i]/f | acc+=a[i]/a | 40 | 56 | 59 | 6 | 10 | 6 | 40.0 | 90.0 | 220.0 | 1.4 | 1.8 | 5.6 | |||
addition | a+=a[i] | a+=a[i] | 17 | 29 | 14 | 5 | 11 | 6 | 20.0 | 50.0 | 20.0 | 1.4 | 1.7 | 0.9 | |||
sqrt | sqrt(f[i]) | - | 94 | 105 | - | ? | ? | - | 140.0 | 120.0 | - | 1.6 | 3.2 | ||||
compl. expr. | f=f(..) | a=a(..) | 21 | 21 | 16 | 31 | 25 | 8 | 30.0 | 50.0 | 160.0 | 1.7 | 1.4 | 1.5 | |||
abs | fabs[f](a[i]) | abs(a[i]) | 47 | 63 | 11 | ? | ? | ? | 140.0 | 60.0 | 40.0 | 0.8 | 1.5 | 0.7 | |||
? : | f[i]>0?f[i]:-f[i] | a[i]>0?a[i]:-a[i] | 21 | 30 | 11 | 33 | 50 | 22 | 40.0 | 60.0 | 40.0 | 0.8 | 1.5 | 0.7 | |||
makeabs | makeabs(f[i]) | makeabs(a[i]) | 2.1 | 4.2 | 2.1 | ||||||||||||
function call | f(a[i]) | f(a[i]) | 32 | 44 | 28 | 28 | 45 | 28 | ? | ? | ? | ||||||
fct call thru ptr | fct_ptr(a[i]) | fct_ptr(a[i]) | 104 | 107 | 107 | 53 | 67 | 53 | 5.6 | 5.8 | 5.5 | ||||||
pow | pow(f[i],1.1) | 40 | 140 |
Compile options:
R10000 (SGI): cc -O -n32 ifcomp.c -o ifcomp -lm (IRIX 6.5).
R12000: same as R10000.
Sun: cc -fast ifcomp.c -o ifcomp
Pentium-III: icc -O3 -march=pentiumiii -ip -rcd -unroll ifcomp.c -o ifcomp -lm (Intel compiler icc version 7.0)
Intel Core 2 Duo: gcc -O3 -funroll-loops -ffast-math -mfpmath=sse ifcomp.c -o ifcomp -lm
In all timings I tried to eliminate the CPU time caused by the loop construct itself and by any auxiliary operations.
I also tried to make sure that the compiler couldn't "optimize away" some of the code.
A "-" (minus) in the table above usually means that the operation is not applicable for that type.
An empty cell means that I haven't timed that operation on that platform (this usually happened when I introduced the test later).
A "?" means that the time reported by the program seemed to be bogus (maybe the compiler still "optimized away" that part).
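The two precautions mentioned above (subtracting the loop overhead and keeping the compiler from discarding the code) can be sketched as follows. This is only an illustration under my own assumptions, not the code from ifcomp.c: the helper names `time_baseline`, `time_multiply` and `net_ns_per_op` are hypothetical, and `clock()` reports total CPU time rather than user time alone.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helpers for illustration; not taken from ifcomp.c. */

/* Results are stored here so the compiler must keep the computation. */
static volatile double sink;

/* Baseline: the loop and the array reads only. Reading through a
   pointer to volatile forces one load per iteration. */
static double time_baseline(const volatile double *f, size_t n, int reps)
{
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++) {
        double last = 0.0;
        for (size_t i = 0; i < n; i++)
            last = f[i];
        sink = last;
    }
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* Same loop and reads, plus one multiplication per element. */
static double time_multiply(const volatile double *f, size_t n, int reps)
{
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++) {
        double acc = 1.0;
        for (size_t i = 0; i < n; i++)
            acc *= f[i];
        sink = acc;
    }
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* Net nanoseconds per operation after subtracting the baseline. */
static double net_ns_per_op(double t_measured, double t_baseline, double ops)
{
    assert(ops > 0.0);
    double net = t_measured - t_baseline;
    if (net < 0.0)
        net = 0.0;   /* timer noise can make the difference negative */
    return net * 1.0e9 / ops;
}
```

With `ops = n * reps`, calling `net_ns_per_op(time_multiply(f, n, reps), time_baseline(f, n, reps), ops)` gives an estimate of the same kind as the table entries. Because each multiplication depends on the previous one, this particular loop measures latency rather than throughput.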
makeabs is defined as makeabs(X) = *((unsigned int *)(&X)) &= 0x7fffffff.
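The mask 0x7fffffff clears the top bit of a 32-bit word, which is the sign bit of an IEEE 754 single-precision float. Accessing a float through an `unsigned int *` is undefined behaviour under the C aliasing rules, so here is a sketch of the same bit trick written with `memcpy`. The function name `makeabs_float` is hypothetical, and the code assumes that `float` is a 32-bit IEEE 754 value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the original source): clears the sign
   bit of a float, assuming float is a 32-bit IEEE 754 value. */
static float makeabs_float(float x)
{
    uint32_t bits;
    assert(sizeof bits == sizeof x);   /* assumption: 32-bit float */
    memcpy(&bits, &x, sizeof bits);    /* copy the bit pattern out */
    bits &= 0x7fffffffu;               /* clear bit 31, the sign bit */
    memcpy(&x, &bits, sizeof x);       /* copy the pattern back */
    return x;
}
```

Note that the same mask applied to a negative two's-complement int does not produce its absolute value (for example, -1 becomes 0x7fffffff), so the trick is only a true "abs" for the float case.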
Here is the C code I have used. It actually tests a few more operations than shown in the above table.
The benchmark program I used for the MFlops table below is called flops.c,
written by Al Aburto (version 2.0, Dec 1992).
If you want to try the benchmark on your CPU, here's
the source code.
I can't remember where I got it from.
The program was compiled to do 5000000 passes.
Here's the source of the benchmark program
(I think I got it from wuarchive).
The function f2() is simply the plain loop shown in the pointer aliasing section below (the one without the index array).
MFlops
The following table shows the performance in terms of MFLOPs.
The test program is organized in several modules.
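MFLOPs

Module | R10000 | UltraSPARC-II | Pentium-III | Pentium/eg | G3 | R5000 | R12000
---|---|---|---|---|---|---|---
1 | 162 | 188 | 420 | 136 | 94 | 82 | 267
2 | 86 | 121 | 244 | 227 | 62 | 43 | 148
3 | 456 | 299 | 506 | 185 | 154 | 179 | 757
4 | 401 | 260 | 470 | 200 | 107 | 165 | 708
5 | 345 | 215 | 335 | 266 | 117 | 164 | 553
6 | 440 | 288 | 471 | 355 | 185 | 165 | 726
7 | 46 | 53 | 118 | 77 | 32 | 23 | 76
8 | 435 | 290 | 465 | 397 | 181 | 153 | 706
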
R10000 (250 MHz): -n32 -O3
R12000 (400 MHz): -n32 -O3
UltraSPARC-II (296MHz): -fast -xO4
Pentium-III (dual, 1 GHz): icl -DMSC -O2 -G7 flops.c
Pentium/eg: dual-Pentium-III, 866MHz, compiled with egcs 2.91.66, Flags:
-O3, run under Linux 2.2.14-SMP, 256k Cache, 133 MHz Bus
G3 (MHz):
R5000: same as R10000.
Note: the program is not threaded, so it doesn't make use of any extra CPUs or cores.
Jonathan Leto has made more benchmarks with the same program on different PCs.
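For orientation, a MFLOPs figure is the number of floating-point operations executed divided by the elapsed time in microseconds. The sketch below shows only that arithmetic; the helper name `mflops` and its parameters are hypothetical and not part of flops.c.

```c
#include <assert.h>

/* Hypothetical helper, not part of flops.c:
   flop_count = floating-point operations executed,
   seconds    = CPU time they took. */
static double mflops(double flop_count, double seconds)
{
    assert(seconds > 0.0);
    return flop_count / seconds / 1.0e6;
}
```

As a made-up example, 5000000 passes of a loop body containing 10 floating-point operations that finish in 0.25 s correspond to `mflops(5.0e7, 0.25)`, i.e. 200 MFLOPs.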
Dhrystones

Platform | Compiled | Dhrystones
---|---|---
R10000 | (1) | 834,724
R10000 | (2) | 2,049,180
UltraSPARC-II | (3) | 530,973
Pentium-II | (4) | 520,345
Pentium-II | (5) | 541,711
G3 | (6) | 1,000,000

R10000 250 MHz (IRIX 6.5):
(1) cc -O -n32 dhrystone.c
(2) cc -n32 -O3 -IPA -OPT:fast_sqrt=ON:alias=restrict:roundoff=3:fast_exp=ON:fast_io=ON -OPT:ptr_opt=ON:unroll_size=1000:unroll_times_max=8 dhrystone.c
UltraSPARC-II 296 MHz:
(3) cc -fast dhrystone.c
Pentium-II 233 MHz (BeOS):
(4) gcc -O3 -m486 dhrystone.c
(5) gcc -O3 -m486 -fomit-frame-pointer dhrystone.c
(6) G3 :
Pointer aliasing
The following table shows the performance impact of pointer aliasing
and of double dereferencing (via index arrays):
Timings in nanoseconds:

compile options | f1 | f2
---|---|---
-O2 -n32 | 950 | 590
-O3 -n32 -INLINE:must=f1:must=f2 | 760 | 340
-O3 -n32 -OPT:alias=restrict | 760 | 340
-O3 -n32 -OPT:alias=restrict -INLINE:must=f1:must=f2 | 530 | 220

for ( i = 0; i < 6; i ++ )
d[i] = c[i][0] * e[i] + c[i][1] * e[i] + c[i][2] * e[i];
while f1() is the same except with an index array:
for ( i = 0; i < 6; i ++ )
d[i] = c[i][0] * e[ Bb[i][0] ] + c[i][1] * e[ Bb[i][1] ] + c[i][2] * e[ Bb[i][2] ];
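To make the two loops concrete, here is a self-contained sketch of both functions. The signatures, the element types (`double`, `int`), the array sizes and the extra `e_len` parameter are my assumptions, so the original source may well differ. The C99 `restrict` qualifiers are the source-level way to promise the compiler that the arrays do not overlap, which is similar in spirit to what `-OPT:alias=restrict` asserts for the whole file.

```c
#include <assert.h>

enum { ROWS = 6, COLS = 3 };   /* assumed sizes, matching the loops above */

/* f2: direct indexing, as in the first loop above. */
static void f2(double *restrict d, double (*restrict c)[COLS],
               const double *restrict e)
{
    for (int i = 0; i < ROWS; i++)
        d[i] = c[i][0] * e[i] + c[i][1] * e[i] + c[i][2] * e[i];
}

/* f1: same computation, but e is read through the index array Bb.
   e_len is a hypothetical extra parameter used only for bounds checks;
   the checks disappear when compiled with -DNDEBUG. */
static void f1(double *restrict d, double (*restrict c)[COLS],
               const double *restrict e, int (*restrict Bb)[COLS],
               int e_len)
{
    for (int i = 0; i < ROWS; i++) {
        for (int k = 0; k < COLS; k++)
            assert(Bb[i][k] >= 0 && Bb[i][k] < e_len);
        d[i] = c[i][0] * e[Bb[i][0]]
             + c[i][1] * e[Bb[i][1]]
             + c[i][2] * e[Bb[i][2]];
    }
}
```

The extra load of `Bb[i][k]` before each access to `e` is the "double dereferencing" that the table measures. Without the no-overlap promise, the compiler has to assume that a store to `d[i]` might change `e`, `c` or `Bb`, which limits how far it can reorder or keep values in registers.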
If you would like to have a closer look at the source or try it for yourself, here it is!
C++ features
Timings in nanoseconds (user-time):

One operation | code snippet | R10000 (250MHz)
---|---|---
static cast | static_cast<Derived*>(base_ptr) | 0.8
dynamic cast (4 levels) | dynamic_cast<Derived4*>(base_ptr) | 1.0
typeid "by hand" | base_ptr->id() != Derived::Id() | 0.95
typeid/RTTI | typeid(*base_ptr)!=typeid(Derived) | 0.90
typeid.before | typeid(*base_ptr).before(typeid(Derived2)) | 0.80

Here is the source of the little benchmark program.