CPU Benchmarks




Contents

Float/Int operations
MFLOPS
Dhrystones
Influence of the cache
Pointer aliasing and double dereferencing
C++ features (RTTI)


Sometimes, when I'm trying to optimize a very tight inner loop, I start wondering how long the different CPU operations actually take. Is it faster to use a ?: expression or a call to fabsf (which is hopefully a macro)? Is it faster to divide by x several times, or to compute 1/x once and then multiply by it several times (see the sketch below)?
How many MFLOPS and Dhrystones does my CPU deliver, anyway?
What is the penalty if your data structures don't fit nicely into the cache?
What about pointer aliasing and double dereferencing?
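
To make the division question concrete, the two variants look roughly like this (a minimal sketch with made-up function and parameter names, not code from my benchmark program):

	#include <stddef.h>

	/* variant 1: divide by x in every iteration */
	void scale_div( float *b, const float *a, size_t n, float x )
	{
		for ( size_t i = 0; i < n; i ++ )
			b[i] = a[i] / x;
	}

	/* variant 2: one division up front, then only multiplications;
	 * note that a[i]*(1/x) is not guaranteed to be bit-identical to
	 * a[i]/x, which is why compilers do this transformation themselves
	 * only under "fast math" style options */
	void scale_recip( float *b, const float *a, size_t n, float x )
	{
		float r = 1.0f / x;
		for ( size_t i = 0; i < n; i ++ )
			b[i] = a[i] * r;
	}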

Basic Operations

The following table shows the time for some basic operations. Most of them are actually a quadruple of operations: array index calculation, fetch, operation, store. So the timings might be a little unfair to the CPU, but I hope they are more realistic, since most of the time such operations will be "embedded" in a similar "environment".
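
To make that concrete, a single table entry such as "multiplication" is measured roughly in an environment like this (a sketch only, with placeholder names and sizes, not the actual ifcomp.c):

	enum { N = 1000 };                /* placeholder array size */
	float f[N];

	float time_multiplication( void )
	{
		float acc = 1.0f;
		/* index calculation, fetch, multiply, and accumulate all happen
		 * inside the timed loop - together they form one table entry */
		for ( int i = 0; i < N; i ++ )
			acc *= f[i];          /* the snippet from the "multiplication" row */
		return acc;                   /* the result is used, so the loop survives optimization */
	}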
Well, here's what I found for an R12000 (400 MHz, IP32, 32 kB D-cache, 32 kB I-cache, 1 MB secondary cache), an UltraSPARC-II (296 MHz), a dual Pentium-III (1 GHz, 512 MB, 16+16 kB L1 cache, 256 kB L2 cache), and an Intel Core 2 Duo (2.2 GHz, 4 MB L2 cache, 800 MHz bus, MacBookPro 3,1, with the power supply plugged in):


nanoseconds (user-time); in each CPU column the three values are for float / double / int

operation          float/double snippet   integer snippet        R12000         Pentium-III      UltraSPARC-II     Intel Core 2 Duo
                                                                 (400 MHz)      (1 GHz, icc 7)   (296 MHz)         (2.2 GHz)
comparison         f[i]>0.5               a[i]!=0                16/24/16       26/44/5.5        30.0/60.0/60.0    4.2/4.5/2.5
fetch&store        f[i]=f2[i]             a[i]=a2[i]             11/22/10       ?/?/?            20.0/40.0/20.0    2.8/5.4/2.8
multiplication     acc*=f[i]              acc*=a[i]              11/15/24       5/9/5            40.0/50.0/160.0   2.2/2.8/1.4
square             acc+=f[i]*f[i]         acc+=a[i]*a[i]                                                           1.5/1.7/0.9
division           acc+=f[i]/f            acc+=a[i]/a            40/56/59       6/10/6           40.0/90.0/220.0   1.4/1.8/5.6
addition           acc+=f[i]              acc+=a[i]              17/29/14       5/11/6           20.0/50.0/20.0    1.4/1.7/0.9
sqrt               sqrt(f[i])             -                      94/105/-       ?/?/-            140.0/120.0/-     1.6/3.2/-
compl. expr.       f=f(..)                a=a(..)                21/21/16       31/25/8          30.0/50.0/160.0   1.7/1.4/1.5
abs                fabs[f](a[i])          abs(a[i])              47/63/11       ?/?/?            140.0/60.0/40.0   0.8/1.5/0.7
? :                f[i]>0?f[i]:-f[i]      a[i]>0?a[i]:-a[i]      21/30/11       33/50/22         40.0/60.0/40.0    0.8/1.5/0.7
makeabs            makeabs(f[i])          makeabs(a[i])                                                            2.1/4.2/2.1
function call      f(a[i])                f(a[i])                32/44/28       28/45/28                           ?/?/?
fct call thru ptr  fct_ptr(a[i])          fct_ptr(a[i])          104/107/107    53/67/53                           5.6/5.8/5.5
pow                pow(f[i],1.1)          -                                                                        40/140/-

Times are in nanoseconds, user time only.
Each time is averaged over 20,000,000 loop iterations (62,500,000 in the Intel Core 2 Duo tests).

R10000 (SGI): cc -O -n32 ifcomp.c -o ifcomp -lm (IRIX 6.5)
R12000: same as R10000.
Sun: cc -fast ifcomp.c -o ifcomp
Pentium-III: icc -O3 -march=pentiumiii -ip -rcd -unroll ifcomp.c -o ifcomp -lm (Intel compiler icc version 7.0)
Intel Core 2 Duo: gcc -O3 -funroll-loops -ffast-math -mfpmath=sse ifcomp.c -o ifcomp -lm

Of course, in all timings I tried to eliminate the CPU time caused by the loop construct itself and by any auxiliary operations.
And, of course, I tried to make sure that the compiler couldn't "optimize away" any of the code being measured.
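
The usual tricks for both are sketched below (this is not the actual ifcomp.c code; the constants and names are placeholders): time a reference loop that does (almost) nothing, subtract it from the measured loop, and route the result through a volatile variable so the compiler has to keep the work.

	#include <stdio.h>
	#include <time.h>

	enum { N = 1000, REPS = 20000 };       /* 20,000,000 iterations in total */
	float f[N];
	volatile float sink;                   /* volatile "sink" keeps results alive */

	int main( void )
	{
		float acc = 0.0f;
		clock_t t0, t1, t2;

		/* reference run: same loop structure, (almost) no work */
		t0 = clock();
		for ( int r = 0; r < REPS; r ++ )
			for ( int i = 0; i < N; i ++ )
				sink = (float) i;
		t1 = clock();

		/* measured run: loop overhead plus the operation of interest */
		for ( int r = 0; r < REPS; r ++ )
			for ( int i = 0; i < N; i ++ )
				acc += f[i];
		t2 = clock();

		sink = acc;                    /* use acc, so the addition loop can't be deleted */
		printf( "net time per operation: %g ns\n",
		        1e9 * (double)( (t2 - t1) - (t1 - t0) )
		            / CLOCKS_PER_SEC / ( (double) N * REPS ) );
		return 0;
	}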

A "-" (minus) in the table above usually means that the operation is not applicable for that type.
An empty cell means that I haven't timed that operation on that platform (this happened usually, when I introduced the test later).
A "?" means that the time reported by the program seemed to be bogus (maybe the compiler did still "optimize away" that part).

makeabs is defined as makeabs(X) = *((unsigned int *)(&X)) &= 0x7fffffff , i.e., it clears the sign bit of the float in place.
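
Written out, that is just masking off the IEEE-754 sign bit (a sketch; it assumes 32-bit unsigned int and IEEE-754 floats, and the pointer cast is the kind of type punning that newer compilers may object to under strict aliasing rules):

	#include <stdio.h>

	/* clear the sign bit of a float in place */
	#define makeabs(X)  ( *((unsigned int *)(&(X))) &= 0x7fffffff )

	int main( void )
	{
		float x = -3.25f;
		makeabs( x );
		printf( "%f\n", x );    /* prints 3.250000 */
		return 0;
	}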

Here is the C code I have used. It actually tests a few more operations than shown in the above table.

MFLOPS

The following table shows floating-point performance in terms of MFLOPS.
The test program is organized in several modules, each exercising a different mix of floating-point operations.
MFLOPS
Module   R10000   UltraSPARC-II   Pentium-III   Pentium/eg   G3    R5000   R12000
   1       162         188            420          136        94     82      267
   2        86         121            244          227        62     43      148
   3       456         299            506          185       154    179      757
   4       401         260            470          200       107    165      708
   5       345         215            335          266       117    164      553
   6       440         288            471          355       185    165      726
   7        46          53            118           77        32     23       76
   8       435         290            465          397       181    153      706

Compile options:
R10000 (250 MHz): -n32 -O3
R12000 (400 MHz): -n32 -O3
UltraSPARC-II (296 MHz): -fast -xO4
Pentium-III (dual, 1 GHz): icl -DMSC -O2 -G7 flops.c
Pentium/eg: dual Pentium-III, 866 MHz, compiled with egcs 2.91.66, flags: -O3, run under Linux 2.2.14-SMP, 256 kB cache, 133 MHz bus
G3 (MHz):
R5000: same as R10000.
Note: the program is not threaded, so it doesn't make use of any extra CPUs or cores.
Jonathan Leto has made more benchmarks with the same program on different PCs.

The benchmark program I used for the table above is called flops.c, written by Al Aburto (version 2.0, Dec 1992). If you want to try the benchmark on your CPU, here's the source code. I can't remember where I got it from ...
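
If you just want the general idea rather than the full flops.c, the principle is simply to execute a known number of floating-point operations and divide by the measured user time; here is a minimal sketch (not Al Aburto's code, and the operation mix is made up):

	#include <stdio.h>
	#include <time.h>

	int main( void )
	{
		enum { N = 10000000 };
		double a = 1.0, b = 1.000001, c = 0.999999;

		clock_t t0 = clock();
		for ( long i = 0; i < N; i ++ )
			a = a * b + a * c - a;            /* 4 FLOPs per iteration */
		clock_t t1 = clock();

		double seconds = (double)( t1 - t0 ) / CLOCKS_PER_SEC;
		printf( "approx. %.1f MFLOPS  (a = %g)\n",
		        4.0 * N / seconds / 1e6, a ); /* print a, so the loop isn't optimized away */
		return 0;
	}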

Dhrystones

CPU             Compiled   Dhrystones
R10000          (1)          834,724
R10000          (2)        2,049,180
UltraSPARC-II   (3)          530,973
Pentium-II      (4)          520,345
Pentium-II      (5)          541,711
G3              (6)        1,000,000

The program was compiled to do 5,000,000 passes.
R10000 250 MHz (IRIX 6.5):
(1) cc -O -n32 dhrystone.c
(2) cc -n32 -O3 -IPA -OPT:fast_sqrt=ON:alias=restrict:roundoff=3:fast_exp=ON:fast_io=ON -OPT:ptr_opt=ON:unroll_size=1000:unroll_times_max=8 dhrystone.c
UltraSPARC-II 296 MHz:
(3) cc -fast dhrystone.c
Pentium-II 233 MHz (BeOS):
(4) gcc -O3 -m486 dhrystone.c
(5) gcc -O3 -m486 -fomit-frame-pointer dhrystone.c (BeOS)
(6) G3:

Here's the source of the benchmark program (I think I got it from wuarchive).

Cache performance

The following graphs show what happens when you operate with tight loops on arrays as the arrays get larger. In this case, the operation is just an assignment of array elements.
The first experiment generates random, uniformly distributed indices i in the range [0,n-1], where n, the size of the array, is a power of 2. The operation was a[i] = a[i+1], where a is an array of floats.
The second experiment just runs sequentially through the array several times and does the same assignment as in the first experiment.
The graphs show the average time for one float load/store with respect to the array's size in bytes on different CPUs:
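
A sketch of the two access patterns (not the exact test code; the function names are made up, rand() stands in for whatever random number generator was actually used, and I wrap the i+1 access here so the sketch stays within bounds):

	#include <stdlib.h>

	/* experiment 1: random, uniformly distributed indices in [0, n-1];
	 * n must be a power of two, so the bit mask gives uniform indices */
	void random_access( float *a, unsigned n, long loops )
	{
		for ( long k = 0; k < loops; k ++ )
		{
			unsigned i = (unsigned) rand() & (n - 1);
			a[i] = a[ (i + 1) & (n - 1) ];
		}
	}

	/* experiment 2: several sequential sweeps over the same array */
	void sequential_access( float *a, unsigned n, long loops )
	{
		for ( long k = 0; k < loops; k ++ )
			for ( unsigned i = 0; i + 1 < n; i ++ )
				a[i] = a[i+1];
	}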

Obviously, a number of questions arise. Also, I believe that the sequential cache test failed on the Intel Core 2 Duo, because there seem to be no cache effects at all, which is implausible. I suspect that the compiler (gcc) optimized something away, or that it was smart enough to do prefetching (even though I used only -O as the optimization option). If you want to look at the C code I have used, here it is. For any comments, suggestions, results, or answers to my questions, please send me an email: zach tu-clausthal.de !

Pointer aliasing

The following table shows the performance impact of pointer aliasing and of double dereferencing (via index arrays):

nanoseconds
compile options                                         f1     f2
-O2 -n32                                                950    590
-O3 -n32 -INLINE:must=f1:must=f2                        760    340
-O3 -n32 -OPT:alias=restrict                            760    340
-O3 -n32 -OPT:alias=restrict -INLINE:must=f1:must=f2    530    220

The function f2() is simply

	for ( i = 0; i < 6; i ++ )
		d[i] = c[i][0] * e[i] + c[i][1] * e[i] + c[i][2] * e[i];
while f1() is the same except with an index array:
	for ( i = 0; i < 6; i ++ )
		d[i] = c[i][0] * e[ Bb[i][0] ] + c[i][1] * e[ Bb[i][1] ] + c[i][2] * e[ Bb[i][2] ];
If you would like to have a closer look at the source or try it for yourself, here it is!
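
By the way, -OPT:alias=restrict tells the compiler to assume that distinct pointers do not alias. In standard C99 the same promise can be made per function with the restrict qualifier, roughly like this (a sketch of f2 with assumed parameter types, not the original source):

	/* restrict promises that d, c, and e point to non-overlapping memory,
	 * so the compiler may keep e[i] and c[i][..] in registers instead of
	 * reloading them after every store to d[i] */
	void f2( float * restrict d,
	         const float (* restrict c)[3],
	         const float * restrict e )
	{
		for ( int i = 0; i < 6; i ++ )
			d[i] = c[i][0] * e[i] + c[i][1] * e[i] + c[i][2] * e[i];
	}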

C++ features

nanoseconds (user-time)
One operation             code snippet                                   R10000 (250 MHz)
static cast               static_cast<Derived*>(base_ptr)                0.8
dynamic cast (4 levels)   dynamic_cast<Derived4*>(base_ptr)              1.0
typeid "by hand"          base_ptr->id() != Derived::Id()                0.95
typeid/RTTI               typeid(*base_ptr) != typeid(Derived)           0.90
typeid.before             typeid(*base_ptr).before(typeid(Derived2))     0.80

Here is the source of the little benchmark program.
Gabriel Zachmann
Last modified: Sun Nov 23 15:46:34 MET 2008