CPU Benchmarks




Contents

Float/Int operations
MFLOPS
Dhrystones
Influence of the cache
Pointer aliasing and double dereferencing
C++ features (RTTI)


Sometimes, when I'm trying to optimize a very tight inner loop, I start wondering how long the different CPU operations actually take. Is it faster to use a ?: expression or a call to fabsf (which is hopefully a macro)? Is it faster to divide by x several times, or to compute 1/x once and then multiply by it several times (see the sketch below)?
How many MFLOPS and Dhrystones does my CPU deliver, anyway?
What is the penalty if your data structures don't fit nicely into the cache?
What about pointer aliasing and double dereferencing?
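
To make the division question concrete, the two variants look roughly like this (a minimal sketch with made-up function and parameter names, not code from my benchmark program):

	#include <stddef.h>

	/* variant 1: divide by x in every iteration */
	void scale_div( float *b, const float *a, size_t n, float x )
	{
		for ( size_t i = 0; i < n; i ++ )
			b[i] = a[i] / x;
	}

	/* variant 2: one division up front, then only multiplications;
	 * note that a[i]*(1/x) is not guaranteed to be bit-identical to
	 * a[i]/x, which is why compilers do this transformation themselves
	 * only under "fast math" style options */
	void scale_recip( float *b, const float *a, size_t n, float x )
	{
		float r = 1.0f / x;
		for ( size_t i = 0; i < n; i ++ )
			b[i] = a[i] * r;
	}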

Basic Operations

The following table shows the time for some basic operations. Most of them are actually a quadruple of operations: array index calculation, fetch, operation, store. So the timings might be a little unfair to the CPU, but I hope they are more realistic, since most of the time such operations will be "embedded" in a similar "environment".
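
To make that concrete, a single table entry such as "multiplication" is measured roughly in an environment like this (a sketch only, with placeholder names and sizes, not the actual ifcomp.c):

	enum { N = 1000 };                /* placeholder array size */
	float f[N];

	float time_multiplication( void )
	{
		float acc = 1.0f;
		/* index calculation, fetch, multiply, and accumulate all happen
		 * inside the timed loop - together they form one table entry */
		for ( int i = 0; i < N; i ++ )
			acc *= f[i];          /* the snippet from the "multiplication" row */
		return acc;                   /* the result is used, so the loop survives optimization */
	}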
Well, here's what I found for an R12000 (400 MHz, IP32, 32 kB D-cache, 32 kB I-cache, 1 MB secondary cache), an UltraSPARC-II (296 MHz), a dual Pentium-III (1 GHz, 512 MB, 16+16 kB L1 cache, 256 kB L2 cache), and an Intel Core 2 Duo (2.2 GHz, 4 MB L2 cache, 800 MHz bus, MacBookPro 3,1, with the power supply plugged in):


nanoseconds (user-time); in each CPU column the three values are for float / double / int

operation          float/double snippet   integer snippet        R12000         Pentium-III      UltraSPARC-II     Intel Core 2 Duo
                                                                 (400 MHz)      (1 GHz, icc 7)   (296 MHz)         (2.2 GHz)
comparison         f[i]>0.5               a[i]!=0                16/24/16       26/44/5.5        30.0/60.0/60.0    4.2/4.5/2.5
fetch&store        f[i]=f2[i]             a[i]=a2[i]             11/22/10       ?/?/?            20.0/40.0/20.0    2.8/5.4/2.8
multiplication     acc*=f[i]              acc*=a[i]              11/15/24       5/9/5            40.0/50.0/160.0   2.2/2.8/1.4
square             acc+=f[i]*f[i]         acc+=a[i]*a[i]                                                           1.5/1.7/0.9
division           acc+=f[i]/f            acc+=a[i]/a            40/56/59       6/10/6           40.0/90.0/220.0   1.4/1.8/5.6
addition           acc+=f[i]              acc+=a[i]              17/29/14       5/11/6           20.0/50.0/20.0    1.4/1.7/0.9
sqrt               sqrt(f[i])             -                      94/105/-       ?/?/-            140.0/120.0/-     1.6/3.2/-
compl. expr.       f=f(..)                a=a(..)                21/21/16       31/25/8          30.0/50.0/160.0   1.7/1.4/1.5
abs                fabs[f](a[i])          abs(a[i])              47/63/11       ?/?/?            140.0/60.0/40.0   0.8/1.5/0.7
? :                f[i]>0?f[i]:-f[i]      a[i]>0?a[i]:-a[i]      21/30/11       33/50/22         40.0/60.0/40.0    0.8/1.5/0.7
makeabs            makeabs(f[i])          makeabs(a[i])                                                            2.1/4.2/2.1
function call      f(a[i])                f(a[i])                32/44/28       28/45/28                           ?/?/?
fct call thru ptr  fct_ptr(a[i])          fct_ptr(a[i])          104/107/107    53/67/53                           5.6/5.8/5.5
pow                pow(f[i],1.1)          -                                                                        40/140/-

Times are in nanoseconds, user time only.
Each time is averaged over 20,000,000 loop iterations (62,500,000 in the Intel Core 2 Duo tests).

R10000 (SGI): cc -O -n32 ifcomp.c -o ifcomp -lm (IRIX 6.5)
R12000: same as R10000.
Sun: cc -fast ifcomp.c -o ifcomp
Pentium-III: icc -O3 -march=pentiumiii -ip -rcd -unroll ifcomp.c -o ifcomp -lm (Intel compiler icc version 7.0)
Intel Core 2 Duo: gcc -O3 -funroll-loops -ffast-math -mfpmath=sse ifcomp.c -o ifcomp -lm

Of course, in all timings I tried to eliminate the CPU time caused by the loop construct itself and by any auxiliary operations.
And, of course, I tried to make sure that the compiler couldn't "optimize away" any of the code being measured.
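
The usual tricks for both are sketched below (this is not the actual ifcomp.c code; the constants and names are placeholders): time a reference loop that does (almost) nothing, subtract it from the measured loop, and route the result through a volatile variable so the compiler has to keep the work.

	#include <stdio.h>
	#include <time.h>

	enum { N = 1000, REPS = 20000 };       /* 20,000,000 iterations in total */
	float f[N];
	volatile float sink;                   /* volatile "sink" keeps results alive */

	int main( void )
	{
		float acc = 0.0f;
		clock_t t0, t1, t2;

		/* reference run: same loop structure, (almost) no work */
		t0 = clock();
		for ( int r = 0; r < REPS; r ++ )
			for ( int i = 0; i < N; i ++ )
				sink = (float) i;
		t1 = clock();

		/* measured run: loop overhead plus the operation of interest */
		for ( int r = 0; r < REPS; r ++ )
			for ( int i = 0; i < N; i ++ )
				acc += f[i];
		t2 = clock();

		sink = acc;                    /* use acc, so the addition loop can't be deleted */
		printf( "net time per operation: %g ns\n",
		        1e9 * (double)( (t2 - t1) - (t1 - t0) )
		            / CLOCKS_PER_SEC / ( (double) N * REPS ) );
		return 0;
	}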

A "-" (minus) in the table above usually means that the operation is not applicable for that type.
An empty cell means that I haven't timed that operation on that platform (this happened usually, when I introduced the test later).
A "?" means that the time reported by the program seemed to be bogus (maybe the compiler did still "optimize away" that part).

makeabs is defined as makeabs(X) = *((unsigned int *)(&X)) &= 0x7fffffff , i.e., it clears the sign bit of the float in place.
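
Written out, that is just masking off the IEEE-754 sign bit (a sketch; it assumes 32-bit unsigned int and IEEE-754 floats, and the pointer cast is the kind of type punning that newer compilers may object to under strict aliasing rules):

	#include <stdio.h>

	/* clear the sign bit of a float in place */
	#define makeabs(X)  ( *((unsigned int *)(&(X))) &= 0x7fffffff )

	int main( void )
	{
		float x = -3.25f;
		makeabs( x );
		printf( "%f\n", x );    /* prints 3.250000 */
		return 0;
	}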

Here is the C code I have used. It actually tests a few more operations than shown in the above table.

MFLOPS

The following table shows floating-point performance in terms of MFLOPS.
The test program is organized in several modules, each exercising a different mix of floating-point operations.
MFLOPS
Module   R10000   UltraSPARC-II   Pentium-III   Pentium/eg   G3    R5000   R12000
   1       162         188            420          136        94     82      267
   2        86         121            244          227        62     43      148
   3       456         299            506          185       154    179      757
   4       401         260            470          200       107    165      708
   5       345         215            335          266       117    164      553
   6       440         288            471          355       185    165      726
   7        46          53            118           77        32     23       76
   8       435         290            465          397       181    153      706

Compile options:
R10000 (250 MHz): -n32 -O3
R12000 (400 MHz): -n32 -O3
UltraSPARC-II (296 MHz): -fast -xO4
Pentium-III (dual, 1 GHz): icl -DMSC -O2 -G7 flops.c
Pentium/eg: dual Pentium-III, 866 MHz, compiled with egcs 2.91.66, flags: -O3, run under Linux 2.2.14-SMP, 256 kB cache, 133 MHz bus
G3 (MHz):
R5000: same as R10000.
Note: the program is not threaded, so it doesn't make use of any extra CPUs or cores.
Jonathan Leto has made more benchmarks with the same program on different PCs.

The benchmark program I used for the table above is called flops.c, written by Al Aburto (version 2.0, Dec 1992). If you want to try the benchmark on your CPU, here's the source code. I can't remember where I got it from ...
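
If you just want the general idea rather than the full flops.c, the principle is simply to execute a known number of floating-point operations and divide by the measured user time; here is a minimal sketch (not Al Aburto's code, and the operation mix is made up):

	#include <stdio.h>
	#include <time.h>

	int main( void )
	{
		enum { N = 10000000 };
		double a = 1.0, b = 1.000001, c = 0.999999;

		clock_t t0 = clock();
		for ( long i = 0; i < N; i ++ )
			a = a * b + a * c - a;            /* 4 FLOPs per iteration */
		clock_t t1 = clock();

		double seconds = (double)( t1 - t0 ) / CLOCKS_PER_SEC;
		printf( "approx. %.1f MFLOPS  (a = %g)\n",
		        4.0 * N / seconds / 1e6, a ); /* print a, so the loop isn't optimized away */
		return 0;
	}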

Dhrystones

CPU             Compiled   Dhrystones
R10000          (1)          834,724
R10000          (2)        2,049,180
UltraSPARC-II   (3)          530,973
Pentium-II      (4)          520,345
Pentium-II      (5)          541,711
G3              (6)        1,000,000

The program was compiled to do 5,000,000 passes.
R10000 250 MHz (IRIX 6.5):
(1) cc -O -n32 dhrystone.c
(2) cc -n32 -O3 -IPA -OPT:fast_sqrt=ON:alias=restrict:roundoff=3:fast_exp=ON:fast_io=ON -OPT:ptr_opt=ON:unroll_size=1000:unroll_times_max=8 dhrystone.c
UltraSPARC-II 296 MHz:
(3) cc -fast dhrystone.c
Pentium-II 233 MHz (BeOS):
(4) gcc -O3 -m486 dhrystone.c
(5) gcc -O3 -m486 -fomit-frame-pointer dhrystone.c (BeOS)
(6) G3:

Here's the source of the benchmark program (I think I got it from wuarchive).

Cache performance

The following graphs show what happens when you operate with tight loops on arrays as the arrays get larger. In this case, the operation is just an assignment of array elements.
The first experiment generates random, uniformly distributed indices i in the range [0,n-1], where n, the size of the array, is a power of 2. The operation was a[i] = a[i+1], where a is an array of floats.
The second experiment just runs sequentially through the array several times and does the same assignment as in the first experiment.
The graphs show the average time for one float load/store with respect to the array's size in bytes on different CPUs:
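
A sketch of the two access patterns (not the exact test code; the function names are made up, rand() stands in for whatever random number generator was actually used, and I wrap the i+1 access here so the sketch stays within bounds):

	#include <stdlib.h>

	/* experiment 1: random, uniformly distributed indices in [0, n-1];
	 * n must be a power of two, so the bit mask gives uniform indices */
	void random_access( float *a, unsigned n, long loops )
	{
		for ( long k = 0; k < loops; k ++ )
		{
			unsigned i = (unsigned) rand() & (n - 1);
			a[i] = a[ (i + 1) & (n - 1) ];
		}
	}

	/* experiment 2: several sequential sweeps over the same array */
	void sequential_access( float *a, unsigned n, long loops )
	{
		for ( long k = 0; k < loops; k ++ )
			for ( unsigned i = 0; i + 1 < n; i ++ )
				a[i] = a[i+1];
	}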

Obviously, a number of questions arise. Also, I believe that the sequential cache test failed on the Intel Core 2 Duo, because there seem to be no cache effects at all, which is implausible. I suspect that the compiler (gcc) optimized something away, or that it was smart enough to do prefetching (even though I used only -O as the optimization option). If you want to look at the C code I have used, here it is. For any comments, suggestions, results, or answers to my questions, please send me an email: zach tu-clausthal.de !

Pointer aliasing

The following table shows the performance impact of pointer aliasing and of double dereferencing (via index arrays):

nanoseconds
compile options                                         f1     f2
-O2 -n32                                                950    590
-O3 -n32 -INLINE:must=f1:must=f2                        760    340
-O3 -n32 -OPT:alias=restrict                            760    340
-O3 -n32 -OPT:alias=restrict -INLINE:must=f1:must=f2    530    220

The function f2() is simply

	for ( i = 0; i < 6; i ++ )
		d[i] = c[i][0] * e[i] + c[i][1] * e[i] + c[i][2] * e[i];
while f1() is the same except with an index array:
	for ( i = 0; i < 6; i ++ )
		d[i] = c[i][0] * e[ Bb[i][0] ] + c[i][1] * e[ Bb[i][1] ] + c[i][2] * e[ Bb[i][2] ];
If you would like to have a closer look at the source or try it for yourself, here it is!
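
By the way, -OPT:alias=restrict tells the compiler to assume that distinct pointers do not alias. In standard C99 the same promise can be made per function with the restrict qualifier, roughly like this (a sketch of f2 with assumed parameter types, not the original source):

	/* restrict promises that d, c, and e point to non-overlapping memory,
	 * so the compiler may keep e[i] and c[i][..] in registers instead of
	 * reloading them after every store to d[i] */
	void f2( float * restrict d,
	         const float (* restrict c)[3],
	         const float * restrict e )
	{
		for ( int i = 0; i < 6; i ++ )
			d[i] = c[i][0] * e[i] + c[i][1] * e[i] + c[i][2] * e[i];
	}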

C++ features

nanoseconds (user-time)
One operation             code snippet                                   R10000 (250 MHz)
static cast               static_cast<Derived*>(base_ptr)                0.8
dynamic cast (4 levels)   dynamic_cast<Derived4*>(base_ptr)              1.0
typeid "by hand"          base_ptr->id() != Derived::Id()                0.95
typeid/RTTI               typeid(*base_ptr) != typeid(Derived)           0.90
typeid.before             typeid(*base_ptr).before(typeid(Derived2))     0.80

Here is the source of the little benchmark program.
Gabriel Zachmann
Last modified: Sun Nov 23 15:46:34 MET 2008